A Hybrid Approach of Long Short Term Memory and Transformer Models for Speech Emotion Recognition

  • Tarik AbuAin, College of Computing and Informatics, Saudi Electronic University, Riyadh 11673, Saudi Arabia
Keywords: Speech Recognition, Emotion Recognition, Sentiment Analysis, LSTM Model, Transformer Model

Abstract

Speech emotion recognition (SER) has become a critical component of next-generation technologies for human–machine interaction. In this paper, we explore the advantage of a hybrid LSTM + Transformer model over standalone LSTM and Transformer models. The proposed method consists of the following steps: first, data are loaded from benchmark datasets, namely the Toronto Emotional Speech Set (TESS), the Berlin Emotional Speech Database (EMO-DB), and the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset; second, Mel-Frequency Cepstral Coefficients (MFCCs) are extracted to turn the raw audio into a meaningful representation; third, the model architecture is designed and explained. Finally, we evaluate the models using precision, recall, F1 score, classification reports, and confusion matrices. Based on these classification reports and confusion matrices, the hybrid LSTM + Transformer model shows remarkable performance on TESS, surpassing the other models with 99.64% accuracy, while the LSTM model achieved 97.50% and the Transformer model 98.21%. On EMO-DB, the LSTM model achieved the highest accuracy of 73.83%, followed by the hybrid model with 71.96% and the Transformer model with 70.09%. Lastly, the LSTM model obtained the highest accuracy on SAVEE at 65.62%, followed by the Transformer model with 58.33% and the hybrid model with 56.25%.
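The MFCC preprocessing step described in the abstract can be sketched as follows. This is a minimal, self-contained NumPy implementation for illustration only; the paper does not specify its extraction parameters, so the frame length, hop size, FFT size, and filter counts below are common defaults chosen here as assumptions, and in practice a dedicated audio library would typically be used instead.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filterbank over the positive FFT bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edges equally spaced on the mel scale, then mapped back to Hz bins
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling slope
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mfcc(signal, sr=16000, n_mfcc=13, frame_len=400, hop=160,
         n_fft=512, n_filters=26):
    """Return an (n_frames, n_mfcc) MFCC matrix for a mono signal.

    Parameter values are illustrative assumptions (25 ms frames with a
    10 ms hop at 16 kHz), not settings taken from the paper.
    """
    # Split the waveform into overlapping windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log mel-filterbank energies (floored to avoid log(0))
    energies = np.maximum(spec @ mel_filterbank(n_filters, n_fft, sr).T, 1e-10)
    log_e = np.log(energies)
    # DCT-II to decorrelate; keep the first n_mfcc coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc),
                                    (2 * n + 1) / (2 * n_filters)))
    return log_e @ basis.T

# One second of a 440 Hz tone yields a (frames x coefficients) feature matrix
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(tone)
```

Such per-frame MFCC matrices form the input sequences that the LSTM, Transformer, and hybrid models consume.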
Published
2025-05-01
How to Cite
AbuAin, T. (2025). A Hybrid Approach of Long Short Term Memory and Transformer Models for Speech Emotion Recognition. Statistics, Optimization & Information Computing, 14(1), 340-351. https://doi.org/10.19139/soic-2310-5070-2521
Section
Research Articles