A Hybrid Approach of Long Short Term Memory and Transformer Models for Speech Emotion Recognition

  • Tarik AbuAin, College of Computing and Informatics, Saudi Electronic University, Riyadh 11673, Saudi Arabia
Keywords: Speech Recognition, Emotion Recognition, Sentiment Analysis, LSTM Model, Transformer Model

Abstract

Speech emotion recognition (SER) has become a critical component of next-generation technologies for human–machine interaction. In this paper, we explore the advantage of a hybrid LSTM + Transformer model over standalone LSTM and Transformer models. The proposed method consists of the following steps: first, data are loaded from benchmark datasets, namely the Toronto Emotional Speech Set (TESS), the Berlin Emotional Speech Database (EMO-DB), and the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset; second, Mel-Frequency Cepstral Coefficients (MFCCs) are extracted to turn the raw audio into a meaningful representation; third, the model architecture is designed and explained. Finally, we evaluate the models using precision, recall, F1 score, classification reports, and confusion matrices. Based on these classification reports and confusion matrices, the hybrid LSTM + Transformer model shows remarkable performance on TESS, surpassing the other models with 99.64% accuracy, while the LSTM model achieved 97.50% and the Transformer model 98.21%. On EMO-DB, the LSTM model achieved the highest accuracy of 73.83%, followed by the hybrid model with 71.96% and the Transformer model with 70.09%. Lastly, the LSTM model obtained the highest accuracy on SAVEE at 65.62%, followed by the Transformer model with 58.33% and the hybrid model with 56.25%.
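The MFCC preprocessing step described in the abstract can be sketched as follows. This is a minimal, self-contained NumPy implementation for illustration only; the paper does not specify its extraction parameters, so the frame length, hop size, FFT size, and filter counts below are common defaults chosen here as assumptions, and in practice a dedicated audio library would typically be used instead.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filterbank over the positive FFT bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edges equally spaced on the mel scale, then mapped back to Hz bins
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling slope
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mfcc(signal, sr=16000, n_mfcc=13, frame_len=400, hop=160,
         n_fft=512, n_filters=26):
    """Return an (n_frames, n_mfcc) MFCC matrix for a mono signal.

    Parameter values are illustrative assumptions (25 ms frames with a
    10 ms hop at 16 kHz), not settings taken from the paper.
    """
    # Split the waveform into overlapping windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log mel-filterbank energies (floored to avoid log(0))
    energies = np.maximum(spec @ mel_filterbank(n_filters, n_fft, sr).T, 1e-10)
    log_e = np.log(energies)
    # DCT-II to decorrelate; keep the first n_mfcc coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc),
                                    (2 * n + 1) / (2 * n_filters)))
    return log_e @ basis.T

# One second of a 440 Hz tone yields a (frames x coefficients) feature matrix
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(tone)
```

Such per-frame MFCC matrices form the input sequences that the LSTM, Transformer, and hybrid models consume.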
Published
2025-05-01
How to Cite
AbuAin, T. (2025). A Hybrid Approach of Long Short Term Memory and Transformer Models for Speech Emotion Recognition. Statistics, Optimization & Information Computing, 14(1), 340-351. https://doi.org/10.19139/soic-2310-5070-2521
Section
Research Articles