Efficient bimodal emotion recognition system based on speech/text embeddings and ensemble learning fusion

Emotion recognition (ER) is a pivotal discipline in contemporary human-machine interaction. Its primary objective is to develop theories, systems, and methodologies that can effectively recognize and interpret human emotions. This research investigates both unimodal and bimodal strategies for ER using advanced feature embeddings for audio and text data. We leverage pretrained models, namely ImageBind for speech and RoBERTa for text, alongside traditional TF-IDF embeddings for text, to recognize emotional states accurately. A variety of machine learning (ML) and deep learning (DL) algorithms were implemented and evaluated in speaker-dependent (SD) and speaker-independent (SI) scenarios. Additionally, three fusion methods (early fusion, majority voting fusion, and stacking ensemble fusion) were employed for the bimodal emotion recognition (BER) task. Extensive numerical simulations were conducted to systematically address the complexities and challenges associated with both unimodal and bimodal ER. On the IEMOCAP database, the proposed BER system reaches its best accuracy of 86.75% in the SD scenario and 64.04% in the SI scenario.
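To make the three fusion strategies named in the abstract concrete, the following is a minimal sketch under loud assumptions: the embeddings are synthetic stand-ins (not ImageBind, RoBERTa, or TF-IDF features), the classifier is a toy nearest-centroid model (not one of the ML/DL algorithms evaluated by the authors), and the stacking step uses a simple held-out split rather than whatever protocol the paper follows. The label set and all function names are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

LABELS = ["angry", "neutral", "sad"]  # illustrative classes only (assumption)

def fit_centroids(X, y):
    """Toy nearest-centroid classifier: map each label to its mean vector."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def distances(centroids, row):
    """Squared distance from one sample to each class centroid, in LABELS order."""
    return [sum((a - b) ** 2 for a, b in zip(centroids[label], row)) for label in LABELS]

def predict(centroids, X):
    """Assign each sample the label of its closest centroid."""
    out = []
    for row in X:
        d = distances(centroids, row)
        out.append(LABELS[d.index(min(d))])
    return out

def early_fusion(speech, text):
    """Feature-level fusion: concatenate the two embeddings of each sample."""
    return [s + t for s, t in zip(speech, text)]

def majority_vote(*prediction_lists):
    """Decision-level fusion: most frequent label per sample (ties go to the earliest voter)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*prediction_lists)]

def stacking_features(speech_model, text_model, speech, text):
    """Meta-features for stacking: each base model's distances to every class."""
    return [distances(speech_model, s) + distances(text_model, t)
            for s, t in zip(speech, text)]

def make_data(n_per_class, rng):
    """Synthetic stand-ins for speech (3-d) and text (2-d) embeddings."""
    speech_means = {"angry": [3, 0, 0], "neutral": [0, 3, 0], "sad": [0, 0, 3]}
    text_means = {"angry": [3, 0], "neutral": [0, 3], "sad": [-3, -3]}
    speech, text, y = [], [], []
    for label in LABELS:
        for _ in range(n_per_class):
            speech.append([m + rng.gauss(0, 0.5) for m in speech_means[label]])
            text.append([m + rng.gauss(0, 0.5) for m in text_means[label]])
            y.append(label)
    return speech, text, y

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

rng = random.Random(0)
sp_base, tx_base, y_base = make_data(30, rng)  # trains the base models
sp_meta, tx_meta, y_meta = make_data(30, rng)  # held out to train the meta-model
sp_test, tx_test, y_test = make_data(30, rng)

speech_model = fit_centroids(sp_base, y_base)
text_model = fit_centroids(tx_base, y_base)
early_model = fit_centroids(early_fusion(sp_base, tx_base), y_base)

# 1) Early fusion: one model on the concatenated features.
early_pred = predict(early_model, early_fusion(sp_test, tx_test))

# 2) Majority voting: three voters (speech, text, fused), an illustrative choice.
vote_pred = majority_vote(
    predict(speech_model, sp_test),
    predict(text_model, tx_test),
    early_pred,
)

# 3) Stacking (held-out variant): a meta-model learns from base-model outputs.
meta_model = fit_centroids(
    stacking_features(speech_model, text_model, sp_meta, tx_meta), y_meta
)
stack_pred = predict(
    meta_model, stacking_features(speech_model, text_model, sp_test, tx_test)
)

for name, pred in [("early", early_pred), ("voting", vote_pred), ("stacking", stack_pred)]:
    print(f"{name:8s} accuracy = {accuracy(pred, y_test):.2f}")
```

The sketch only contrasts where the fusion happens: on the features (early fusion), on the predicted labels (majority voting), or in a second-level model fed by the base models' outputs (stacking). The accuracies it prints reflect the easy synthetic data and say nothing about the IEMOCAP results reported above.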

Speech emotion recognition, Text emotion recognition, Bimodal emotion recognition, ImageBind, RoBERTa, TF-IDF, Machine learning, Deep learning

Adil Chakhtouna, Sara Sekkate, Abdellah Adib

Team Data Science & Artificial Intelligence, Laboratory of Mathematics, Computer Science and Applications (LMCSA), Faculty of Sciences and Technologies, Hassan II University, Mohammedia, Morocco

Higher National School of Arts and Crafts, Hassan II University, Casablanca, Morocco

2025

Annals of Telecommunications

ISSN: 0003-4347
Year, volume (issue): 2025, 80(5/6)