Efficient bimodal emotion recognition system based on speech/text embeddings and ensemble learning fusion
Emotion recognition (ER) is a pivotal discipline in contemporary human-machine interaction. Its primary objective is to explore and advance theories, systems, and methodologies that can effectively recognize, comprehend, and interpret human emotions. This research investigates both unimodal and bimodal strategies for ER using advanced feature embeddings for audio and text data. We leverage pretrained models, namely ImageBind for speech and RoBERTa for text, alongside traditional TF-IDF text embeddings, to achieve accurate recognition of emotional states. A variety of machine learning (ML) and deep learning (DL) algorithms were implemented to evaluate their performance in speaker-dependent (SD) and speaker-independent (SI) scenarios. Additionally, three fusion methods (early fusion, majority voting fusion, and stacking ensemble fusion) were employed for the bimodal emotion recognition (BER) task. Extensive numerical simulations were conducted to systematically address the complexities and challenges associated with both unimodal and bimodal ER. Our best results demonstrate an accuracy of 86.75% in the SD scenario and 64.04% in the SI scenario on the IEMOCAP database for the proposed BER system.
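The three fusion strategies the abstract names can be sketched as follows; this is a minimal illustration of the general techniques, not the paper's implementation, and the feature dimensions, emotion labels, and probability vectors are made-up placeholders.

```python
# Hedged sketch of three common bimodal fusion strategies:
# early fusion, majority voting fusion, and stacking ensemble fusion.
# All inputs below are illustrative placeholders, not the paper's data.
from collections import Counter

def early_fusion(audio_vec, text_vec):
    """Early fusion: concatenate per-modality embeddings into one
    joint feature vector before any classifier sees them."""
    return audio_vec + text_vec  # list concatenation

def majority_voting(predictions):
    """Majority voting fusion: each base classifier emits a label;
    the most frequent label wins (ties resolved by first occurrence)."""
    return Counter(predictions).most_common(1)[0][0]

def stacking_features(audio_probs, text_probs):
    """Stacking ensemble fusion: the base models' class-probability
    outputs are concatenated and fed to a second-level meta-learner
    (the meta-learner itself is omitted here)."""
    return audio_probs + text_probs

# Toy usage with hypothetical embeddings and emotion labels
audio_vec, text_vec = [0.1, 0.5, 0.2], [0.7, 0.3]
joint = early_fusion(audio_vec, text_vec)            # 5-dim joint vector
label = majority_voting(["happy", "sad", "happy"])   # -> "happy"
meta_input = stacking_features([0.8, 0.1, 0.1], [0.6, 0.3, 0.1])
```

In practice, early fusion trains a single model on the joint vector, while the two ensemble schemes combine decisions from separately trained audio and text models; the paper compares all three on IEMOCAP.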
Team Data Science & Artificial Intelligence, Laboratory of Mathematics, Computer Science and Applications (LMCSA), Faculty of Sciences and Technologies, Hassan II University, Mohammedia, Morocco
Higher National School of Arts and Crafts, Hassan II University, Casablanca, Morocco