Efficient bimodal emotion recognition system based on speech/text embeddings and ensemble learning fusion
Emotion recognition (ER) is a pivotal discipline in contemporary human-machine interaction. Its primary objective is to explore and advance theories, systems, and methodologies that can effectively recognize, comprehend, and interpret human emotions. This research investigates both unimodal and bimodal strategies for ER using advanced feature embeddings for audio and text data. We leverage pretrained models, namely ImageBind for speech and RoBERTa for text, alongside traditional TF-IDF text embeddings, to achieve accurate recognition of emotional states. A variety of machine learning (ML) and deep learning (DL) algorithms were implemented to evaluate their performance in speaker-dependent (SD) and speaker-independent (SI) scenarios. Additionally, three fusion methods (early fusion, majority voting fusion, and stacking ensemble fusion) were employed for the bimodal emotion recognition (BER) task. Extensive numerical simulations were conducted to systematically address the complexities and challenges associated with both unimodal and bimodal ER. Our best results demonstrate an accuracy of 86.75% in the SD scenario and 64.04% in the SI scenario on the IEMOCAP database for the proposed BER system.
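The three fusion strategies the abstract names can be sketched as follows; this is a minimal illustration of the general techniques, not the paper's implementation, and the feature dimensions, emotion labels, and probability vectors are made-up placeholders.

```python
# Hedged sketch of three common bimodal fusion strategies:
# early fusion, majority voting fusion, and stacking ensemble fusion.
# All inputs below are illustrative placeholders, not the paper's data.
from collections import Counter

def early_fusion(audio_vec, text_vec):
    """Early fusion: concatenate per-modality embeddings into one
    joint feature vector before any classifier sees them."""
    return audio_vec + text_vec  # list concatenation

def majority_voting(predictions):
    """Majority voting fusion: each base classifier emits a label;
    the most frequent label wins (ties resolved by first occurrence)."""
    return Counter(predictions).most_common(1)[0][0]

def stacking_features(audio_probs, text_probs):
    """Stacking ensemble fusion: the base models' class-probability
    outputs are concatenated and fed to a second-level meta-learner
    (the meta-learner itself is omitted here)."""
    return audio_probs + text_probs

# Toy usage with hypothetical embeddings and emotion labels
audio_vec, text_vec = [0.1, 0.5, 0.2], [0.7, 0.3]
joint = early_fusion(audio_vec, text_vec)            # 5-dim joint vector
label = majority_voting(["happy", "sad", "happy"])   # -> "happy"
meta_input = stacking_features([0.8, 0.1, 0.1], [0.6, 0.3, 0.1])
```

In practice, early fusion trains a single model on the joint vector, while the two ensemble schemes combine decisions from separately trained audio and text models; the paper compares all three on IEMOCAP.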
Team Data Science & Artificial Intelligence, Laboratory of Mathematics, Computer Science and Applications (LMCSA), Faculty of Sciences and Technologies, Hassan II University, Mohammedia, Morocco
Higher National School of Arts and Crafts, Hassan II University, Casablanca, Morocco