Multimodal emotion recognition: integrating speech and text for improved valence, arousal, and dominance prediction
While speech emotion recognition (SER) has traditionally focused on classifying emotions into discrete categories such as happy or angry, recent research has shifted towards a dimensional approach based on the Valence-Arousal-Dominance model, which represents emotional state as continuous values. However, SER research consistently shows lower performance in predicting valence than arousal and dominance. To improve performance, we propose a multimodal system that combines acoustic and linguistic information, fusing speech and text data with the aim of outperforming traditional single-modality systems. Both early and late fusion techniques are investigated in this paper. Our findings show that combining modalities in a late fusion approach enhances system performance. In this late fusion architecture, the outputs of the acoustic deep learning network and the linguistic network are fed into two stacked dense neural network (NN) layers that predict valence, arousal, and dominance as continuous values. This approach yields a significant improvement in overall emotion recognition performance compared to prior methods.
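The sketch below illustrates the kind of late-fusion head the abstract describes: precomputed acoustic and linguistic embeddings are concatenated and passed through two stacked dense layers that regress the three continuous dimensions. The embedding sizes, hidden width, and class name are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class LateFusionVADHead(nn.Module):
    """Hypothetical late-fusion head: concatenates the outputs of an
    acoustic network and a linguistic network, then applies two stacked
    dense layers to predict valence, arousal, and dominance."""

    def __init__(self, acoustic_dim=1024, linguistic_dim=768, hidden_dim=256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(acoustic_dim + linguistic_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # valence, arousal, dominance
        )

    def forward(self, acoustic_emb, linguistic_emb):
        # Late fusion: modalities are encoded separately and only their
        # final representations are combined here.
        fused = torch.cat([acoustic_emb, linguistic_emb], dim=-1)
        return self.fusion(fused)

# Example: a batch of 4 utterances with precomputed modality embeddings.
head = LateFusionVADHead()
vad = head(torch.randn(4, 1024), torch.randn(4, 768))
print(vad.shape)  # torch.Size([4, 3])
```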