Multimodal emotion recognition: integrating speech and text for improved valence, arousal, and dominance prediction
While speech emotion recognition (SER) has traditionally focused on classifying emotions into discrete categories such as happy or angry, recent research has shifted towards a dimensional approach based on the Valence-Arousal-Dominance model, which represents emotional state as continuous values. However, SER research consistently shows lower performance in predicting valence than arousal and dominance. To improve performance, we propose a multimodal system that combines acoustic and linguistic information, fusing speech and text data with the aim of outperforming traditional single-modality systems. Both early and late fusion techniques are investigated in this paper. Our findings show that combining modalities in a late fusion approach enhances system performance. In this late fusion architecture, the outputs of the acoustic deep learning network and the linguistic network are fed into two stacked dense neural network (NN) layers that predict valence, arousal, and dominance as continuous values. This approach yields a significant improvement in overall emotion recognition performance compared to prior methods.
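The sketch below illustrates the kind of late-fusion head the abstract describes: precomputed acoustic and linguistic embeddings are concatenated and passed through two stacked dense layers that regress the three continuous dimensions. The embedding sizes, hidden width, and class name are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class LateFusionVADHead(nn.Module):
    """Hypothetical late-fusion head: concatenates the outputs of an
    acoustic network and a linguistic network, then applies two stacked
    dense layers to predict valence, arousal, and dominance."""

    def __init__(self, acoustic_dim=1024, linguistic_dim=768, hidden_dim=256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(acoustic_dim + linguistic_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # valence, arousal, dominance
        )

    def forward(self, acoustic_emb, linguistic_emb):
        # Late fusion: modalities are encoded separately and only their
        # final representations are combined here.
        fused = torch.cat([acoustic_emb, linguistic_emb], dim=-1)
        return self.fusion(fused)

# Example: a batch of 4 utterances with precomputed modality embeddings.
head = LateFusionVADHead()
vad = head(torch.randn(4, 1024), torch.randn(4, 768))
print(vad.shape)  # torch.Size([4, 3])
```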