Multimodal speech emotion recognition has attracted considerable attention in natural language processing and machine learning in recent years. Because data from different modalities are heterogeneous and inconsistent, effectively fusing multimodal information and learning efficient representations remains a challenge. This paper therefore proposes a new multimodal speech emotion recognition model based on temporal information modeling and cross-attention. First, a Temporal Convolutional Network (TCN) extracts deep temporal features from the speech, text, and video data, and a Bidirectional Gated Recurrent Unit (Bi-GRU) captures the contextual information of the sequences, improving the model's ability to understand sequential data. A multimodal fusion network is then built on a cross-attention mechanism and a Transformer to mine and capture the emotional information in the interactions among audio, text, and visual features. In addition, elastic net regularization is introduced during training to prevent overfitting, after which the emotion recognition task is performed. In classification experiments on the IEMOCAP dataset for four emotion classes (happiness, sadness, anger, and neutral), the accuracies are 87.6%, 84.1%, 87.5%, and 71.5%, and the F1 scores are 85.1%, 84.3%, 87.4%, and 71.4%, respectively; the weighted average accuracy is 80.75% and the unweighted average accuracy is 82.80%. The results show that the proposed method achieves good classification performance.
Multimodal emotion recognition based on TCN-Bi-GRU and cross-attention Transformer
Multimodal speech emotion recognition is one of the research directions that has received much attention in the fields of natural language processing and machine learning in recent years. Data from different modalities are heterogeneous and inconsistent, so effectively integrating information from different modalities and learning efficient representations is a challenge. Therefore, this article proposes a new multimodal speech emotion recognition model based on temporal information modeling and cross-attention. Firstly, a Temporal Convolutional Network (TCN) is used to extract deep temporal features of the speech, text, and video data, and a Bidirectional Gated Recurrent Unit (Bi-GRU) is used to capture contextual information of the sequence data, improving the model's ability to understand sequential data. Then, a multimodal fusion network is constructed based on the cross-attention mechanism and a Transformer to mine and capture the emotional information in the interaction between audio, text, and visual features. In addition, elastic net regularization is introduced during the training process to prevent overfitting of the model, and the emotion recognition task is then completed. In classification experiments on the IEMOCAP dataset for four types of emotions (happiness, sadness, anger, and neutrality), the accuracy rates are 87.6%, 84.1%, 87.5%, and 71.5%, respectively, and the F1 values are 85.1%, 84.3%, 87.4%, and 71.4%, respectively. The weighted average accuracy is 80.75%, and the unweighted average accuracy is 82.80%. The results indicate that the proposed method achieves good classification performance.
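As a rough illustration of the architecture described in the abstract, the following PyTorch sketch shows a per-modality encoder (dilated temporal convolutions followed by a Bi-GRU) and a cross-attention fusion block built on a Transformer encoder. The layer counts, channel sizes, dilation schedule, and attention-head settings are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of the pipeline described in the abstract:
# per-modality TCN -> Bi-GRU encoder, then cross-attention fusion.
# All hyperparameters below are placeholders, not the paper's values.
import torch
import torch.nn as nn


class TCNBiGRUEncoder(nn.Module):
    """Dilated temporal convolutions followed by a bidirectional GRU."""

    def __init__(self, in_dim, hid_dim, num_levels=3, kernel_size=3):
        super().__init__()
        layers, ch = [], in_dim
        for i in range(num_levels):
            dilation = 2 ** i
            layers += [
                nn.Conv1d(ch, hid_dim, kernel_size,
                          padding=(kernel_size - 1) * dilation // 2,
                          dilation=dilation),
                nn.ReLU(),
            ]
            ch = hid_dim
        self.tcn = nn.Sequential(*layers)
        self.bigru = nn.GRU(hid_dim, hid_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, x):                    # x: (batch, time, in_dim)
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.bigru(h)               # (batch, time, 2 * hid_dim)
        return out


class CrossModalFusion(nn.Module):
    """Cross-attention from one modality onto another, then a Transformer encoder."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads,
                                                batch_first=True)
        layer = nn.TransformerEncoderLayer(dim, num_heads,
                                           dim_feedforward=2 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, query_seq, context_seq):
        # Queries come from one modality; keys/values from another.
        attended, _ = self.cross_attn(query_seq, context_seq, context_seq)
        return self.encoder(attended)
```

In the full model, one such encoder per modality (audio, text, visual) would feed pairwise cross-attention blocks, whose outputs are pooled and passed to a classifier over the four emotion classes; the encoder output width (2 * hid_dim) would serve as the fusion dimension in this sketch.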
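Elastic net regularization, cited as the measure against overfitting, combines L1 and L2 penalties on the model parameters. A minimal sketch follows; the coefficients lambda1 and lambda2 are placeholders rather than values reported in the paper.

```python
# Illustrative elastic-net penalty added to the training loss.
import torch


def elastic_net_penalty(model, lambda1=1e-5, lambda2=1e-4):
    # L1 term encourages sparsity; L2 term discourages large weights.
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return lambda1 * l1 + lambda2 * l2


# Usage inside a training step (criterion being, e.g., nn.CrossEntropyLoss):
#   loss = criterion(logits, labels) + elastic_net_penalty(model)
#   loss.backward()
```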