Audio-visual bimodal emotion recognition is a research hotspot in affective computing. Existing emotion recognition methods cannot extract local and global video features simultaneously, fuse multi-modal data in overly simple ways, and use loss functions that do not focus on misclassified samples during model optimization, all of which lead to low recognition accuracy. This paper proposes an audio-visual emotion recognition method based on an improved ConvMixer and a focal loss function with dynamic weights. Spatial and temporal adjacency matrices replace the depthwise separable convolution in ConvMixer to extract global and local features of video in the spatial and temporal domains. A cross-modal temporal attention module with a symmetric structure is proposed to capture the temporal correlation between modalities and improve feature fusion. The dynamic weights of the focal loss are computed from the confusion matrix, and the contribution of misclassified samples to the loss is increased differentially to optimize the model parameters. Experimental results on public datasets show that the proposed method extracts representative features, optimizes the network structure effectively, and improves the accuracy of emotion recognition.
Key words
emotion recognition/ConvMixer/attention mechanism/multi-modal feature fusion/focal loss function
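The abstract states that the focal loss weights are derived from the confusion matrix so that misclassified samples contribute more to the loss, but it does not give the formula. The following is a minimal sketch under stated assumptions, not the authors' implementation: the weighting rule (one plus the per-class error rate), the function names `class_weights_from_confusion` and `focal_loss`, and the default `gamma=2.0` are all hypothetical choices made for illustration.

```python
import math


def class_weights_from_confusion(conf):
    """Derive one weight per class from a confusion matrix.

    conf[i][j] is the number of samples with true class i predicted as j.
    ASSUMPTION: weight = 1 + error rate of the class. The paper's actual
    rule is not given in the abstract.
    """
    weights = []
    for i, row in enumerate(conf):
        total = sum(row)
        error_rate = 1.0 - row[i] / total if total > 0 else 0.0
        weights.append(1.0 + error_rate)
    return weights


def focal_loss(probs, labels, weights, gamma=2.0, eps=1e-12):
    """Mean class-weighted focal loss over a batch.

    probs[n][c] is the predicted probability of class c for sample n,
    labels[n] is the true class index of sample n.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = min(max(p[y], eps), 1.0)  # clamp to avoid log(0)
        # (1 - p_t)^gamma down-weights samples that are already easy.
        total += -weights[y] * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(labels)


# Hypothetical two-class example: class 1 is confused more often than class 0.
conf = [[8, 2], [5, 5]]
weights = class_weights_from_confusion(conf)

probs = [[0.9, 0.1], [0.6, 0.4]]
labels = [0, 1]
loss = focal_loss(probs, labels, weights)
```

In this sketch the weights would be recomputed from a fresh confusion matrix at intervals during training (for example once per epoch), which is what makes them dynamic. With `gamma=0` and all weights set to 1, the function reduces to ordinary cross-entropy.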