Improved ConvMixer and Focal Loss with Dynamic Weight for Audio-Visual Emotion Recognition
Audio-visual bimodal emotion recognition is a research hotspot in the field of affective computing. At present, emotion recognition methods cannot simultaneously extract local and global features from video, multi-modal data fusion remains simplistic, and the loss function fails to focus on misclassified samples during model optimization, resulting in low accuracy of emotion recognition. This paper proposes an audio-visual emotion recognition method based on an improved ConvMixer and a focal loss function with dynamic weights. Spatial and temporal adjacency matrices replace the depthwise separable convolution in ConvMixer to extract global and local features in the spatial and temporal domains of video. A cross-modal temporal attention module with a symmetric structure is proposed to capture the temporal correlation between modalities and improve feature fusion. The focal loss function with dynamic weights is computed from the confusion matrix, differentially increasing the contribution of misclassified samples to the loss so as to better optimize the model parameters. Experimental results on public datasets show that the proposed method extracts representative features, effectively optimizes the network structure, and improves the accuracy of emotion recognition.
emotion recognition; ConvMixer; attention mechanism; multi-modal feature fusion; focal loss function
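To make the dynamic-weight focal loss described in the abstract concrete, the following is a minimal sketch of one plausible formulation: per-class weights are derived from the error rates of a running confusion matrix, so classes that are misclassified more often contribute more to the loss. The exact weighting and normalization scheme here is an assumption for illustration, not the paper's published formulation.

```python
import torch
import torch.nn.functional as F

def dynamic_weight_focal_loss(logits, targets, confusion_matrix, gamma=2.0, eps=1e-8):
    """Focal loss with per-class weights derived from a confusion matrix.

    logits: (batch, num_classes) raw scores
    targets: (batch,) integer class labels
    confusion_matrix: (num_classes, num_classes) counts, rows = true class
    Note: the weighting scheme below (1 + normalized error rate) is an
    illustrative assumption, not the authors' exact formula.
    """
    # Per-class error rate: 1 - correct / total for each true class
    per_class_total = confusion_matrix.sum(dim=1).clamp(min=eps)
    per_class_correct = confusion_matrix.diag()
    error_rate = 1.0 - per_class_correct / per_class_total  # (num_classes,)

    # Classes with higher error rates receive larger weights
    alpha = 1.0 + error_rate / error_rate.sum().clamp(min=eps)

    log_prob = F.log_softmax(logits, dim=-1)
    prob = log_prob.exp()
    target_log_prob = log_prob.gather(1, targets.unsqueeze(1)).squeeze(1)
    target_prob = prob.gather(1, targets.unsqueeze(1)).squeeze(1)

    # Standard focal modulation (1 - p_t)^gamma, scaled by the dynamic class weight
    focal_term = (1.0 - target_prob) ** gamma
    loss = -alpha[targets] * focal_term * target_log_prob
    return loss.mean()

# Example usage: the confusion matrix would typically be accumulated from
# validation predictions at the end of each epoch and then reused here.
if __name__ == "__main__":
    num_classes = 7
    logits = torch.randn(4, num_classes)
    targets = torch.tensor([0, 2, 5, 1])
    cm = torch.randint(0, 20, (num_classes, num_classes)).float()
    print(dynamic_weight_focal_loss(logits, targets, cm))
```

In this reading, the focal term down-weights easy samples as in the standard focal loss, while the confusion-matrix-derived weight raises the contribution of classes the model currently confuses, which matches the abstract's goal of increasing the share of misclassified samples in the loss.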