[Objective] To effectively utilize information contained in audio and video and to fully capture the multi-modal interactions among text, image, and audio, this study proposes a multi-modal sentiment analysis model for online users (TIsA) that incorporates text, image, and STFT-CNN audio feature extraction. [Methods] First, we separated the video data into audio and image data. Then, we used BERT and BiLSTM to obtain text feature representations and applied the short-time Fourier transform (STFT) to convert audio time-domain signals to the frequency domain. We also utilized CNNs to extract audio and image features. Finally, we fused the features from the three modalities. [Results] We conducted empirical research on the "9.5 Luding Earthquake" public sentiment data from Sina Weibo. The proposed TIsA model achieved an accuracy, macro-averaged recall, and macro-averaged F1 score of 96.10%, 96.20%, and 96.10%, respectively, outperforming related baseline models. [Limitations] This study did not examine in depth the effects of different fusion strategies on sentiment recognition results. [Conclusions] The proposed TIsA model demonstrates high accuracy in processing audio-containing videos, effectively supporting online public opinion analysis.
Emotion Recognition; Multi-Modal; Deep Learning; Online Public Opinion; Netizen Sentiment
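
The STFT-CNN audio step described in the Methods (converting the time-domain signal to the frequency domain, then extracting features with a CNN) can be illustrated with a minimal sketch. This is not the paper's implementation: the choice of PyTorch, the `AudioBranch` name, and all layer sizes and STFT parameters (`n_fft`, `hop_length`, `feat_dim`) below are illustrative assumptions.

```python
# Minimal sketch of an STFT-CNN audio feature branch (assumed PyTorch;
# hyperparameters are illustrative, not the paper's configuration).
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    def __init__(self, n_fft=512, hop_length=256, feat_dim=128):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # Small CNN over the magnitude spectrogram (single input channel).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(32 * 4 * 4, feat_dim)

    def forward(self, waveform):  # waveform: (batch, samples)
        # STFT: time-domain signal -> complex frequency-domain representation.
        spec = torch.stft(
            waveform,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft, device=waveform.device),
            return_complex=True,
        )
        mag = spec.abs().unsqueeze(1)       # (batch, 1, freq_bins, frames)
        feats = self.cnn(mag).flatten(1)    # (batch, 32 * 4 * 4)
        return self.proj(feats)             # (batch, feat_dim)

if __name__ == "__main__":
    x = torch.randn(2, 16000)               # two 1-second clips at 16 kHz
    print(AudioBranch()(x).shape)            # torch.Size([2, 128])
```

In the full TIsA pipeline, the resulting audio vector would then be fused (e.g., by concatenation) with the BERT-BiLSTM text features and the CNN image features before sentiment classification.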