[Objective] To effectively utilize information contained in audio and video and to fully capture the multi-modal interactions among text, image, and audio, this study proposes a multi-modal sentiment analysis model for online users (TIsA) that incorporates text, image, and STFT-CNN audio feature extraction. [Methods] First, we separated the video data into audio and image data. Then, we used BERT and BiLSTM to obtain text feature representations and applied the short-time Fourier transform (STFT) to convert audio time-domain signals to the frequency domain. We also utilized CNNs to extract audio and image features. Finally, we fused the features from the three modalities. [Results] We conducted empirical research on the "9.5 Luding Earthquake" public sentiment data from Sina Weibo. The proposed TIsA model achieved an accuracy, macro-averaged recall, and macro-averaged F1 score of 96.10%, 96.20%, and 96.10%, respectively, outperforming related baseline models. [Limitations] This study did not examine in depth the effects of different fusion strategies on sentiment recognition results. [Conclusions] The proposed TIsA model demonstrates high accuracy in processing audio-containing videos, effectively supporting online public opinion analysis.
Emotion Recognition; Multi-Modal; Deep Learning; Online Public Opinion; Netizen Sentiment
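
The STFT-CNN audio step described in the Methods (converting the time-domain signal to the frequency domain, then extracting features with a CNN) can be illustrated with a minimal sketch. This is not the paper's implementation: the choice of PyTorch, the `AudioBranch` name, and all layer sizes and STFT parameters (`n_fft`, `hop_length`, `feat_dim`) below are illustrative assumptions.

```python
# Minimal sketch of an STFT-CNN audio feature branch (assumed PyTorch;
# hyperparameters are illustrative, not the paper's configuration).
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    def __init__(self, n_fft=512, hop_length=256, feat_dim=128):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # Small CNN over the magnitude spectrogram (single input channel).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(32 * 4 * 4, feat_dim)

    def forward(self, waveform):  # waveform: (batch, samples)
        # STFT: time-domain signal -> complex frequency-domain representation.
        spec = torch.stft(
            waveform,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft, device=waveform.device),
            return_complex=True,
        )
        mag = spec.abs().unsqueeze(1)       # (batch, 1, freq_bins, frames)
        feats = self.cnn(mag).flatten(1)    # (batch, 32 * 4 * 4)
        return self.proj(feats)             # (batch, feat_dim)

if __name__ == "__main__":
    x = torch.randn(2, 16000)               # two 1-second clips at 16 kHz
    print(AudioBranch()(x).shape)            # torch.Size([2, 128])
```

In the full TIsA pipeline, the resulting audio vector would then be fused (e.g., by concatenation) with the BERT-BiLSTM text features and the CNN image features before sentiment classification.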