Multi-subspace multimodal sentiment analysis method based on Transformer

Multimodal sentiment analysis refers to recognizing the emotions expressed by people in a video through textual, visual, and acoustic information. Most existing methods learn multimodal consistency information by designing complex fusion schemes while ignoring inter-modal and intra-modal differentiated information, so the multimodal fusion representation lacks complementary information. To address this, we propose a multi-subspace Transformer fusion network for multimodal sentiment analysis (MSTFN). The method maps each modality into a private subspace and a shared subspace to obtain private and shared representations, learning the differentiated and unified information of every modality. First, the initial feature representation of each modality is mapped into its own private and shared subspaces to learn a private representation containing modality-specific information and a shared representation containing unified information. Second, to strengthen the roles of the textual and acoustic modalities, a binary collaborative attention cross-modal Transformer module is designed to obtain text- and audio-based trimodal representations. Then, the final representation of each modality is generated from its private and shared representations, and these are fused pairwise into bimodal representations to further complement the multimodal fusion representation. Finally, the unimodal, bimodal, and trimodal representations are concatenated as the final multimodal feature for sentiment prediction. Experimental results on two benchmark multimodal sentiment analysis datasets show that, compared with the best baseline method, the proposed method improves binary classification accuracy by 0.025 6/0.014 3 and 0.000 7/0.002 3, respectively.
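To make the private/shared subspace idea in the abstract concrete, the following is a minimal PyTorch sketch, assuming pooled per-modality features and simple concatenation-based fusion. All module names, feature dimensions, and the stand-in for the binary collaborative attention cross-modal Transformer (stubbed here by concatenation) are illustrative assumptions, not the authors' MSTFN implementation.

```python
# Illustrative sketch only: names, dimensions and the simplified fusion
# operators are assumptions, not the paper's released code.
import torch
import torch.nn as nn


class SubspaceProjector(nn.Module):
    """Maps one modality's pooled feature into a private and a shared subspace."""

    def __init__(self, in_dim: int, sub_dim: int):
        super().__init__()
        self.private = nn.Sequential(nn.Linear(in_dim, sub_dim), nn.Tanh())
        self.shared = nn.Sequential(nn.Linear(in_dim, sub_dim), nn.Tanh())

    def forward(self, x: torch.Tensor):
        # x: (batch, in_dim) pooled unimodal feature
        return self.private(x), self.shared(x)


class MultiSubspaceFusionHead(nn.Module):
    """Concatenates unimodal, bimodal and trimodal representations for prediction."""

    def __init__(self, dims: dict, sub_dim: int = 64):
        super().__init__()
        self.projectors = nn.ModuleDict(
            {m: SubspaceProjector(d, sub_dim) for m, d in dims.items()}
        )
        uni = 2 * sub_dim   # per-modality final representation: [private ; shared]
        bi = 2 * uni        # pairwise concatenation of two modalities
        tri = 3 * uni       # stand-in for the cross-modal Transformer output
        self.classifier = nn.Linear(3 * uni + 3 * bi + tri, 1)  # sentiment score

    def forward(self, feats: dict):
        # Final per-modality representation = [private ; shared]
        uni = {m: torch.cat(self.projectors[m](x), dim=-1) for m, x in feats.items()}
        t, a, v = uni["text"], uni["audio"], uni["vision"]
        # Bimodal representations by pairwise fusion (here: simple concatenation)
        bi = [torch.cat(p, dim=-1) for p in [(t, a), (t, v), (a, v)]]
        # Trimodal representation; the paper uses a binary collaborative attention
        # cross-modal Transformer, stubbed here by concatenation
        tri = torch.cat([t, a, v], dim=-1)
        fused = torch.cat([t, a, v, *bi, tri], dim=-1)
        return self.classifier(fused)


# Usage with random features (input dimensions are placeholders)
model = MultiSubspaceFusionHead({"text": 768, "audio": 74, "vision": 35})
out = model({"text": torch.randn(8, 768),
             "audio": torch.randn(8, 74),
             "vision": torch.randn(8, 35)})
print(out.shape)  # torch.Size([8, 1])
```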

Keywords: multimodal sentiment analysis; Transformer structure; multiple subspaces; multi-head attention mechanism

田昌宁、贺昱政、王笛、万波、郭栩彤


School of Computer Science and Technology, Xidian University, Xi'an 710071, Shaanxi, China

The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang 050081, Hebei, China


Funding: National Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0117103); Fundamental Research Funds for the Central Universities (QTZX23084); General Program of the National Natural Science Foundation of China (62072354)

2024

Journal of Northwest University (Natural Science Edition)
Northwest University

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.35
ISSN: 1000-274X
Year, Volume (Issue): 2024, 54(2)