Journal of Computer Research and Development, 2024, Vol. 61, Issue 5: 1310-1324. DOI: 10.7544/issn1000-1239.202330722

Multimodal Learning Method Based on Intra- and Inter-Sample Cooperative Representation and Adaptive Fusion

Huang Xuejian¹, Ma Tinghuai², Wang Gensheng³

Author Information
  • 1. Modern Industry College of Virtual Reality (VR), Jiangxi University of Finance and Economics, Nanchang 330013; School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044
  • 2. School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044
  • 3. School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013


Abstract

Multimodal machine learning represents a novel paradigm in artificial intelligence, leveraging various modalities and intelligent processing algorithms to achieve enhanced performance. Multimodal representation and fusion are two pivotal tasks in multimodal machine learning. Currently, most multimodal representation methods pay little attention to inter-sample collaboration, leading to a lack of robustness in feature representation, and most multimodal feature fusion methods are sensitive to noisy data. Therefore, for multimodal representation, an approach based on both intra-sample and inter-sample multimodal collaboration is proposed to fully learn the interactions within and between modalities and to enhance the robustness of feature representation. Firstly, text, speech, and visual features are extracted with the pre-trained models BERT, Wav2vec 2.0, and Faster R-CNN, respectively. Subsequently, considering the complementarity and consistency of multimodal data, two categories of encoders, modality-specific and modality-shared, are constructed to learn modality-specific and shared feature representations. Furthermore, intra-sample collaboration loss functions are formulated using central moment discrepancy and orthogonality, while inter-sample collaboration loss functions are established using contrastive learning. Lastly, a representation learning objective is designed based on the intra-sample collaboration, inter-sample collaboration, and sample reconstruction errors. Regarding multimodal fusion, an adaptive multimodal feature fusion method based on attention mechanisms and gated neural networks is designed, accounting for the possibility that each modality may exhibit different types of effects and different levels of noise at different times. Experimental results on the multimodal intent recognition dataset MIntRec and the emotion datasets CMU-MOSI and CMU-MOSEI demonstrate that this multimodal learning approach outperforms baseline methods across multiple evaluation metrics.
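The representation-learning losses outlined in the abstract can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the moment order `k`, the norm used in the central moment discrepancy, and the InfoNCE form of the inter-sample contrastive loss are simplifying assumptions.

```python
import numpy as np

def cmd_loss(x, y, k=3):
    """Central moment discrepancy between two feature matrices of shape (n, d):
    distance between the means plus distances between the first k central moments."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    loss = np.linalg.norm(mx - my)
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):
        loss += np.linalg.norm((cx ** order).mean(axis=0) - (cy ** order).mean(axis=0))
    return loss

def orthogonality_loss(specific, shared):
    """Push modality-specific and modality-shared representations apart:
    squared Frobenius norm of their cross-correlation matrix."""
    return np.sum((specific.T @ shared) ** 2)

def info_nce(anchors, positives, temperature=0.1):
    """Inter-sample contrastive loss: each anchor should match its own positive
    against every other sample in the batch (InfoNCE)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # (n, n) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

In this sketch, `cmd_loss` and `orthogonality_loss` play the role of the intra-sample collaboration terms (aligning shared representations across modalities while keeping specific ones distinct), and `info_nce` plays the role of the inter-sample term; the reconstruction error would be added separately.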
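The adaptive fusion step can likewise be sketched: attention weights score each modality's contribution while a sigmoid gate can suppress a noisy modality before the weighted sum. The single attention weight vector, the scalar gate per modality, and the placement of the gate are illustrative assumptions, not the paper's gated-network architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_fusion(modalities, attn_w, gate_w, gate_b):
    """Fuse a list of per-modality feature vectors (each of shape (d,)) by
    (1) attention weights over modalities and (2) a per-modality sigmoid gate,
    then summing the gated, weighted features into one fused vector."""
    feats = np.stack(modalities)              # (m, d)
    scores = feats @ attn_w                   # (m,) one relevance score per modality
    alpha = softmax(scores)                   # attention distribution over modalities
    gates = sigmoid(feats @ gate_w + gate_b)  # (m,) gate in (0, 1) per modality
    return ((alpha * gates)[:, None] * feats).sum(axis=0)  # fused vector (d,)
```

With zero attention and gate weights, every modality receives equal attention (1/m) and a half-open gate, so the fused vector is simply a scaled average; learned weights would shift attention toward informative modalities and close the gate on noisy ones.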


Key words

multimodal representation; multimodal fusion; multimodal learning; collaborative representation; adaptive fusion


Funding

National Natural Science Foundation of China (62372243)

National Natural Science Foundation of China (72061015)

National Natural Science Foundation of China (62102187)

Publication Year

2024

Journal of Computer Research and Development
Sponsored by the Institute of Computing Technology, Chinese Academy of Sciences, and the China Computer Federation
Indexed in CSTPCD and CSCD; Peking University Core Journals list
Impact factor: 2.649
ISSN: 1000-1239
References: 35