Research on Modality Fusion Strategies for a Multimodal Video Classification Task
Despite the success of AI-related technologies in many fields, they usually simulate only one type of human perception, which means they are limited to processing information from a single modality. Extracting features from multiple modalities and fusing them effectively is important for developing general AI. In this paper, a comparative study of different multimodal information fusion strategies is conducted on a video classification task, based on an encoder-decoder architecture: early feature fusion for encoding multimodal features, late decision fusion of the prediction results of each modality, and a combination of both. This paper also compares two ways of involving audio-modal information in modality fusion, i.e., encoding the audio directly as features before fusion, or converting the speech to text and fusing it in text form. Experiments show that, under the experimental approach of this study, decision fusion of the prediction results of the text and audio modalities alone with those of the fused features of the other two modalities can further improve classification accuracy. Moreover, converting speech into text-modal information by ASR (Automatic Speech Recognition) makes fuller use of the semantic information it contains.
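For concreteness, the following is a minimal PyTorch sketch contrasting the two strategies compared above: early feature fusion (concatenating per-modality features before a shared encoder) and late decision fusion (combining per-modality predictions). The feature dimensions, module names, three-modality setup, and the averaging rule are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10  # hypothetical number of video categories


class EarlyFusionClassifier(nn.Module):
    """Early feature fusion: concatenate modality features, then encode and classify jointly."""

    def __init__(self, video_dim=512, text_dim=256, audio_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(video_dim + text_dim + audio_dim, 512),
            nn.ReLU(),
        )
        self.head = nn.Linear(512, NUM_CLASSES)

    def forward(self, video_feat, text_feat, audio_feat):
        fused = torch.cat([video_feat, text_feat, audio_feat], dim=-1)  # fuse at the feature level
        return self.head(self.encoder(fused))


class LateFusionClassifier(nn.Module):
    """Late decision fusion: classify each modality separately, then combine the predictions."""

    def __init__(self, video_dim=512, text_dim=256, audio_dim=128):
        super().__init__()
        self.video_head = nn.Linear(video_dim, NUM_CLASSES)
        self.text_head = nn.Linear(text_dim, NUM_CLASSES)
        self.audio_head = nn.Linear(audio_dim, NUM_CLASSES)

    def forward(self, video_feat, text_feat, audio_feat):
        logits = torch.stack([
            self.video_head(video_feat),
            self.text_head(text_feat),
            self.audio_head(audio_feat),
        ])
        return logits.mean(dim=0)  # fuse at the decision level by averaging logits
```

A combined strategy, as studied in the paper, would feed the early-fused representation and the individual modality predictions into a final decision-fusion step; the sketch above only isolates the two basic building blocks.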