
Research on Question and Answer Models in Dynamic Audio-visual Scenarios

The real world consists of a variety of modalities, and information from different modalities is interrelated and complementary. Exploring the relationships and characteristics between modalities can effectively compensate for the limitations of any single modality. Research on question answering (QA) models in dynamic audio-visual scenarios aims to use the multimodal information in videos to answer questions about visual objects, sounds, and their relationships, giving artificial intelligence scene-perception and spatio-temporal reasoning capabilities. To address inaccurate audio-visual QA, a spatio-temporal question answering model is proposed that combines spatial fusion modelling and temporal fusion modelling to integrate multimodal features and improve QA accuracy. First, visual, audio, and text features are extracted with ResNet-18, VGGish, and Bi-LSTM, respectively. Second, based on the relationship between sound and vision, an early spatial fusion of the audio and visual modalities is performed, and a joint attention mechanism fuses the features after mutually assisted learning, enhancing their complementarity. Finally, an attention mechanism is added after fusion to strengthen the correlation between the fused features and the question text. Experiments on the MUSIC-AVQA dataset reach an accuracy of 73.49%, demonstrating improved scene perception and spatio-temporal reasoning.
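The fusion pipeline the abstract describes (per-modality feature extraction, joint cross-modal attention for mutual learning, then question-guided attention over the fused features) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, shapes, and residual combination are assumptions, and NumPy stands in for the ResNet-18/VGGish/Bi-LSTM extractors, whose outputs are mocked as random feature sequences.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value):
    # scaled dot-product attention: each query step attends over key_value
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)    # (Tq, Tk)
    return softmax(scores, axis=-1) @ key_value  # (Tq, d)

def joint_attention_fusion(audio, visual):
    # joint attention (assumed form): each modality attends to the other,
    # and a residual add keeps the original modality information
    a2v = cross_attention(audio, visual)   # audio queries visual context
    v2a = cross_attention(visual, audio)   # visual queries audio context
    return np.concatenate([audio + a2v, visual + v2a], axis=0)  # (2T, d)

def question_guided_pooling(fused, question_vec):
    # post-fusion attention: weight fused features by relevance to the question
    scores = fused @ question_vec / np.sqrt(fused.shape[-1])  # (2T,)
    weights = softmax(scores)
    return weights @ fused  # (d,) answer representation

# mocked extractor outputs: T timesteps, d-dim features
T, d = 10, 64
rng = np.random.default_rng(0)
audio = rng.standard_normal((T, d))     # stand-in for VGGish features
visual = rng.standard_normal((T, d))    # stand-in for ResNet-18 features
question = rng.standard_normal(d)       # stand-in for Bi-LSTM question encoding

fused = joint_attention_fusion(audio, visual)           # (2T, d) = (20, 64)
answer_feat = question_guided_pooling(fused, question)  # (64,)
```

In a real system the pooled `answer_feat` would feed a classifier over the answer vocabulary; here the sketch only shows how the two fusion stages compose.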

Keywords: audio-visual question and answer; multimodal fusion; joint attention mechanism; Bi-directional Long Short-Term Memory (Bi-LSTM); MUSIC-AVQA

DUAN Maomao, LIAN Peiyu, SHI Haitao (段毛毛, 连培榆, 史海涛)


School of Petroleum, China University of Petroleum-Beijing at Karamay, Karamay, Xinjiang 834000, China


Funding: Karamay City Special Project for Innovative Talents (克拉玛依市创新人才专项), grant No. XQZX20220047

2024

计算机技术与发展 (Computer Technology and Development)

陕西省计算机学会 (Shaanxi Computer Society)

CSTPCD
Impact factor: 0.621
ISSN:1673-629X
Year, volume (issue): 2024, 34(3)