The real world consists of a variety of modalities, and information from different modalities is interrelated and complementary. Exploring the relationships and characteristics between modalities can effectively compensate for the limitations of any single modality. Research on dynamic audio-visual question answering (QA) models aims to use the multimodal information in videos to answer questions about visual objects, sounds, and their relationships, enabling artificial intelligence to achieve scene understanding and spatio-temporal reasoning. To address the problem of imprecise audio-visual QA, a spatio-temporal question answering model is proposed. The model combines spatial fusion modelling and temporal fusion modelling to integrate multimodal features and improve answering accuracy. First, audio, video, and text features are extracted with ResNet-18, VGGish, and Bi-LSTM, respectively. Second, an early fusion approach spatially fuses the audio and video modalities according to their relationship. Then, a joint attention mechanism fuses the features after mutual learning to enhance their complementarity. Finally, a post-fusion attention mechanism is added to strengthen the correlation between the fused features and the text. Experimental results on the MUSIC-AVQA dataset show an accuracy of 73.49%, indicating that the proposed model improves scene understanding and spatio-temporal reasoning.
Key words
audio-visual question answering/multimodal fusion/joint attention mechanism/Bi-directional Long Short-Term Memory (Bi-LSTM)/MUSIC-AVQA
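To make the pipeline described in the abstract concrete, the following is a minimal PyTorch-style sketch of such an architecture: ResNet-18 frame features, precomputed VGGish audio embeddings, a Bi-LSTM question encoder, early audio-visual fusion, joint (cross) attention between the two modalities, and question-guided post-fusion attention. All dimensions, module names, fusion details, and the answer-vocabulary size are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a spatio-temporal audio-visual QA pipeline of the kind
# described in the abstract. Shapes, module names, and fusion details are
# assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class AVQASketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, num_classes=42):
        super().__init__()
        # Visual encoder: ResNet-18 backbone, one 512-d feature per sampled frame.
        backbone = resnet18(weights=None)
        self.visual = nn.Sequential(*list(backbone.children())[:-1])
        # Audio: VGGish embeddings (128-d per segment) are assumed precomputed;
        # a linear layer projects them to the shared dimension.
        self.audio_proj = nn.Linear(128, d_model)
        # Question encoder: word embeddings + Bi-LSTM (2 x 256 = 512-d output).
        self.embed = nn.Embedding(vocab_size, 300)
        self.bilstm = nn.LSTM(300, d_model // 2, batch_first=True,
                              bidirectional=True)
        # Joint attention: audio and video attend to each other (cross-attention).
        self.a2v = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # Post-fusion attention: question-guided attention over fused AV features.
        self.q2av = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, frames, audio_emb, question):
        # frames: (B, T, 3, 224, 224); audio_emb: (B, T, 128); question: (B, L)
        B, T = frames.shape[:2]
        v = self.visual(frames.flatten(0, 1)).flatten(1).view(B, T, -1)  # (B, T, 512)
        a = self.audio_proj(audio_emb)                                   # (B, T, 512)
        # Early fusion: element-wise mixing of temporally aligned audio and
        # visual features (one simple choice among many).
        v = v + a
        # Joint attention: each modality queries the other, then the two
        # streams are combined by addition.
        av = self.a2v(a, v, v)[0] + self.v2a(v, a, a)[0]                 # (B, T, 512)
        q, _ = self.bilstm(self.embed(question))                         # (B, L, 512)
        q_vec = q.mean(dim=1, keepdim=True)                              # (B, 1, 512)
        # Post-fusion attention: the question vector attends over the fused
        # audio-visual sequence to pick out question-relevant moments.
        ctx = self.q2av(q_vec, av, av)[0].squeeze(1)                     # (B, 512)
        return self.classifier(torch.cat([ctx, q_vec.squeeze(1)], dim=-1))


# Example forward pass with dummy tensors (shapes are illustrative).
model = AVQASketch()
logits = model(torch.randn(2, 4, 3, 224, 224), torch.randn(2, 4, 128),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 42])
```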