Research on Question Answering Models in Dynamic Audio-visual Scenarios
The real world consists of a variety of modalities, and information from different modalities is interrelated and complementary. Exploring the relationships and characteristics between modalities can effectively compensate for the limitations of any single modality. Research on dynamic audio-visual question answering (QA) models aims to use the multimodal information in videos to answer questions about visual objects, sounds, and their relationships, enabling artificial intelligence to achieve scene understanding and spatio-temporal reasoning. To address the limited accuracy of existing audio-visual QA methods, a spatio-temporal question answering model is proposed. The model combines spatial fusion modelling and temporal fusion modelling to integrate multimodal features and improve QA accuracy. Firstly, visual, audio and text features are extracted using ResNet-18, VGGish and Bi-LSTM, respectively. Secondly, an early fusion approach spatially fuses the audio and visual modalities according to their relationship. Then, a joint attention mechanism fuses the features after mutual learning between the two modalities to enhance their complementarity. Finally, a post-fusion attention mechanism strengthens the correlation between the fused audio-visual features and the question text. Experimental results on the MUSIC-AVQA dataset show an accuracy of 73.49%, demonstrating the improved scene understanding and spatio-temporal reasoning achieved by the proposed model.
Keywords: audio-visual question answering; multimodal fusion; joint attention mechanism; Bi-directional Long Short-Term Memory; MUSIC-AVQA
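As a concrete illustration of the pipeline described in the abstract, the following is a minimal PyTorch sketch of the three fusion stages: early spatial fusion of audio and visual features, joint attention for mutual learning between the two modalities, and post-fusion attention conditioned on the question text. The feature dimension, the use of nn.MultiheadAttention, and all module names are assumptions made for illustration; the abstract does not specify the paper's exact formulation or hyperparameters.

```python
import torch
import torch.nn as nn


class SpatioTemporalAVQAFusion(nn.Module):
    # Hypothetical sketch of the fusion stages described in the abstract;
    # dim, num_heads and the attention formulation are illustrative assumptions.
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Early (spatial) fusion: project concatenated audio/visual features.
        self.spatial_fuse = nn.Linear(2 * dim, dim)
        # Joint attention: each modality attends to (learns from) the other.
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Post-fusion attention: the question text attends to the fused features.
        self.text_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, video, text):
        # audio, video: (batch, T, dim) per-segment features, e.g. VGGish audio
        # embeddings and ResNet-18 frame features projected to a common dim.
        # text: (batch, L, dim) question features, e.g. from a Bi-LSTM.
        # 1) Early spatial fusion of the two modalities.
        fused = self.spatial_fuse(torch.cat([audio, video], dim=-1))
        # 2) Joint attention: mutual learning between audio and video.
        a_enriched, _ = self.audio_to_video(audio, video, video)
        v_enriched, _ = self.video_to_audio(video, audio, audio)
        joint = fused + a_enriched + v_enriched
        # 3) Post-fusion attention conditions the result on the question.
        out, _ = self.text_attention(text, joint, joint)
        return out  # (batch, L, dim); pooled and classified downstream


# Usage with random tensors standing in for real features.
model = SpatioTemporalAVQAFusion()
audio = torch.randn(2, 10, 512)   # 10 one-second audio segments
video = torch.randn(2, 10, 512)   # 10 sampled video frames
text = torch.randn(2, 14, 512)    # a 14-token question
print(model(audio, video, text).shape)  # torch.Size([2, 14, 512])
```

In a design of this kind, combining the early-fused features with the cross-attended ones residually lets the model retain modality-specific cues while still benefiting from cross-modal interaction; the output would typically be pooled and passed to a classifier over the answer vocabulary.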