Research on Visual Question Answering Fusing Visual Grounding Information
To enhance a Visual Question Answering (VQA) model's ability to capture question-relevant information in images, Visual Grounding (VG) information is introduced to augment the model's understanding of the complete image context. Semantic features from the image and shallow textual features are integrated into an image-based text encoder, which maps the textual features into the image space. The resulting textual features and the image features are then fed into a text-based image decoder to generate the VG information. Experimental results demonstrate that the model achieves the best performance across four evaluation metrics: Accuracy, Open, Binary, and Consistency, with improvements of 0.84%, 0.74%, 3.38%, and 2.95%, respectively. In particular, Accuracy reaches 56.94%, indicating that VG information effectively increases the proportion of question-relevant information in the image features.
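The step of mapping textual features into the image space through an image-based text encoder is commonly realized with cross-attention, where each question token attends over image patch features. The abstract does not specify the exact mechanism, so the following is only an illustrative NumPy sketch under that assumption (single head, no learned projections; all array shapes are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, image_feats):
    """Map text features into the image space: each text token attends
    over image patch features and returns a weighted mix of them.
    Single-head, projection-free -- a simplification for illustration."""
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)  # (tokens, patches)
    attn = softmax(scores, axis=-1)                   # rows sum to 1
    return attn @ image_feats                         # (tokens, dim)

rng = np.random.default_rng(0)
text_feats = rng.standard_normal((8, 64))    # 8 question tokens (assumed)
image_feats = rng.standard_normal((49, 64))  # 7x7 grid of image patches (assumed)
grounded = cross_attention(text_feats, image_feats)
print(grounded.shape)  # (8, 64): text features expressed in the image space
```

In a full model these grounded textual features, together with the image features, would feed the text-based image decoder that produces the VG information; the learned query/key/value projections and multi-head structure omitted here would be part of that encoder.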