Cross-Modal Fusion Visual Question Answering Method Based on Visual-Language Feature Encoding
Existing Visual Question Answering (VQA) methods use the same encoder for both visual and language features, ignoring the differences between the two modalities. As a result, visual noise irrelevant to the question is introduced during visual feature encoding, and key visual features receive insufficient attention. To address this problem, a cross-modal fusion VQA method based on visual-language feature encoding is proposed. A dynamic attention mechanism encodes the visual features and adjusts their attention span according to the question, and a guided attention mechanism with dual-gate control is designed to filter out interfering information introduced during multimodal fusion, improving both the quality of the fusion and the representational power of the fused features. On the Test-dev and Test-std splits of the public VQA-2.0 dataset, the proposed method achieves accuracies of 71.73% and 71.94%, respectively, improvements of 1.10% and 1.04% over the baseline method. These results show that the proposed method improves the accuracy of answer prediction in visual question answering tasks.
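As an illustration only, and not the authors' implementation, the sketch below shows one way a dual-gated guided-attention unit of the kind described above could be structured: visual region features attend to question features, and two learned sigmoid gates then decide how much of the attended cross-modal signal and of the original visual signal to keep. All module names, dimensions, and the exact gate placement are assumptions.

```python
# Hypothetical sketch of a dual-gated guided-attention unit (illustration only;
# names, dimensions, and gating placement are assumptions, not the paper's code).
import torch
import torch.nn as nn


class DualGatedGuidedAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Question-guided attention: visual features attend to question features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Two gates: one filters the attended (cross-modal) signal,
        # the other controls how much of the original visual signal passes through.
        self.gate_attn = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_res = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # visual:   (batch, num_regions, dim)
        # question: (batch, num_tokens,  dim)
        attended, _ = self.attn(query=visual, key=question, value=question)
        gate_in = torch.cat([visual, attended], dim=-1)
        g_a = self.gate_attn(gate_in)  # suppress question-irrelevant fused features
        g_r = self.gate_res(gate_in)   # keep only the useful part of the visual residual
        return self.norm(g_a * attended + g_r * visual)


if __name__ == "__main__":
    v = torch.randn(2, 36, 512)  # e.g., 36 region features per image
    q = torch.randn(2, 14, 512)  # e.g., 14 question tokens
    print(DualGatedGuidedAttention()(v, q).shape)  # torch.Size([2, 36, 512])
```

The intent of the two gates in this sketch is the filtering role described in the abstract: rather than adding the attended features directly to the visual features, each path is scaled elementwise so that question-irrelevant information contributes less to the fused representation.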