Cross-modal Information Filtering-based Networks for Visual Question Answering
As a multi-modal task, the core challenge of visual question answering (VQA) is the fusion of different modalities. It requires not only a full understanding of the visual content of the image and the text of the question, but also the ability to align cross-modal representations. The attention mechanism provides an effective path for multi-modal fusion. However, previous methods usually operate directly on the extracted image features, ignoring the noise and incorrect information they contain, and most of them are limited to shallow interaction between modalities without modeling deeper cross-modal semantic information. To address this problem, a cross-modal information filtering network (CIFN) is proposed. First, the question feature is taken as a supervision signal, and an information filtering module is designed to filter the image feature information so that it better fits the question representation. Then the image features and question features are fed into cross-modal interaction layers, where intra-modal and inter-modal relationships are modeled by self-attention and guided attention respectively, yielding more fine-grained multi-modal features. Extensive experiments have been conducted on the VQA 2.0 dataset, and the results show that the information filtering module effectively improves model accuracy; the overall accuracy on test-std reaches 71.51%, which is competitive with state-of-the-art methods.
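The abstract only describes the mechanism at a high level, so the following is a minimal sketch, not the authors' released code, of the two ideas it names: (1) a question-guided information filtering module that gates image region features before fusion, and (2) a cross-modal interaction layer combining intra-modal self-attention with question-guided attention. The dimensions, layer structure, and the sigmoid-gating form are illustrative assumptions.

```python
# Sketch of question-guided information filtering and cross-modal interaction.
# All hyperparameters (dim=512, heads=8) and the gating form are assumptions.
import torch
import torch.nn as nn


class InformationFilter(nn.Module):
    """Filter image region features under the supervision of the question feature."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_feats, q_feat):
        # img_feats: (B, num_regions, dim); q_feat: (B, dim)
        q_exp = q_feat.unsqueeze(1).expand_as(img_feats)
        g = self.gate(torch.cat([img_feats, q_exp], dim=-1))  # per-region, per-channel gate
        return g * img_feats  # suppress question-irrelevant image information


class CrossModalLayer(nn.Module):
    """Intra-modal self-attention on each modality plus guided attention across modalities."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.guided_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q_tokens, img_feats):
        # Intra-modal relationships within each modality
        q_tokens, _ = self.self_attn_q(q_tokens, q_tokens, q_tokens)
        img_feats, _ = self.self_attn_v(img_feats, img_feats, img_feats)
        # Inter-modal relationship: image regions attend to question tokens
        img_feats, _ = self.guided_attn(img_feats, q_tokens, q_tokens)
        return q_tokens, img_feats


if __name__ == "__main__":
    # Toy shapes: 14 question tokens, 36 image regions, hidden size 512
    q_tokens = torch.randn(2, 14, 512)
    img_feats = torch.randn(2, 36, 512)
    filtered = InformationFilter()(img_feats, q_tokens.mean(dim=1))
    q_out, v_out = CrossModalLayer()(q_tokens, filtered)
    print(q_out.shape, v_out.shape)  # torch.Size([2, 14, 512]) torch.Size([2, 36, 512])
```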