Context-aware Multi-modality Interactive Network for Visual Question Answering
In recent years, visual question answering has attracted great attention. Existing methods capture high-level semantic information through intensive interaction between the vision and language modalities. However, these methods consider only the relationships between words and visual regions, ignoring the contextual information needed to calculate the dependencies between the modalities. This paper proposes a context-aware multi-modality interactive network, which improves the reasoning ability of visual question answering by modeling intra- and inter-modality dependencies. A series of comparative and ablation experiments on the large-scale benchmark VQA v2.0 shows that this method achieves better accuracy than state-of-the-art methods on visual question answering.
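To make the notion of intra- and inter-modality dependencies concrete, the following is a minimal PyTorch sketch of one interaction block that applies self-attention within each modality and cross-attention from visual regions to question words. The abstract does not specify the actual architecture, so the block name `IntraInterBlock`, the dimensions, and the use of standard multi-head attention are illustrative assumptions rather than the paper's method.

```python
# Minimal sketch of intra- and inter-modality dependency modeling.
# Assumptions: standard multi-head attention, feature dim 512, 8 heads;
# the paper's actual design may differ.
import torch
import torch.nn as nn


class IntraInterBlock(nn.Module):
    """Self-attention within each modality (intra), then cross-attention
    from visual regions to question words (inter)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_x = nn.LayerNorm(dim)

    def forward(self, q_feats: torch.Tensor, v_feats: torch.Tensor):
        # Intra-modality: each word attends to the other words,
        # and each region attends to the other regions.
        q_feats = self.norm_q(q_feats + self.self_attn_q(q_feats, q_feats, q_feats)[0])
        v_feats = self.norm_v(v_feats + self.self_attn_v(v_feats, v_feats, v_feats)[0])
        # Inter-modality: regions attend to question words (cross-attention).
        v_feats = self.norm_x(v_feats + self.cross_attn(v_feats, q_feats, q_feats)[0])
        return q_feats, v_feats


if __name__ == "__main__":
    words = torch.randn(2, 14, 512)    # [batch, question words, dim]
    regions = torch.randn(2, 36, 512)  # [batch, visual regions, dim]
    q_out, v_out = IntraInterBlock()(words, regions)
    print(q_out.shape, v_out.shape)    # (2, 14, 512) and (2, 36, 512)
```

In practice, several such blocks would typically be stacked so that context accumulated within each modality informs the cross-modal attention at later layers; this stacking depth is likewise an assumption here.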