To address the poor performance of visual question answering (VQA) models on tasks requiring external knowledge, a framework integrating cross-modal Transformers was constructed for external knowledge-based VQA. By introducing an external knowledge base alongside the VQA model, the model's reasoning ability on external knowledge-based tasks was improved. Further, the model employed a bidirectional cross-attention mechanism to enhance the semantic interaction and fusion of text questions and images, mitigating the insufficient reasoning ability commonly observed in VQA models when external knowledge is required. The results show that, on the OK-VQA dataset, the overall performance index of the proposed model improves by 15.01% over the baseline model LXMERT, and by 4.46% over the latest existing model. The proposed model therefore improves the performance of external knowledge-based VQA tasks.
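To make the bidirectional cross-attention idea concrete, the sketch below shows the general mechanism in pure Python: text features attend over image features and image features attend over text features, so each modality is updated with information from the other. This is a minimal illustration of generic scaled dot-product cross-attention, not the paper's actual implementation; the function names, the tiny feature vectors, and the simplification of using the same vectors as both keys and values are assumptions for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys_values, d):
    # For each query q_i from one modality, attend over the other
    # modality's features: out_i = sum_j softmax_j(q_i . k_j / sqrt(d)) * v_j.
    # Keys and values are the same vectors here (a toy simplification;
    # real models apply learned projections first).
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[t] for w, v in zip(weights, keys_values))
                    for t in range(d)])
    return out

def bidirectional_cross_attention(text_feats, image_feats, d):
    # Bidirectional: text attends to image features AND image attends
    # to text features, fusing the two modalities in both directions.
    text_out = cross_attention(text_feats, image_feats, d)
    image_out = cross_attention(image_feats, text_feats, d)
    return text_out, image_out

# Toy example: 2 text-token features and 3 image-region features of dim 2.
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
t_out, i_out = bidirectional_cross_attention(text, image, 2)
```

Each output sequence keeps its own length (2 text outputs, 3 image outputs) while every vector becomes a weighted mixture of the other modality's features, which is the interaction that the framework strengthens before answer prediction.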