Traditional Visual Question Answering (VQA) focuses only on the visual objects in an image and ignores the text the image contains. In addition to visual information, Text-based Visual Question Answering (TextVQA) also attends to the text in the image, allowing questions to be answered more accurately and efficiently. In recent years, TextVQA has become a focal point of multimodal research, with important application prospects in text-rich scenes such as autonomous driving and scene understanding. This paper describes the concept of TextVQA along with its open problems and challenges, and systematically analyzes the task in terms of methods, datasets, and future research directions. The study focuses on existing TextVQA methods and organizes them into three stages: feature extraction, feature fusion, and answer prediction. According to the techniques used in the fusion stage, the methods are described from three perspectives: simple attention, Transformer-based, and pre-training approaches. The advantages and disadvantages of the different methods are summarized, and their performance on public datasets is analyzed and compared. Four common public datasets are introduced, and their characteristics and evaluation metrics are analyzed. Finally, the paper discusses the problems and challenges facing the TextVQA task and outlines future research directions.
Keywords: Text-based Visual Question Answering (TextVQA); text information; natural language processing; computer vision; multimodal fusion
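To make the three-stage pipeline named in the abstract concrete, the sketch below shows one way the stages could fit together in PyTorch: pre-extracted visual object features, OCR token features, and question embeddings are projected into a shared space (feature extraction), fused with a Transformer encoder (the Transformer-based fusion family), and scored against both a fixed answer vocabulary and the OCR tokens (answer prediction). This is a minimal illustrative sketch, not any specific published model; all module names, dimensions, and the pooling choice are assumptions.

```python
# Minimal sketch of the three-stage TextVQA pipeline described above:
# feature extraction -> feature fusion -> answer prediction.
# All names and dimensions are hypothetical, not from a specific model.
import torch
import torch.nn as nn


class TextVQASketch(nn.Module):
    def __init__(self, d_model=768, vocab_size=5000, num_layers=4):
        super().__init__()
        # Stage 1: project pre-extracted features into a shared space.
        self.obj_proj = nn.Linear(2048, d_model)       # detector features
        self.ocr_proj = nn.Linear(300, d_model)        # OCR token features
        self.txt_embed = nn.Embedding(30000, d_model)  # question words
        # Stage 2: fuse all modalities with a Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Stage 3: score a fixed answer vocabulary; OCR tokens are scored
        # separately so the model can "copy" scene text as the answer.
        self.vocab_head = nn.Linear(d_model, vocab_size)

    def forward(self, question_ids, obj_feats, ocr_feats):
        q = self.txt_embed(question_ids)   # (B, Lq, d)
        v = self.obj_proj(obj_feats)       # (B, No, d)
        t = self.ocr_proj(ocr_feats)       # (B, Nt, d)
        fused = self.fusion(torch.cat([q, v, t], dim=1))
        pooled = fused[:, 0]               # first token as a summary state
        vocab_scores = self.vocab_head(pooled)
        # Copy scores: similarity between the summary and fused OCR tokens.
        ocr_scores = torch.einsum(
            "bd,bnd->bn", pooled, fused[:, -ocr_feats.size(1):])
        return torch.cat([vocab_scores, ocr_scores], dim=1)
```

At prediction time the highest-scoring entry is taken either from the fixed vocabulary or from the OCR tokens, which reflects the common observation that TextVQA answers often appear verbatim as scene text in the image.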