Abstract
Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conduct a comprehensive evaluation of large multimodal models, such as GPT-4V and Gemini, on a variety of text-related visual tasks, including text recognition, scene text-centric visual question answering (VQA), document-oriented VQA, key information extraction (KIE), and handwritten mathematical expression recognition (HMER). To facilitate the assessment of optical character recognition (OCR) capabilities in large multimodal models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expressions. Most importantly, the baseline results presented in this study provide a foundation for designing and assessing innovative strategies aimed at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.