OCRBench: on the hidden mystery of OCR in large multimodal models

Large models have recently played a dominant role in natural language processing and multi-modal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of large multimodal models, such as GPT4V and Gemini, in various text-related visual tasks including text recognition, scene text-centric visual question answering (VQA), document-oriented VQA, key information extraction (KIE), and handwritten mathematical expression recognition (HMER). To facilitate the assessment of optical character recognition (OCR) capabilities in large multimodal models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.

Keywords: large multimodal model; OCR; text recognition; scene text-centric VQA; document-oriented VQA; key information extraction; handwritten mathematical expression recognition

Yuliang LIU, Zhang LI, Mingxin HUANG, Biao YANG, Wenwen YU, Chunyuan LI, Xu-Cheng YIN, Cheng-Lin LIU, Lianwen JIN, Xiang BAI

School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China

School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China

Microsoft Research, Washington 20237, USA

School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China

Institute of Automation, Chinese Academy of Sciences, Beijing 101408, China

School of Software Engineering, Huazhong University of Science and Technology, Wuhan 430074, China

2024

Science China Information Sciences
Chinese Academy of Sciences

Indexed in: CSTPCD; EI
Impact factor: 0.715
ISSN: 1674-733X
Year, volume (issue): 2024, 67(12)