SCIENCE CHINA Information Sciences (English edition), 2024, Vol. 67, Issue 12: 19-31. DOI: 10.1007/s11432-024-4235-6

OCRBench:on the hidden mystery of OCR in large multimodal models

Yuliang LIU¹, Zhang LI¹, Mingxin HUANG², Biao YANG¹, Wenwen YU¹, Chunyuan LI³, Xu-Cheng YIN⁴, Cheng-Lin LIU⁵, Lianwen JIN², Xiang BAI⁶
Author Information

  • 1. School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
  • 2. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China
  • 3. Microsoft Research, Washington 20237, USA
  • 4. School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
  • 5. Institute of Automation, Chinese Academy of Sciences, Beijing 101408, China
  • 6. School of Software Engineering, Huazhong University of Science and Technology, Wuhan 430074, China

Abstract

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of large multimodal models, such as GPT-4V and Gemini, in various text-related visual tasks including text recognition, scene text-centric visual question answering (VQA), document-oriented VQA, key information extraction (KIE), and handwritten mathematical expression recognition (HMER). To facilitate the assessment of optical character recognition (OCR) capabilities in large multimodal models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.
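To illustrate the kind of evaluation pipeline the abstract describes, the following is a minimal, hypothetical sketch (not the official OCRBench implementation): a model prediction is counted as correct when any reference answer appears, case-insensitively, in the model's output, and accuracy is then aggregated per task. The function names and sample format are assumptions for illustration only.

```python
def score_sample(prediction: str, answers: list[str]) -> bool:
    """Return True if any reference answer occurs in the prediction (case-insensitive)."""
    pred = prediction.strip().lower()
    return any(ans.strip().lower() in pred for ans in answers)

def score_by_task(samples: list[dict]) -> dict[str, float]:
    """Aggregate per-task accuracy; each sample has 'task', 'prediction', 'answers'."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for s in samples:
        t = s["task"]
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + int(score_sample(s["prediction"], s["answers"]))
    return {t: correct[t] / totals[t] for t in totals}

# Toy examples covering two of the task categories named in the abstract.
samples = [
    {"task": "text recognition", "prediction": "The sign reads STOP.", "answers": ["stop"]},
    {"task": "KIE", "prediction": "Total: 12.50", "answers": ["12.50"]},
    {"task": "KIE", "prediction": "unknown", "answers": ["acme corp"]},
]
print(score_by_task(samples))  # → {'text recognition': 1.0, 'KIE': 0.5}
```

Containment matching of this kind is a common, lenient choice for free-form model outputs; the actual benchmark's matching rules may differ.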

Key words

large multimodal model; OCR; text recognition; scene text-centric VQA; document-oriented VQA; key information extraction; handwritten mathematical expression recognition

Publication Year: 2024
Journal: SCIENCE CHINA Information Sciences (English edition)
Publisher: Chinese Academy of Sciences
Indexed in: CSTPCD, EI
Impact Factor: 0.715
ISSN: 1674-733X