Review of Pre-training Methods for Visually-rich Document Understanding
Visually-rich documents (VrDs) are documents whose semantic structures are determined not only by their textual content but also by visual elements such as typesetting formats and table structures. Numerous application scenarios, such as receipt understanding and card recognition, require automatically reading, analyzing, and processing VrDs (e.g., forms, invoices, and resumes). This process is called visually-rich document understanding (VrDU), a cross-field between natural language processing (NLP) and computer vision (CV). Recently, self-supervised pre-training techniques for VrDU have made significant progress in breaking down the training barriers between downstream tasks and improving model performance. However, a comprehensive summary and in-depth analysis of pre-training models for VrDU is still lacking. To this end, we conduct an in-depth investigation and comprehensive summary of pre-training techniques for VrDU. First, we introduce the data processing stage of pre-training technology, including the traditional pre-training datasets and optical character recognition (OCR) engines. Then, we discuss three key technique modules in the model pre-training stage, namely single-modality representation learning, multi-modal feature fusion, and pre-training tasks. Meanwhile, the similarities and differences between the pre-training models are elaborated on the basis of these three modules. In addition, we briefly introduce the multi-modal large models applied in VrDU. Furthermore, we analyze the experimental results of pre-training models on three representative downstream tasks. Finally, the challenges and future research directions related to pre-training models are pointed out.
Document intelligence · Pre-training models · Natural language processing · Computer vision · Deep learning