

Review of Pre-training Methods for Visually-rich Document Understanding
Visually-rich document (VrD) refers to a document whose semantic structures are related to visual elements, such as type-setting formats and table structures, in addition to being determined by the textual content. Numerous application scenarios, such as receipt understanding and card recognition, require automatically reading, analyzing and processing VrDs (e.g., forms, invoices, and resumes). This process is called visually-rich document understanding (VrDU), which lies at the intersection of natural language processing (NLP) and computer vision (CV). Recently, self-supervised pre-training techniques for VrDU have made significant progress in breaking down the training barriers between downstream tasks and improving model performance. However, a comprehensive summary and in-depth analysis of the pre-training models for VrDU is still lacking. To this end, we conduct an in-depth investigation and comprehensive summary of pre-training techniques for VrDU. Firstly, we introduce the data processing stage of pre-training technology, including the traditional pre-training datasets and optical character recognition (OCR) engines. Then, we discuss three key technique modules in the model pre-training stage, namely single-modality representation learning, multi-modal feature fusion, and pre-training tasks. Meanwhile, the similarities and differences between the pre-training models are elaborated on the basis of the above three modules. In addition, we briefly introduce the multi-modal large models applied in VrDU. Furthermore, we analyze the experimental results of pre-training models on three representative downstream tasks. Finally, the challenges and future research directions related to the pre-training models are pointed out.

Document intelligence; Pre-training models; Natural language processing; Computer vision; Deep learning

ZHANG Jian, LI Hui, ZHANG Shengming, WU Jie, PENG Ying


School of Cyber Engineering, Xidian University, Xi'an 710071, China

China Electronics Technology Cyber Security Co., Ltd., Chengdu 610095, China


2025

Computer Science
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)

Indexed in: PKU Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2025, 52(1)