Review of Pre-training Methods for Visually-rich Document Understanding
Visually-rich documents (VrDs) are documents whose semantic structures are determined not only by their textual content but also by visual elements such as typesetting formats and table structures. Numerous application scenarios, such as receipt understanding and card recognition, require automatically reading, analyzing, and processing VrDs (e.g., forms, invoices, and resumes). This process is called visually-rich document understanding (VrDU), a cross-field between natural language processing (NLP) and computer vision (CV). Recently, self-supervised pre-training techniques for VrDU have made significant progress in breaking down the training barriers between downstream tasks and improving model performance. However, a comprehensive summary and in-depth analysis of pre-training models for VrDU is still lacking. To this end, we conduct an in-depth investigation and comprehensive summary of pre-training techniques for VrDU. First, we introduce the data processing stage of pre-training technology, including the traditional pre-training datasets and optical character recognition (OCR) engines. Then, we discuss three key technique modules in the model pre-training stage, namely single-modality representation learning, multi-modal feature fusion, and pre-training tasks. Meanwhile, the similarities and differences between the pre-training models are elaborated on the basis of these three modules. In addition, we briefly introduce the multi-modal large models applied in VrDU. Furthermore, we analyze the experimental results of pre-training models on three representative downstream tasks. Finally, the challenges and future research directions related to pre-training models are pointed out.
Document intelligence · Pre-training models · Natural language processing · Computer vision · Deep learning