
Multimodal cascaded document layout analysis network based on Transformer

Existing methods suffer from misaligned embeddings between the pretraining objectives of the text and image modalities, and they preprocess document images with convolutional neural network (CNN) structures, which complicates the pipeline and inflates the number of model parameters. To address these problems, a Transformer-based multimodal cascaded document layout analysis network (MCOD-Net) was proposed. A word-block alignment embedding module (WAEM) was designed to align the embeddings of the pretraining objectives of the text and image modalities. Masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA) were used for pretraining to strengthen the model's representation learning in both modalities. The original document images were used directly and represented by the linear projection features of image patches, which simplified the model structure and reduced the number of model parameters. Experimental results show that the proposed model achieves a mean average precision (mAP) of 95.1% on the public PubLayNet dataset. Compared with other models, overall performance improved by 2.5%, with outstanding generalization ability and the best comprehensive performance.
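The patch-based image representation mentioned in the abstract can be illustrated with a minimal sketch. It assumes ViT-style non-overlapping patches that are flattened and passed through a single linear layer. The function names `patchify` and `linear_projection`, the toy sizes and the random weights are hypothetical and are not taken from the MCOD-Net implementation.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values of this pixel
            patches.append(flat)
    return patches


def linear_projection(patches, weight, bias):
    """Map each flattened patch (length P*P*C) to a D-dimensional embedding.

    `weight` holds D rows of length P*P*C; `bias` holds D values.
    """
    return [
        [sum(v * w for v, w in zip(p, row)) + b for row, b in zip(weight, bias)]
        for p in patches
    ]


# Toy sizes, chosen only for illustration (assumption, not from the paper).
rng = random.Random(0)
H, W, C, P, D = 4, 6, 3, 2, 8
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

patches = patchify(image, P)
embeddings = linear_projection(patches, weight, bias)
print(len(embeddings), len(embeddings[0]))  # 6 8
```

Each of the (H/P) x (W/P) patches becomes one token embedding, so no convolutional backbone is needed before the Transformer. In a trained model the weights would be learned rather than random.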

document layout analysis; word-block alignment embedding; Transformer; MCOD-Net model

Wen Shaojie (温绍杰), Wu Ruigang (吴瑞刚), Feng Chaowen (冯超文), Liu Yingli (刘英莉)


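Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China

Yunnan Key Laboratory of Computer Technology Application, Kunming University of Science and Technology, Kunming 650500, Yunnan, China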


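National Natural Science Foundation of China (52061020, 61971208); Open Fund of the Yunnan Key Laboratory of Computer Technology Application (2020103); Yunnan Major Science and Technology Project (202302AG050009)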

2024

Journal of Zhejiang University (Engineering Science)
Zhejiang University

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.625
ISSN: 1008-973X
Year, volume (issue): 2024, 58(2)