Multimodal cascaded document layout analysis network based on Transformer
A Transformer-based multimodal cascaded document layout analysis network (MCOD-Net) was proposed to address the misalignment between the pretraining objectives of the text and image modalities in existing methods, which rely on complex preprocessing of document images with convolutional neural network (CNN) structures and therefore carry a large number of model parameters. A word-block alignment embedding module (WAEM) was introduced to align the pretraining objectives of the text and image modalities. Masked language modeling (MLM), masked image modeling (MIM), and word-patch alignment (WPA) were used for pretraining to strengthen the model's representation learning across the text and image modalities. The model structure was simplified, and the number of model parameters reduced, by operating directly on the original document images and representing them with linearly projected features of image patches. Experimental results show that the proposed model achieves a mean average precision (mAP) of 95.1% on the publicly available PubLayNet dataset, a 2.5% overall performance improvement, with strong generalization ability and the best comprehensive performance among the compared models.
document layout analysis; word-block alignment embedding; Transformer; MCOD-Net model
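The abstract notes that the model avoids a CNN backbone by representing the raw document image with linearly projected features of image patches. A minimal sketch of such a patch-projection embedding is shown below; it is not the authors' implementation, and the patch size, embedding dimension, and image size are assumed values chosen only for illustration.

```python
# Illustrative sketch (assumed hyperparameters, not the MCOD-Net code):
# split a document image into fixed-size patches and apply a single linear
# projection, in place of CNN feature extraction.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution with kernel = stride = patch_size is
        # mathematically a per-patch linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W) document image
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, embed_dim)
        return x


# Usage: two 224x224 document images -> 196 patch embeddings each.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

These patch embeddings would then be fed to the Transformer together with the text embeddings for the MLM, MIM, and WPA pretraining objectives described above.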