Journal information
International journal on document analysis and recognition: IJDAR

Springer-Verlag

Frequency: Quarterly

ISSN: 1433-2833

Indexed in: ISTP, AHCI, SCI
Status: Officially published

Indexed years

    Robust page object detection network for heterogeneous document images

    Hadia Showkat Kawoosa, Muhammad Suhaib Kanroo, Kapil Rana, Puneet Goyal, et al.
    pp. 143-159
    Abstract: Document Layout Analysis (DLA) has emerged as a challenging problem in the field of computer vision. The primary goal of DLA is to identify page objects, including tables, figures, images, and equations, in document images. In this paper, we propose a Lightweight and Robust Page Object Detection Network (LR-PODNet) for page object detection (POD) from heterogeneous document images. The proposed network improves the object detection capabilities of the YOLOv5 model by integrating two components for POD: a Convolutional Global Attention Block (C3-AB) and a Hybrid Dilated Atrous spatial pyramid pooling Block (HDAB). The C3-AB is an enhanced version of the C3 module of YOLOv5 that incorporates a global attention block in place of the bottleneck-CSP block; it strengthens the model's ability to capture global features and suppresses redundant content. The output of the C3-AB is passed to the HDAB to extract both local and contextual features. The HDAB replaces the SPPF module within the YOLOv5 architecture to enhance multi-scale feature extraction. Experimental results show that the proposed LR-PODNet outperforms existing methods, achieving mAP@0.5:0.95 scores of 77.5% and 76.2% on the IIIT-AR-13K and NCERT5K-IITRPR datasets, respectively. We also evaluated the robustness of the proposed model on these two datasets by varying the IoU threshold.
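The robustness evaluation above varies the IoU threshold used to score detections, and mAP@0.5:0.95 is the COCO-style average of AP over ten IoU thresholds. As a minimal, framework-free illustration of the metric's building blocks (not the paper's evaluation code; the (x1, y1, x2, y2) box format is an assumption):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # clamp to zero when the boxes do not overlap
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def coco_thresholds():
    """The ten IoU thresholds averaged by mAP@0.5:0.95 (0.50, 0.55, ..., 0.95)."""
    return [0.5 + 0.05 * i for i in range(10)]
```

A detection counts as a true positive only when its IoU with a ground-truth box exceeds the threshold, so sweeping the threshold upward probes how tightly the predicted boxes fit.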

    In-domain versus out-of-domain transfer learning for document layout analysis

    Axel De Nardin, Silvia Zottin, Claudio Piciarelli, Gian Luca Foresti, et al.
    pp. 161-175
    Abstract: Data availability is a major concern in the field of document analysis, especially for tasks that demand highly precise ground truths for training deep learning models. A notable example is document layout analysis in handwritten documents, which requires pixel-precise segmentation maps highlighting the different layout components of each document page. These segmentation maps are typically very time-consuming to define and require a high degree of domain knowledge, as they are intrinsically tied to the content of the text. For this reason, in the present work we explore the effects of different initialization strategies for deep learning models applied to this task, relying on both in-domain and cross-domain datasets for pre-training. To test the models we use two publicly available datasets with heterogeneous characteristics regarding both their structure and the languages of the documents they contain. We show that a combination of cross-domain and in-domain transfer learning leads to the best overall model performance and also speeds up convergence.
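The initialization strategies compared above (random, cross-domain, in-domain, and their combination) amount to choosing which checkpoint a model starts fine-tuning from. A toy sketch with weights as plain dicts; the checkpoint names and the staging order are illustrative assumptions, not the paper's implementation:

```python
def init_weights(strategy, checkpoints, random_init):
    """Pick starting weights for fine-tuning.

    strategy: 'random', 'cross-domain', 'in-domain', or 'cross+in'
    (cross-domain pre-training followed by in-domain pre-training).
    """
    order = {
        "random": [],
        "cross-domain": ["cross"],
        "in-domain": ["in"],
        "cross+in": ["cross", "in"],
    }[strategy]
    weights = dict(random_init)   # start from random initialization
    for stage in order:           # each pre-training stage overrides the last
        weights.update(checkpoints[stage])
    return weights
```

The "cross+in" path mirrors the paper's best-performing recipe: generic cross-domain pre-training first, then in-domain pre-training, then fine-tuning on the target documents.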

    Subword recognition in historical Arabic manuscripts using handcrafted features and deep learning approaches

    Mohamed Dahbali, Noureddine Aboutabit, Nidal Lamghari
    pp. 177-193
    Abstract: Recent years have seen significant efforts to improve handwriting recognition systems and digitize historical manuscripts. Nevertheless, recognizing historical Arabic manuscripts remains a considerable challenge. The purpose of this study is to investigate subword recognition in historical Arabic manuscripts. Two systems are established. The first uses a variety of handcrafted feature methods with diverse machine learning algorithms. The second uses a deep learning architecture that integrates a convolutional neural network and a bidirectional long short-term memory network, based on a character-model approach with connectionist temporal classification as the decoder. On the IBN SINA dataset, the histogram of oriented gradients descriptor demonstrated superior performance in the first system, while the second system achieved notable results. The findings of this study provide a framework for the development of historical manuscript recognition systems.
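The second system decodes with connectionist temporal classification (CTC). Its simplest decoding rule, best-path (greedy) decoding, collapses repeated per-frame labels and then drops blanks. A minimal sketch, assuming integer class labels with blank = 0 (the exact label set is not given in the abstract):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """CTC best-path decoding: collapse repeats, then remove blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

The blank symbol is what lets CTC emit genuine doubled characters: a blank between two identical frame labels separates them into two output characters instead of one collapsed repeat.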

    Efficient title text detection using multi-loss

    Shitala Prasad, Anuj Abraham
    pp. 195-205
    Abstract: YouTube's "Video Chapter" feature segments videos into sections marked by timestamps on the slider, enhancing user navigation. Given the vast volume of video data, processing it efficiently demands substantial time and computational resources. This paper addresses two key objectives: reducing the computational cost of training deep text detection models and enhancing overall performance with minimal effort. We introduce a classroom-based multi-loss learning approach for text detection and extend it to title detection without requiring annotations. In deep learning, loss functions play a crucial role in updating model weights; our proposed multi-loss functions facilitate faster convergence than baseline methods. Additionally, we present a novel technique for handling annotation-less data that employs a text grouping method to differentiate between regular text and title text. Experimental results on the COCO-Text and Slidin' Videos AI-5G Challenge datasets demonstrate the efficacy and practicality of our approach.
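The abstract does not spell out the text grouping method, but the general idea of separating title text from body text without annotations can be sketched with a hypothetical geometric heuristic: treat detected boxes noticeably taller than the median text height as titles. The 1.5x ratio and the (x1, y1, x2, y2) box format are assumptions for illustration only:

```python
def split_title_text(boxes, ratio=1.5):
    """Hypothetical grouping heuristic: boxes taller than `ratio` times
    the median box height are labeled titles; the rest are body text.
    boxes: list of (x1, y1, x2, y2)."""
    heights = sorted(b[3] - b[1] for b in boxes)
    median = heights[len(heights) // 2]
    titles = [b for b in boxes if (b[3] - b[1]) > ratio * median]
    body = [b for b in boxes if (b[3] - b[1]) <= ratio * median]
    return titles, body
```

A rule like this supplies weak title/body labels for free, which is what allows title detection to be trained without manual annotations.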

    Unpaired document image denoising for OCR using BiLSTM enhanced CycleGAN

    Katyani Singh, Ganesh Tata, Eric Van Oeveren, Nilanjan Ray, et al.
    pp. 207-224
    Abstract: The recognition performance of optical character recognition (OCR) models can be sub-optimal when document images suffer from various degradations. Supervised deep learning methods for image enhancement can generate high-quality enhanced images, but they require corresponding clean images or ground-truth text, a requirement that is often difficult to fulfill for real-world noisy documents. For instance, it can be challenging to create paired noisy/clean training datasets or obtain ground-truth text for noisy point-of-sale receipts and invoices. Unsupervised methods have been explored in recent years to enhance images in the absence of ground-truth images or text, but they focus on natural scene images. For document images, preserving the readability of text in the enhanced images is of utmost importance for improved OCR performance. In this work, we propose a modified CycleGAN architecture that enhances document images with better text preservation. Inspired by the success of CNN-BiLSTM networks in text recognition models, we replace the discriminator network of the CycleGAN model with a combined CNN-BiLSTM network for better feature extraction from document images during discrimination. The results demonstrate that the proposed model significantly improves text preservation and OCR performance compared to the standard CycleGAN discriminator. Specifically, when assessing the Tesseract engine's word accuracy on real-world noisy receipt images from the POS dataset, the proposed model achieved an improvement of up to 61.66% over the original CycleGAN model and 23.32% over the original noisy receipt images. The proposed model also consistently outperformed other unsupervised classical techniques across all OCR engines evaluated.
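What makes CycleGAN work without paired noisy/clean images is its cycle-consistency term: mapping an image noisy→clean→noisy should reproduce the original. A sketch of that term alone, with images flattened to pixel lists; the L1 distance and the λ = 10 weight follow the common CycleGAN defaults and are assumptions here, not values from the paper:

```python
def cycle_consistency_loss(x, x_reconstructed, lam=10.0):
    """Weighted mean-L1 cycle loss: lam * mean(|x - F(G(x))|),
    where G maps noisy->clean and F maps clean->noisy."""
    n = len(x)
    return lam * sum(abs(a - b) for a, b in zip(x, x_reconstructed)) / n
```

This term is added to the adversarial losses of both generators; the paper's contribution sits in the discriminator (a CNN-BiLSTM in place of the usual patch CNN), which changes what the adversarial signal rewards rather than the cycle term itself.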