Domain-adaptive industrial scene text detection combining text self-training and adversarial learning

Text self-training and adversarial learning-relevant domain adaptive industrial scene text detection

吕学强 (Lü Xueqiang)¹, 权伟杰 (Quan Weijie)¹, 韩晶 (Han Jing)¹, 陈玉忠 (Chen Yuzhong)², 才藏太 (Cai Zangtai)²


Author affiliations

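1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
2. State Key Laboratory of Tibetan Intelligent Information Processing and Application (jointly built by the province and the ministry), Qinghai Normal University, Xining 810008, China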

Abstract (Chinese version)

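Objective Rapidly detecting text in industrial scenes can improve production efficiency and reduce costs. However, data annotation is time-consuming and labor-intensive, so little labeled data is available. When applied to industrial data, current methods suffer from low-quality pseudo labels and large domain gaps. To address these problems, this paper proposes a domain-adaptive industrial scene text detection method that combines text self-training and adversarial learning.

Method First, to address the low quality of pseudo labels, a teacher-student framework is used for text self-training. The teacher and student models apply data augmentation and mutual learning to mitigate domain shift and improve pseudo-label quality. Second, to address the domain gap, image-level and instance-level adversarial learning modules are proposed to align the feature distributions of the source and target domains, so that the network learns domain-invariant features. Finally, consistency regularization is applied between the two adversarial learning modules to further narrow the domain gap and improve the model's domain adaptability.

Result Experiments show that on the industrial nameplate dataset the method achieves a precision, recall, and F1 score of 96.2%, 95.0%, and 95.6%, respectively, which are 10%, 15.3%, and 12.8% higher than those of the baseline model. It also performs well on the ICDAR15 and MSRA-TD500 datasets, improving F1 by 0.9% and 3.1%, respectively, over state-of-the-art methods. In addition, when the method is applied to the EAST (efficient and accurate scene text detector) text detection model, precision, recall, and F1 on the nameplate dataset improve by 5%, 11.8%, and 9.5%, respectively.

Conclusion The proposed method successfully narrows the gap between source-domain and target-domain data, significantly improves the model's generalization ability, and is broadly applicable to other detectors, while adding no computational cost at inference time.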
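A minimal sketch of the teacher-student self-training step follows. It assumes, as the English abstract states, that the teacher is the exponential moving average (EMA) of the student, and further assumes that pseudo labels are obtained by thresholding the teacher's per-pixel text probability map (DBNet++ predicts such a map). The decay `alpha`, the thresholds `low`/`high`, and the policy of marking uncertain pixels as ignored (-1) are illustrative assumptions, not values or rules taken from the paper; weights are plain floats in a dict to keep the example dependency-free.

```python
from typing import Dict, List

Params = Dict[str, float]


def ema_update(teacher: Params, student: Params, alpha: float = 0.999) -> Params:
    """Return updated teacher weights: alpha * teacher + (1 - alpha) * student.

    The teacher receives no gradients; it only tracks the student smoothly.
    alpha = 0.999 is an assumed, commonly used decay, not the paper's value.
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name] for name in teacher}


def pseudo_label(prob_map: List[List[float]], low: float = 0.3, high: float = 0.7) -> List[List[int]]:
    """Turn the teacher's per-pixel text probabilities into a pseudo-label mask.

    1  = confident text (p >= high)
    0  = confident background (p <= low)
    -1 = uncertain pixel, ignored by the student's loss (hypothetical policy)
    """
    labels = []
    for row in prob_map:
        labels.append([1 if p >= high else 0 if p <= low else -1 for p in row])
    return labels


if __name__ == "__main__":
    teacher = ema_update({"w": 1.0}, {"w": 0.0}, alpha=0.9)
    print(teacher)  # {'w': 0.9}
    print(pseudo_label([[0.9, 0.5, 0.1]]))  # [[1, -1, 0]]
```

In a training loop, the student would be optimized on labeled source images plus the target-domain pseudo labels, and `ema_update` would then refresh the teacher after each step, so pseudo-label quality can improve as the student improves.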

Abstract

Objective The surface of industrial equipment records important information, such as the equipment model, specifications, and functions, which is crucial for equipment management and maintenance. Traditional information collection relies on workers taking photos and recording the information manually, which is inefficient and can hardly meet current requirements for high-efficiency, low-cost production. Using scene text detection to detect text in industrial scenes automatically can improve production efficiency and cost effectiveness, which is crucial for industrial intelligence and automation. The success of scene text detection algorithms relies heavily on the availability of large-scale, high-quality annotated data. However, in industrial scenes, data collection and annotation are time-consuming and labor-intensive, so only a small amount of data is available and it carries no annotations, which severely limits model performance. Furthermore, substantial domain gaps exist between the "source domain" (public data) and the "target domain" (industrial scene data), making it difficult for models trained on public datasets to generalize directly to industrial scene text detection. We therefore focus on domain-adaptive scene text detection algorithms. However, when applied to industrial scene text detection, existing methods encounter the following problems: 1) Image translation methods achieve domain adaptation by generating images similar to those of the target domain, but they focus on adapting low-frequency appearance information and are not effective for text detection. 2) The pseudo labels generated by self-training methods are of low quality and cannot be adaptively improved during training, which limits the model's domain adaptability. 3) Adversarial feature alignment methods disregard the influence of background noise and cannot effectively mitigate domain gaps. To address these issues, we propose DA-DB++, which stands for domain-adaptive differentiable binarization, a domain-adaptive industrial scene text detection method based on text self-training and adversarial learning.

Method In this study, we address the problems of low-quality pseudo labels and domain gaps. First, we introduce a teacher-student self-training framework. Data augmentation and mutual learning between the teacher and student models enhance the robustness of the model, reduce domain bias, and gradually generate high-quality pseudo labels during training. Specifically, the teacher model generates pseudo labels for target-domain data, while the student model is trained on source-domain data and these pseudo labels. The teacher model is updated as the exponential moving average of the student model. Second, to address the large domain gap, we propose image-level and instance-level adversarial learning modules in the student model. These modules align the feature distributions of the source and target domains so that the network learns domain-invariant features. Specifically, an image-level alignment module is added after the feature extraction network, and a coordinate attention mechanism aggregates features along the horizontal and vertical spatial directions to improve the extraction of global features. This reduces shifts caused by global image differences, such as image style and shape. Aligning high-level semantic features helps the model learn better feature representations, effectively reduces domain gaps, and improves the model's generalization ability. Instance-level alignment is implemented by using text labels as masks to filter features. This forces the network to focus on text regions and suppresses interference from background noise. Finally, consistency regularization is applied between the two adversarial learning modules to further alleviate domain gaps and improve the model's domain adaptability.

Result To verify the effectiveness and robustness of our method, we conducted experiments on the industrial nameplate dataset and on public datasets and compared it with other domain-adaptive text detection methods. The experiments showed that each module of the proposed method contributes to the overall performance to varying degrees. With ICDAR2013 as the source domain and the nameplate dataset as the target domain, our method attained precision, recall, and F1 values of 96.2%, 95.0%, and 95.6%, respectively, which are 10%, 15.3%, and 12.8% higher than those of the baseline model DBNet++. This result indicates that our method alleviates domain gaps and shifts, generates high-quality pseudo labels, and improves the model's domain adaptability. The method also performs well on the ICDAR15 and MSRA-TD500 datasets, increasing F1 by 0.9% and 3.1%, respectively, compared with state-of-the-art methods. In addition, applying our method to the efficient and accurate scene text detector (EAST) increases precision, recall, and F1 on the nameplate dataset by 5%, 11.8%, and 9.5%, respectively.

Conclusion In this study, we propose a domain-adaptive industrial scene text detection method that addresses low-quality pseudo labels and the domain gap between source and target, improving the model's domain adaptability on the target dataset. The experimental results and analysis indicate that the proposed method markedly enhances the domain adaptability of the DBNet++ text detection model and achieves state-of-the-art results in domain adaptation for both industrial nameplate and public text detection, which verifies its effectiveness. Experiments on the EAST model further demonstrate the generality of the proposed method. The method adds no computational cost or time at the inference stage.
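Adversarial feature alignment of the kind described above is commonly built from a gradient reversal layer (GRL) and binary domain classifiers. The abstract does not give the paper's exact loss definitions, so the sketch below is an assumption modeled on that common recipe, with a squared-distance variant of the image/instance consistency regularizer from Domain Adaptive Faster R-CNN standing in for the paper's consistency term. It works on per-position domain probabilities (1 = source, 0 = target) from a hypothetical discriminator; the instance-level loss keeps only positions inside the text mask, mirroring the mask filtering the abstract describes. All function names and the `lam` weight are hypothetical.

```python
import math
from typing import List, Sequence

EPS = 1e-7  # clamp so log() never sees 0


def grl_backward(grad: Sequence[float], lam: float = 1.0) -> List[float]:
    """Backward pass of a gradient reversal layer.

    The forward pass is the identity; on the way back the gradient is scaled
    by -lam, pushing the feature extractor to confuse the domain discriminator.
    """
    return [-lam * g for g in grad]


def bce(p: float, y: int) -> float:
    """Binary cross-entropy for one predicted domain probability p and domain label y."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def image_level_loss(dom_probs: Sequence[float], domain: int) -> float:
    """Average BCE over every spatial position of the image-level discriminator output."""
    return sum(bce(p, domain) for p in dom_probs) / len(dom_probs)


def instance_level_loss(dom_probs: Sequence[float], text_mask: Sequence[int], domain: int) -> float:
    """Average BCE over text positions only; background positions are masked out."""
    kept = [p for p, m in zip(dom_probs, text_mask) if m]
    return sum(bce(p, domain) for p in kept) / len(kept) if kept else 0.0


def consistency_loss(img_probs: Sequence[float], ins_probs: Sequence[float], text_mask: Sequence[int]) -> float:
    """Mean squared distance between the averaged image-level domain probability
    and each instance-level domain probability at text positions."""
    img_mean = sum(img_probs) / len(img_probs)
    kept = [q for q, m in zip(ins_probs, text_mask) if m]
    return sum((img_mean - q) ** 2 for q in kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    img = [0.5, 0.5, 0.5, 0.5]
    ins = [0.9, 0.5, 0.1, 0.5]
    mask = [0, 1, 0, 1]
    print(round(image_level_loss(img, domain=1), 4))  # 0.6931
    print(round(instance_level_loss(ins, mask, domain=1), 4))  # 0.6931
    print(consistency_loss(img, ins, mask))  # 0.0
    print(grl_backward([1.0, -2.0], lam=0.5))  # [-0.5, 1.0]
```

Under these assumptions, both discriminator losses would be computed for source (domain=1) and target (domain=0) batches, with gradients reaching the backbone through the GRL. Because the discriminators would only be used during training, they could be dropped at inference, which is one way the abstract's claim of no added inference cost can hold.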

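Keywords (Chinese version)

scene text detection/domain adaptation/text self-training/feature adversarial learning/consistency regularization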

Key words

scene text detection/domain adaptation/text self-training/feature adversarial learning/consistency regularization


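Funding

National Natural Science Foundation of China (62171043)

Beijing Natural Science Foundation (4232025)

General Science and Technology Project of the Beijing Municipal Education Commission Research Program (KM202311232003)

Qinghai Province Special Project for Innovation Platform Construction (2022-ZJ-T02)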

Publication year

2024

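Journal of Image and Graphics (中国图象图形学报)

Sponsors: Institute of Remote Sensing Applications, Chinese Academy of Sciences; China Society of Image and Graphics; Institute of Applied Physics and Computational Mathematics, Beijing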


Indexed in: CSTPCD, CSCD, Peking University Core Journals (北大核心)

Impact factor: 1.111

ISSN: 1006-8961

Number of references: 39
