IHCCD: a dataset for the recognition of irregular handwritten Chinese characters

季佳美¹, 邵允学¹, 季倓正¹

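Author information

1. College of Computer and Information Engineering (College of Artificial Intelligence), Nanjing Tech University, Nanjing 211816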

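Summary

Objective With the rapid development of deep learning, regular handwritten Chinese character recognition (HCCR) has achieved breakthrough progress, but research on recognizing irregularly written Chinese characters is still in its infancy. Because of calligraphic schools and individual writing habits, handwritten Chinese characters often differ markedly from printed fonts, so the overall structure of characters in the same category varies greatly, and recognition models trained on existing datasets cannot accurately recognize irregularly written characters. Method To promote research on irregular handwritten Chinese character recognition, this paper builds the first irregular handwritten Chinese character dataset (IHCCD), which currently contains 3 755 categories with 30 samples per category. The paper also reports the baseline performance of the classical deep learning models ResNet, CBAM-ResNet, Vision Transformer, and Swin Transformer on this dataset. Result The experiments show that these classical models perform well on the regularly written CASIA-HWDB1.1 dataset, where Swin Transformer reaches the highest accuracy of 95.31%. However, models trained on the CASIA-HWDB1.1 training set perform poorly on the IHCCD test set, with a highest accuracy of only 30.20%. After the IHCCD training set is added, all classical models improve substantially on the IHCCD test set, and the highest accuracy reaches 89.89%, which shows that the IHCCD dataset is valuable for research on irregular handwritten Chinese character recognition. Conclusion Existing OCR models still have limitations, and the IHCCD dataset collected in this paper can effectively improve the generalization of recognition models. Dataset download link: https://pan.baidu.com/s/1PtcfWj3yUSz68o2ZzvPJOQ?pwd=66Y7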
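The dataset size and the per-category split can be illustrated with a short sketch. The counts (3 755 categories, 30 samples each, first 20 for training and the next 10 for testing) come from the abstracts on this page; the function names and the dictionary layout of the dataset are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Figures stated in the abstract.
NUM_CLASSES = 3755
SAMPLES_PER_CLASS = 30
TRAIN_PER_CLASS = 20  # first 20 samples train, remaining 10 test


def split_class_samples(samples: List[str],
                        train_per_class: int = TRAIN_PER_CLASS
                        ) -> Tuple[List[str], List[str]]:
    """Split one category's ordered samples into (train, test)."""
    return samples[:train_per_class], samples[train_per_class:]


def split_dataset(dataset: Dict[str, List[str]]
                  ) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Apply the per-category split to a {label: [sample, ...]} mapping.

    The mapping layout is an assumption made for illustration only.
    """
    train, test = [], []
    for label, samples in dataset.items():
        tr, te = split_class_samples(samples)
        train.extend((s, label) for s in tr)
        test.extend((s, label) for s in te)
    return train, test


# Totals implied by the stated counts.
train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * (SAMPLES_PER_CLASS - TRAIN_PER_CLASS)
```

Under these counts the split yields 75 100 training images and 37 550 test images, 112 650 in total.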

Abstract

Objective With the rapid development of deep learning technology, the task of handwritten Chinese character recognition (HCCR) has made breakthrough progress. Initially, text recognition research focused primarily on English characters and numbers. However, as artificial intelligence technology has matured, numerous researchers have turned to Chinese character recognition. In recent years, Chinese character recognition has been widely applied in fields such as bank bill recognition, mail sorting, and office automation. Chinese characters are the most widely used language in the world, carry the richest information, and are an important carrier of human communication. Therefore, research on Chinese character recognition is of crucial value. Despite these advances, the recognition of irregular handwritten Chinese characters remains challenging. Handwritten Chinese characters are often influenced by calligraphic styles and individual writing habits, leading to notable deviations from regular printed fonts. These variations can produce considerable differences in the overall structure of characters within the same category. Consequently, recognition models trained on existing datasets of regular handwriting may struggle to identify the irregularly handwritten Chinese characters encountered in real-world scenarios. For example, when a picture is sent through WeChat, the text in the picture may contain sensitive words. If these words are written regularly, the text recognition engine can accurately identify and filter them. However, some people deliberately write irregularly to evade the text recognition engine and circumvent regulation, so the engine cannot recognize these words. Research on the recognition of irregular handwritten Chinese characters is therefore of considerable importance and can be applied in information security and filtering.

Method Irregular handwritten Chinese characters can be classified into the following types: missing or misordered strokes, problems with the connection or separation of strokes, maliciously enlarged or shrunken radicals, serious distortion of the character shape, saki change of the form, and excessive horizontal and vertical amplitudes. These irregularities misplace the entire spatial structure of the characters and easily lead to ambiguities and misinterpretations. To promote research on the recognition of irregular handwritten Chinese characters, this paper collects the first irregular handwritten Chinese character dataset (IHCCD), which currently contains 3 755 categories with 30 samples per category. In the experiments, the first 20 samples of each category were used for training and the next 10 for testing. IHCCD was written by different writers on A4 printing paper, and a scanner was used as the input device to convert the handwritten samples into digital images. During collection, the writers did not need to follow the regular stroke order of Chinese characters. They could freely adjust stroke thickness, length, and position and enlarge or reduce radicals arbitrarily. They could also change the tilt of the characters, producing distorted shapes and misaligned spatial structures that bypass current text recognition engines. The collected samples were then processed with a series of image processing techniques, including image skew correction, single-character segmentation, Otsu binarization, and character normalization, to construct the IHCCD dataset.

Result Detailed experiments were conducted on the IHCCD and CASIA-HWDB1.1 datasets to compare the recognition performance of classical network models, namely ResNet, CBAM-ResNet, Vision Transformer, and Swin Transformer, under different experimental settings. The results show that these models achieve good performance on the regularly written CASIA-HWDB1.1 dataset, where Swin Transformer reaches the highest accuracy of 95.31%. However, models trained on the CASIA-HWDB1.1 training set perform poorly on the IHCCD test set, with a highest accuracy of only 30.20%. After the IHCCD training set is added, the recognition performance of all the classical models on the IHCCD test set improves markedly, and the highest accuracy reaches 89.89%, showing that the IHCCD dataset is crucial for the study of irregular handwritten Chinese character recognition.

Conclusion Existing optical character recognition (OCR) models still have limitations, and the dataset collected in this paper can effectively enhance the generalization performance of recognition models. However, even for Swin Transformer, the best-performing model, a large gap remains between the recognition accuracy on irregularly written Chinese characters and that on regularly written ones, which calls for further in-depth study. Dataset download link: https://pan.baidu.com/s/1PtcfWj3yUSz68o2ZzvPJOQ?pwd=66Y7
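Of the preprocessing steps named in the Method paragraph, Otsu binarization is the easiest to illustrate in isolation. The following is a minimal pure-Python sketch of the standard Otsu method, not the authors' implementation, which is not given on this page. It assumes 8-bit grayscale pixels (0 to 255) and dark ink on light paper; the function names `otsu_threshold` and `binarize` are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of pixel values <= t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Mark ink pixels as 1 and background as 0.

    Assumes dark ink on light paper, so values <= threshold count as ink.
    """
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]
```

For scanned pages, library routines such as OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag compute the same threshold far faster; the loop above only makes the criterion explicit.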

Key words

irregular writing/handwritten Chinese character recognition(HCCR)/IHCCD dataset/deep learning/classical classification model

Publication year

2024

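Journal of Image and Graphics

Institute of Remote Sensing Applications, Chinese Academy of Sciences; China Society of Image and Graphics; Beijing Institute of Applied Physics and Computational Mathematics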

Indexed in: CSTPCD, CSCD, Peking University Core Journals

Impact factor: 1.111

ISSN: 1006-8961
