Character-aware edit distance for zero-shot Chinese character recognition

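Chen Yu (陈宇)¹, Wang Dahan (王大寒)¹, Chi Xueke (池雪可)¹, Jiang Nanfeng (江楠峰)¹, Zhang Xuyao (张煦尧)², Wang Chiming (王驰明)¹, Zhu Shunzhi (朱顺痣)¹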

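Author information

1. Fujian Key Laboratory of Pattern Recognition and Image Understanding, School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024
2. State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190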

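Abstract

Objective: Zero-shot Chinese character recognition (ZSCCR) has attracted wide attention because it can recognize unseen Chinese characters with zero or few training samples. Most existing ZSCCR methods adopt a radical-sequence matching framework: they first predict the radical sequence and then perform minimum edit distance (MED) matching against an ideographic description sequence (IDS) dictionary. However, existing MED algorithms assume that the substitution, insertion, and deletion costs are identical for all radicals, so candidate character classes end up with ambiguous and redundant distance costs during matching. To address this, a character-aware edit distance (CAED) is proposed to match the correct target character class. Method: Several radical-information extraction methods are designed to obtain finer-grained radical descriptions and hence more accurate radical substitution costs, which improves the robustness and effectiveness of MED. In addition, a radical counting module predicts the number of radicals in a sample and forms a cost gate that constrains and adjusts the insertion and deletion costs, overcoming the effect of inaccurately predicted IDS sequence lengths. Result: In experiments on handwritten, scene, and ancient-book Chinese character datasets, the proposed CAED improves accuracy on unseen character classes over previous methods by 4.64%, 1.1%, and 5.08%, respectively, while maintaining comparable performance on seen character classes. Conclusion: The proposed character-aware edit distance lets the substitution, insertion, and deletion costs adapt to the character, which effectively improves recognition of unseen Chinese characters.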

Extended abstract

Objective: Zero-shot Chinese character recognition (ZSCCR) has attracted increasing attention in recent years because it can recognize unseen Chinese characters with zero or few training samples. The fundamental idea of zero-shot learning is to recognize new classes by generalizing semantic knowledge from seen classes to unseen ones; this knowledge is usually represented by auxiliary information, such as attribute descriptions, shared between classes. Chinese characters are composed of radicals, so radicals are often used as the attributes shared between character classes. Most existing ZSCCR methods adopt a radical-based sequence matching framework that predicts the radical sequence of a character and then performs minimum edit distance (MED) matching against an ideographic description sequence (IDS) dictionary. MED compares the predicted radical sequence with each entry in the IDS dictionary, measures the difference between the two sequences, and thereby determines the category of the unseen character. However, the algorithm sets the insertion, deletion, and substitution costs all to 1, which assumes that the cost is the same for every pair of radicals. In practice, the substitution cost between similar radicals should be lower than that between dissimilar ones. Moreover, the approach lacks flexibility when the predicted IDS sequence is too long or too short, which produces redundant insertion or deletion costs. A character-aware edit distance (CAED) is therefore proposed that refines the radical substitution costs and accounts for the impact of the insertion and deletion costs.

Method: The CAED in this study adaptively adjusts the substitution, insertion, and deletion costs of the edit distance for each character in order to match the category of an unseen Chinese character. In ZSCCR, the radical-based approach hinges on two things: recognizing the radical sequence, and the metric between the predicted and candidate sequences. The accuracy of that metric directly determines the performance of the final model, so the metric between radical sequences must be refined. Specifically, CAED analyzes each cost in the edit distance. The substitution cost is derived from the similarity probability between radicals, which is computed by weighting radical structure, stroke count, components, and four-corner method information. The distance cost between different radicals is thus finely adjusted, which improves the robustness and performance of MED. In addition, a radical counting module is introduced to predict the number of radicals. The insertion and deletion costs are constrained by comparing this radical count with the number of radicals in the predicted sequence, which mitigates the problem of predicted radical sequences that are too long or too short. Refined distances between radical sequences are obtained as a result. Compared with traditional methods, the proposed method can match the correct character class with the shortest distance despite misrecognition of similar radicals, mismatched radical sequence lengths, or both at once.

Result: Experiments were conducted on the handwriting database (HWDB) and the 12th International Conference on Document Analysis and Recognition (ICDAR 2013) datasets, the Chinese Text in the Wild (CTW) dataset, and the ancient handwritten characters database (AHCDB). On the handwritten and scene Chinese character datasets, the proposed CAED consistently outperformed current state-of-the-art ZSCCR methods. CAED was then integrated with other networks on the ancient Chinese character dataset to demonstrate its scalability. The radical counting module was also evaluated separately, because its accuracy directly affects the cost gate. Ablation experiments validated the effectiveness of the insertion and deletion cost constraint module and of the substitution cost refinement module. A combinatorial analysis of the information sources that contribute to the substitution cost was carried out to determine the value of each. Finally, conventional Chinese character recognition experiments were conducted to evaluate CAED on seen character categories only, where the accuracy reached 97.02% on ICDAR 2013. Although this is not the best reported result, CAED remains highly competitive across all comparisons. Overall, CAED improved accuracy on unseen Chinese characters by 4.64%, 1.1%, and 5.08% over other methods on the handwritten (HWDB and ICDAR 2013), scene (CTW), and ancient (AHCDB) datasets, respectively.

Conclusion: A CAED for zero-shot Chinese character recognition is proposed, in which the editing costs of the edit distance adapt to the character. The method refines the substitution cost between radicals using multiple pieces of information, which corrects cases where the model confuses similar radicals. A radical counting module forms a cost gate that constrains the insertion and deletion costs, which alleviates the problem of mismatched radical sequence lengths. In addition, the method can be combined with any network based on radical sequence recognition to improve its resistance to misrecognition.
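
The matching step described above can be illustrated with a minimal Python sketch of a weighted edit distance. This is not the authors' implementation. The abstract does not give the exact cost formulas, so everything specific here is an assumption made for illustration: the substitution cost is taken as one minus a similarity score supplied in a toy table, the gate simply halves the deletion cost when the predicted sequence holds more radicals than the counting module reports (and halves the insertion cost in the opposite case), and the names `make_sub_cost`, `gated_costs`, and `match_character`, the similarity values, and the two-entry IDS dictionary are all hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

SubCost = Callable[[str, str], float]


def is_idc(token: str) -> bool:
    """True for Unicode ideographic description characters (U+2FF0..U+2FFB)."""
    return len(token) == 1 and "\u2ff0" <= token <= "\u2ffb"


def make_sub_cost(similarity: Dict[Tuple[str, str], float]) -> SubCost:
    """Build a substitution cost from a toy similarity table (assumed form: 1 - similarity)."""
    def sub_cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        s = similarity.get((a, b), similarity.get((b, a), 0.0))
        return 1.0 - s
    return sub_cost


def weighted_edit_distance(pred: Sequence[str], cand: Sequence[str], sub_cost: SubCost,
                           ins_cost: float = 1.0, del_cost: float = 1.0) -> float:
    """Standard dynamic-programming edit distance with configurable costs."""
    m, n = len(pred), len(cand)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost      # delete every predicted token
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost      # insert every candidate token
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + sub_cost(pred[i - 1], cand[j - 1]),
            )
    return d[m][n]


def gated_costs(n_pred: int, n_count: int, relaxed: float = 0.5) -> Tuple[float, float]:
    """Hypothetical cost gate: return (ins_cost, del_cost) given the two radical counts."""
    if n_pred > n_count:      # predicted sequence looks too long: make deletion cheaper
        return 1.0, relaxed
    if n_pred < n_count:      # predicted sequence looks too short: make insertion cheaper
        return relaxed, 1.0
    return 1.0, 1.0


def match_character(pred: Sequence[str], ids_dict: Dict[str, Sequence[str]],
                    sub_cost: SubCost, n_count: Optional[int] = None) -> Tuple[str, float]:
    """Return the dictionary character whose IDS is closest to the predicted sequence."""
    ins_cost, del_cost = 1.0, 1.0
    if n_count is not None:
        n_pred = sum(not is_idc(t) for t in pred)   # radicals only, structure symbols excluded
        ins_cost, del_cost = gated_costs(n_pred, n_count)
    dist, ch = min(
        (weighted_edit_distance(pred, ids, sub_cost, ins_cost, del_cost), ch)
        for ch, ids in ids_dict.items()
    )
    return ch, dist


# Toy data, invented for illustration only.
IDS_DICT = {"明": ["⿰", "日", "月"], "朋": ["⿰", "月", "月"]}
SIMILARITY = {("曰", "日"): 0.9, ("曰", "月"): 0.3}

if __name__ == "__main__":
    sub = make_sub_cost(SIMILARITY)
    print(match_character(["⿰", "曰", "月"], IDS_DICT, sub))
    print(match_character(["⿰", "日", "月", "丶"], IDS_DICT, sub, n_count=2))
```

With uniform costs, the misrecognized sequence ⿰曰月 is at distance 1 from both 明 and 朋, so the match is ambiguous; with the toy similarity table the substitution 曰→日 becomes cheaper than 曰→月 and the tie is broken. The second call shows the gate: one spurious trailing token is penalized less because the assumed radical count says the sequence is too long.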

Key words

zero-shot Chinese character recognition (ZSCCR); ideographic description sequence (IDS); edit distance; character-aware; radical information; cost gate

Publication year

2024

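Journal of Image and Graphics (中国图象图形学报)

Institute of Remote Sensing Applications, Chinese Academy of Sciences; China Society of Image and Graphics; Institute of Applied Physics and Computational Mathematics, Beijing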

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 1.111

ISSN: 1006-8961
