Chinese-Uyghur cross-lingual named entity recognition by fusing data augmentation and knowledge migration
葛一飞 1, 艾孜尔古丽 2, 陈德刚 1
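Author information
- 1. School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, Xinjiang
- 2. School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, Xinjiang; National Language Resources Monitoring and Research Center for Minority Languages, Beijing 100081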
Abstract
To address the scarcity of data for Uyghur named entity recognition, a zero-shot transfer method for Chinese-Uyghur cross-lingual named entity recognition is proposed. A simple and effective sequence-tagged translation method translates the source-language training data into target-language data while avoiding problems such as word-order changes and uncertain entity spans. Combining the source-language data with the translated data, a similarity-based entity augmentation method is introduced, which improves the quality of the generated text and further increases sample diversity. In a series of extensive experiments, the augmented data enable the Chinese minority pre-trained language model (CINO) to better transfer the language-specific features of the target language and the language-independent features shared across languages. The multilingual data-augmented cross-lingual knowledge transfer model reaches an F1 value of 86.50%, an improvement of 7.42% over the baseline model, demonstrating the feasibility of Chinese-Uyghur cross-lingual named entity recognition that fuses data augmentation and knowledge transfer.
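The abstract does not spell out how the sequence-tagged translation step works. One common way to keep entity labels attached to their spans during translation is to wrap each entity in a marker before translating and to read the labels back from the markers afterwards. The sketch below illustrates that general idea only; the marker format `[TYPE|text]`, the function names, and the `translate` callable are all assumptions made for illustration and are not taken from the paper. It also assumes the translation system copies the markers through unchanged and that the target language is whitespace-delimited.

```python
import re
from typing import Callable, List, Tuple

# (start, end, entity_type) with character offsets, end exclusive.
Span = Tuple[int, int, str]

# Hypothetical marker format: [TYPE|entity text]
MARKER = re.compile(r"\[([A-Z]+)\|([^\]]*)\]")


def mark_entities(text: str, spans: List[Span]) -> str:
    """Wrap each entity span in a [TYPE|...] marker so its label travels with it."""
    out, prev = [], 0
    for start, end, etype in sorted(spans):
        out.append(text[prev:start])
        out.append(f"[{etype}|{text[start:end]}]")
        prev = end
    out.append(text[prev:])
    return "".join(out)


def project_labels(marked: str) -> Tuple[List[str], List[str]]:
    """Strip markers from a (translated) sentence and emit tokens with BIO tags."""
    tokens: List[str] = []
    tags: List[str] = []
    pos = 0
    for m in MARKER.finditer(marked):
        # Text between markers is outside any entity.
        for tok in marked[pos:m.start()].split():
            tokens.append(tok)
            tags.append("O")
        # Text inside a marker gets B-/I- tags of the marker's type.
        for i, tok in enumerate(m.group(2).split()):
            tokens.append(tok)
            tags.append(("B-" if i == 0 else "I-") + m.group(1))
        pos = m.end()
    for tok in marked[pos:].split():
        tokens.append(tok)
        tags.append("O")
    return tokens, tags


def translate_labeled(
    text: str, spans: List[Span], translate: Callable[[str], str]
) -> Tuple[List[str], List[str]]:
    """Mark entities, translate with the caller-supplied function, recover BIO tags."""
    return project_labels(translate(mark_entities(text, spans)))


if __name__ == "__main__":
    # Toy stand-in for a translation system: it only reverses word order,
    # which is enough to show that labels survive reordering.
    def toy_translate(s: str) -> str:
        return " ".join(reversed(s.split()))

    toks, tags = translate_labeled(
        "Alice visited Paris", [(0, 5, "PER"), (14, 19, "LOC")], toy_translate
    )
    print(list(zip(toks, tags)))
```

Because the labels are carried inside the sentence rather than aligned after the fact, a change in word order in the target language does not shift the entity boundaries. A real system would still need to handle markers that the translator drops or corrupts.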
Key words
Chinese-Uyghur cross-lingual / named entity recognition / data augmentation / knowledge migration / CINO
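Funding
Xinjiang Uygur Autonomous Region Innovation Environment (Talents and Bases) Construction Special Project, Natural Science Program (Special Training of Minority Science and Technology Talents) (2022D03001)
National Natural Science Foundation of China (61662081)
National Social Science Fund of China (14AZD11)
Xinjiang Normal University Young Top Talent Program (XJNUQB2022-22)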
Publication year
2024