
Chinese-Uyghur cross-lingual named entity recognition by fusing data augmentation and knowledge migration

To address the scarcity of data for Uyghur named entity recognition, a zero-shot transfer method for Chinese-Uyghur cross-lingual named entity recognition is proposed. A simple and effective sequence-tagged translation method translates the source-language training data into the target language, avoiding problems such as word-order changes and uncertain entity spans. Combining the source-language data with the translated data, a similarity-based entity augmentation method is introduced, which improves the quality of the generated text and further increases sample diversity. In a series of extensive experiments, the augmented data enable the Chinese minority pre-trained language model (CINO) to better transfer both the language-specific features of the target language and the language-independent features shared across languages. The multilingual data-augmented cross-lingual knowledge transfer model reaches an F1 score of 86.50%, an improvement of 7.42% over the baseline model, which demonstrates the feasibility of Chinese-Uyghur cross-lingual named entity recognition by fusing data augmentation and knowledge migration.
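The abstract does not spell out how the similarity-based entity augmentation works, so the following is only a minimal sketch of one common reading: each labelled entity is swapped for the most similar entity of the same type, and the BIO tags are regenerated to match the new span. The names `extract_spans`, `augment`, `entity_bank`, `embed` and `threshold` are hypothetical, the toy vectors stand in for real embeddings, and English tokens are used purely for readability; none of this is taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_spans(tags):
    """Return (start, end, type) for each BIO entity span; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def augment(tokens, tags, entity_bank, embed, threshold=0.5):
    """Swap each entity for its most similar same-type entity in the bank.

    entity_bank maps an entity type to a list of token lists (assumed format).
    embed maps a tuple of tokens to a vector (assumed interface).
    Entities with no candidate above the threshold are left unchanged.
    """
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, etype in extract_spans(tags):
        # Copy the non-entity text that precedes this span.
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        original = tuple(tokens[start:end])
        candidates = [tuple(e) for e in entity_bank.get(etype, [])
                      if tuple(e) != original]
        best = max(candidates,
                   key=lambda e: cosine(embed(original), embed(e)),
                   default=None)
        if best is not None and cosine(embed(original), embed(best)) >= threshold:
            replacement = list(best)
        else:
            replacement = list(original)
        # Regenerate tags so the span stays aligned with its new length.
        new_tokens += replacement
        new_tags += [f"B-{etype}"] + [f"I-{etype}"] * (len(replacement) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Toy vectors standing in for real entity embeddings (illustrative only).
toy_vectors = {
    ("Beijing",): [1.0, 0.0],
    ("Urumqi",): [0.9, 0.1],
    ("Shandong", "University"): [0.0, 1.0],
}


def toy_embed(span):
    return toy_vectors[span]


bank = {"LOC": [["Urumqi"], ["Shandong", "University"]]}
aug_tokens, aug_tags = augment(
    ["He", "lives", "in", "Beijing"], ["O", "O", "O", "B-LOC"], bank, toy_embed
)
```

Because the tags are rebuilt from the replacement span rather than copied, a substitution of a different length keeps tokens and labels aligned, which is the property that makes this kind of augmentation safe for sequence labelling.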

Chinese-Uyghur cross-lingual; named entity recognition; data augmentation; knowledge migration; CINO

Ge Yifei (葛一飞), Aizierguli (艾孜尔古丽), Chen Degang (陈德刚)


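School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, Xinjiang, China

National Language Resources Monitoring and Research Center for Minority Languages, Beijing 100081, China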


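Xinjiang Uygur Autonomous Region Innovation Environment (Talents and Bases) Construction Special Project, Natural Science Program (Special Training of Ethnic Minority Scientific and Technological Talents); National Natural Science Foundation of China; National Social Science Fund of China; Xinjiang Normal University Young Top Talent Program

2022D03001; 61662081; 14AZD11; XJNUQB2022-22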

2024

Journal of Shandong University (Engineering Science)
Shandong University


CSTPCD; Peking University Core Journals
Impact factor: 0.634
ISSN: 1672-3961
Year, volume (issue): 2024, 54(4)