Lao Entity Recognition Based on Cross-language Learning

Original title: 基于跨语言学习的老挝语实体识别方法

Abstract: Traditional named entity recognition systems are mainly supervised machine learning models, which need large amounts of manually annotated data to perform well and are therefore hard to apply to low-resource languages such as Lao. After studying the structural features of Chinese and Lao, this paper proposes a Lao entity recognition method based on cross-language learning that exploits the large set of Chinese-Lao parallel sentence pairs currently available to the authors' laboratory. The method requires only Chinese-Lao parallel sentence pairs, not large amounts of named entity annotations. First, an open-source named entity recognition tool annotates the named entities on the Chinese side. Then, cross-language representations and similarity calculation are used to project the annotations from the Chinese side to the Lao side, followed by post-processing. Finally, a named entity recognition model is trained on character vectors that incorporate part-of-speech and syllable features. Experiments show that the resulting Lao entity recognition model reaches an F1 score of 74.29%.
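
The projection step named in the abstract can be pictured with a small sketch. The abstract does not give the actual procedure, so everything below is an assumption made for illustration: word-level vectors in a shared Chinese-Lao space, cosine similarity as the similarity measure, a fixed threshold, and the hypothetical names `cosine` and `project_labels`. The Lao tokens are toy romanized placeholders and the 2-D vectors are hand-made, not trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def project_labels(src_tokens, src_labels, tgt_tokens, src_vecs, tgt_vecs, threshold=0.5):
    """Copy each source-side entity label to the most similar target token.

    Hypothetical helper: a label is projected only when the best cosine
    similarity reaches `threshold`; all other target tokens stay "O".
    """
    if not tgt_tokens:
        return []
    tgt_labels = ["O"] * len(tgt_tokens)
    for s_tok, s_lab in zip(src_tokens, src_labels):
        if s_lab == "O":
            continue  # only entity tokens are projected
        scores = [cosine(src_vecs[s_tok], tgt_vecs[t]) for t in tgt_tokens]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            tgt_labels[best] = s_lab
    return tgt_labels

# Toy parallel pair: "The capital of Laos is Vientiane."
zh_tokens = ["老挝", "首都", "是", "万象"]
zh_labels = ["LOC", "O", "O", "LOC"]  # as if produced by an off-the-shelf Chinese NER tool
lo_tokens = ["nakhonluang", "khong", "lao", "maen", "viangchan"]  # romanized placeholders

# Hand-made vectors standing in for a shared cross-language embedding space.
zh_vecs = {"老挝": [1.0, 0.1], "首都": [0.7, 0.7], "是": [-1.0, 0.2], "万象": [0.1, 1.0]}
lo_vecs = {
    "nakhonluang": [0.6, 0.8],
    "khong": [-0.5, -0.5],
    "lao": [0.9, 0.2],
    "maen": [-1.0, 0.1],
    "viangchan": [0.2, 0.9],
}

projected = project_labels(zh_tokens, zh_labels, lo_tokens, zh_vecs, lo_vecs)
print(list(zip(lo_tokens, projected)))
```

Raising `threshold` trades recall for precision in the projected labels. The post-processing that the abstract mentions after projection is not modeled here.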

Keywords:

Lao; named entity recognition; weakly supervised learning; cross-language word vector

Authors:

Deng Zhe (邓喆), Zhou Lanjiang (周兰江), Zhou Leiyue (周蕾越)

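Affiliation:

Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China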

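Keywords (original Chinese):

老挝语 (Lao); 命名实体识别 (named entity recognition); 弱监督学习 (weakly supervised learning); 跨语言词向量 (cross-language word vector)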

Funding:

National Natural Science Foundation of China

Grant number:

61662040

Publication year:

2024

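Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences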

CSTPCD; CHSSCD; Peking University Core (北大核心)

Impact factor: 0.8

ISSN: 1003-0077

Year, volume (issue): 2024, 38(8)