
Lao Entity Recognition Based on Cross-language Learning

Traditional named entity recognition systems are mostly supervised machine learning models, which need large amounts of manually annotated data to perform well and are therefore hard to apply to low-resource languages such as Lao. After studying the structural characteristics of Chinese and Lao, this paper proposes a Lao entity recognition method based on cross-language learning that exploits the large set of Chinese-Lao parallel sentence pairs currently available to the authors' laboratory. The method needs only Chinese-Lao parallel sentence pairs rather than large amounts of named-entity annotations. First, an open-source named entity recognition tool annotates the Chinese side. Then, cross-language representations and similarity calculation project the annotations from the Chinese side onto the Lao side, followed by post-processing. Finally, a named entity recognition model is trained on character vectors fused with part-of-speech and syllable features. Experiments show that the resulting Lao entity recognition model reaches an F1 score of 74.29%.
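The projection step can be illustrated with a minimal sketch. The abstract does not say which similarity measure or alignment strategy the authors use, so everything here is an assumption: cosine similarity over a shared cross-lingual embedding table, a greedy best-match rule, and a hypothetical `threshold` parameter. The toy vectors and the `lo_*` placeholder tokens are invented for illustration and are not real Lao words or trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project_labels(src_tokens, src_labels, tgt_tokens, embed, threshold=0.5):
    """Copy each source-side entity label to the most similar target token.

    `embed` maps tokens of both languages into one shared vector space
    (assumed to come from some cross-lingual embedding model).
    Target tokens that receive no label stay tagged "O".
    """
    tgt_labels = ["O"] * len(tgt_tokens)
    for s_tok, s_lab in zip(src_tokens, src_labels):
        if s_lab == "O" or s_tok not in embed:
            continue
        best_j, best_sim = None, threshold
        for j, t_tok in enumerate(tgt_tokens):
            if t_tok not in embed:
                continue
            sim = cosine(embed[s_tok], embed[t_tok])
            if sim > best_sim:  # keep only matches above the threshold
                best_j, best_sim = j, sim
        if best_j is not None:
            tgt_labels[best_j] = s_lab
    return tgt_labels


# Toy shared embedding space (hypothetical values).
embed = {
    "我": [1.0, 0.0, 0.0],
    "爱": [0.0, 1.0, 0.0],
    "北京": [0.0, 0.0, 1.0],
    "lo_I": [0.9, 0.1, 0.0],
    "lo_love": [0.1, 0.9, 0.0],
    "lo_Beijing": [0.0, 0.1, 0.9],
}

src_tokens = ["我", "爱", "北京"]
src_labels = ["O", "O", "LOC"]  # as produced by a Chinese NER tool
tgt_tokens = ["lo_I", "lo_love", "lo_Beijing"]

print(project_labels(src_tokens, src_labels, tgt_tokens, embed))
```

A greedy match like this can assign conflicting labels or miss multi-token entities, which is presumably one reason the abstract mentions a post-processing step after projection.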

Keywords: Lao; named entity recognition; weakly supervised learning; cross-language word vector

邓喆、周兰江、周蕾越


Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China


Funding: National Natural Science Foundation of China (61662040)

2024

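Journal of Chinese Information Processing
Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences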


Indexed in: CSTPCD; CHSSCD; Peking University Core Journals (北大核心)
Impact factor: 0.8
ISSN: 1003-0077
Year, volume (issue): 2024, 38(8)