Journal of Chinese Information Processing, 2024, Vol. 38, Issue 8: 84-92.

Lao Entity Recognition Based on Cross-language Learning

邓喆 (Deng Zhe), 周兰江 (Zhou Lanjiang), 周蕾越 (Zhou Leiyue)

Author information

  • 1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China

Abstract

Traditional named entity recognition systems are mainly supervised machine learning models. They need large amounts of manually annotated data to perform well, which makes them hard to apply to low-resource languages such as Lao. After analysing the structural features of Chinese and Lao, this paper proposes a Lao entity recognition method based on cross-language learning that exploits the large set of Chinese-Lao parallel sentence pairs collected by the authors' laboratory. The method requires only Chinese-Lao parallel sentence pairs and no large named-entity-annotated corpus. First, an open-source named entity recognition tool annotates the Chinese side. Then, cross-language representations and similarity calculation project the annotations from the Chinese side onto the Lao side, followed by post-processing. Finally, a named entity recognition model is trained on character vectors fused with part-of-speech and syllable features. Experiments show that the resulting Lao entity recognition model reaches an F1 score of 74.29%.
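
The annotation projection step can be illustrated with a minimal sketch. The abstract does not say which cross-language representation, similarity measure, or matching rule the authors use, so everything here is an assumption made for illustration: cosine similarity, a greedy best-match rule, the `threshold` value, the hypothetical function name `project_labels`, and the toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def project_labels(src_vecs, src_labels, tgt_vecs, threshold=0.5):
    """Project entity labels from source tokens onto target tokens.

    Each labelled source token (label != "O") is matched to its most similar
    target token. The label is copied only if the similarity reaches
    `threshold` (an assumed cut-off). If two source tokens compete for the
    same target token, the higher-similarity match wins.
    """
    tgt_labels = ["O"] * len(tgt_vecs)
    best_sim = [0.0] * len(tgt_vecs)
    for vec, label in zip(src_vecs, src_labels):
        if label == "O" or not tgt_vecs:
            continue
        sims = [cosine(vec, t) for t in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold and sims[j] > best_sim[j]:
            tgt_labels[j] = label
            best_sim[j] = sims[j]
    return tgt_labels

# Toy vectors in a shared embedding space (invented for illustration only).
zh_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zh_labels = ["PER", "O", "LOC"]  # labels from a Chinese-side NER tool
lo_vecs = [[0.0, 0.1, 0.9], [0.9, 0.1, 0.0], [0.1, 0.9, 0.1]]

lo_labels = project_labels(zh_vecs, zh_labels, lo_vecs)
print(lo_labels)  # ['LOC', 'PER', 'O']
```

The threshold leaves a Lao token unlabelled when no Chinese entity token is close enough, which trades recall for less noise in the projected labels. The post-processing mentioned in the abstract would then operate on these projected labels.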

Keywords

Lao/named entity recognition/weakly supervised learning/cross-language word vector

Funding

National Natural Science Foundation of China (61662040)

Publication year

2024
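Journal of Chinese Information Processing
Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences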

Indexed in: CSTPCD, CSCD, CHSSCD, PKU Core Journals (北大核心)
Impact factor: 0.8
ISSN: 1003-0077
Number of references: 28