Chunk-based Tibetan Dependency Parsing and Automatic Annotation Method

达瓦追玛¹, 曹玺², 尼玛扎西¹, 群诺¹, 道吉扎西¹

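Author information

1. School of Information Science and Technology, Tibet University, Lhasa 850000, Tibet; Tibet Autonomous Region Key Laboratory of Tibetan Information Technology and Artificial Intelligence, Tibet University, Lhasa 850000, Tibet; Engineering Research Center of Tibetan Information Technology, Ministry of Education, Tibet University, Lhasa 850000, Tibet; Provincial-Ministerial Collaborative Innovation Center for Tibet Informatization, Tibet University, Lhasa 850000, Tibet
2. School of Information Science and Technology, Tibet University, Lhasa 850000, Tibet; Provincial-Ministerial Collaborative Innovation Center for Tibet Informatization, Tibet University, Lhasa 850000, Tibet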

Abstract

Dependency parsing is one of the core techniques in natural language processing; it aims to determine the syntactic structure of a sentence by analyzing the dependency relations between its words. Current research on Tibetan dependency parsing faces two problems: long sentences are difficult to parse, and the mapping for coarse-grained dependency conversion is incomplete. To address these issues, this paper proposes a Tibetan dependency parsing and automatic annotation method based on chunks and fine-grained part-of-speech matching rules. The method first refines the Tibetan dependency annotation scheme, then manually annotates a dataset under this scheme and extracts part-of-speech matching rules from it. It then uses Tibetan sentence chunk recognition to improve the accuracy of long-sentence parsing. Finally, a prototype system for automatic Tibetan dependency annotation, TDParser, is implemented, and a Tibetan dependency treebank containing 7,335 dependency syntax entries is constructed. Experiments verify the performance of TDParser and the effectiveness of the automatically annotated data.

Key words

Tibetan/dependency parsing/chunk/automatic annotation

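Funding

Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0116102)

Tibet University Graduate High-Level Talent Training Program (2021-GSP-S128)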

Publication year

2024

高原科学研究

CSCD

Number of references: 35
