Chunk-based Tibetan Dependency Parsing and Automatic Annotation Method
达瓦追玛¹, 曹玺², 尼玛扎西¹, 群诺¹, 道吉扎西¹
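Author information
- 1. School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000; Tibet Autonomous Region Key Laboratory of Tibetan Information Technology and Artificial Intelligence, Tibet University, Lhasa, Tibet 850000; Engineering Research Center of Tibetan Information Technology, Ministry of Education, Tibet University, Lhasa, Tibet 850000; Provincial-Ministerial Collaborative Innovation Center for Tibet Informatization, Tibet University, Lhasa, Tibet 850000
- 2. School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000; Provincial-Ministerial Collaborative Innovation Center for Tibet Informatization, Tibet University, Lhasa, Tibet 850000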
Abstract
Dependency parsing is one of the core techniques of natural language processing; it determines the syntactic structure of a sentence by analyzing the dependency relations between its words. Research on Tibetan dependency parsing currently faces two problems: long sentences are difficult to parse, and the mappings used to convert coarse-grained dependencies are incomplete. To address these problems, this paper proposes a Tibetan dependency parsing and automatic annotation method based on chunks and fine-grained part-of-speech matching rules. The method first refines the Tibetan dependency annotation scheme, then manually annotates a dataset under that scheme and extracts part-of-speech matching rules from it. It then applies Tibetan sentence chunk recognition to improve the accuracy of long-sentence parsing. Finally, a prototype system for automatic Tibetan dependency annotation, TDParser, is implemented, and a Tibetan dependency treebank containing 7,335 dependency syntax entries is constructed. Experiments verify the performance of TDParser and the effectiveness of the automatically annotated data.
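The pipeline summarized in the abstract (chunk recognition followed by part-of-speech matching rules) can be illustrated with a minimal sketch. This is not TDParser: the POS tags, the chunk-boundary set `CHUNK_END`, the rule table `RULES`, the relation labels, and the head-final attachment policy are all hypothetical assumptions chosen for illustration, since the abstract does not specify the actual tag set or rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Token:
    form: str
    pos: str  # hypothetical POS tag, not the paper's tag set


# Assumption: a case particle or a verb closes the current chunk.
CHUNK_END = {"CASE", "VERB"}

# Assumption: (dependent POS, head POS) -> relation label. Illustrative only.
RULES: Dict[Tuple[str, str], str] = {
    ("NOUN", "CASE"): "case-arg",
    ("CASE", "VERB"): "arg",
    ("NOUN", "VERB"): "obj",
    ("ADV", "VERB"): "advmod",
}


def split_chunks(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs, one per chunk."""
    chunks, start = [], 0
    for i, tok in enumerate(tokens):
        if tok.pos in CHUNK_END:
            chunks.append((start, i))
            start = i + 1
    if start < len(tokens):  # trailing tokens form a final chunk
        chunks.append((start, len(tokens) - 1))
    return chunks


def parse(tokens: List[Token]) -> List[Tuple[int, str]]:
    """Return one (head, label) pair per token; heads are 1-based, 0 is root."""
    chunks = split_chunks(tokens)
    if not chunks:
        return []
    arcs = [(0, "root")] * len(tokens)
    root = chunks[-1][1]  # assumption: the last chunk's head is the sentence root
    for start, end in chunks:
        # Inside a chunk, every token attaches to the chunk-final token.
        for i in range(start, end):
            label = RULES.get((tokens[i].pos, tokens[end].pos), "dep")
            arcs[i] = (end + 1, label)
        # Across chunks, each chunk head attaches to the sentence root.
        if end != root:
            label = RULES.get((tokens[end].pos, tokens[root].pos), "dep")
            arcs[end] = (root + 1, label)
    return arcs


# Placeholder word forms; only the POS sequence matters here.
sentence = [Token("w1", "NOUN"), Token("w2", "CASE"), Token("w3", "NOUN"),
            Token("w4", "ADV"), Token("w5", "VERB")]
arcs = parse(sentence)
```

Because rules are matched only inside a chunk and between chunk heads, the number of candidate word pairs stays small even for long sentences, which is one plausible reading of why chunking helps long-sentence parsing.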
Keywords
Tibetan / dependency parsing / chunk / automatic annotation
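Funding
Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0116102)
Tibet University Postgraduate High-Level Talent Training Program (2021-GSP-S128)
Publication year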
2024