Research and Application of Modern Burmese Word Segmentation Scheme for Natural Language Processing

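Chen Yu (陈宇)¹, Qin Donghong (秦董洪)², Zhang Hui (张慧)², Zhang Xiaoyan (张啸岩)², Yang Guoying (杨国影)³, Ou Jiangling (欧江玲)¹, Pang Juncai (庞俊彩)⁴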

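Author information

1. School of Southeast Asian Languages and Cultures, Guangxi Minzu University, Nanning, Guangxi 530007
2. School of Artificial Intelligence, Guangxi Minzu University, Nanning, Guangxi 530007
3. School of Foreign Languages, Peking University, Beijing 100083
4. School of Foreign Languages, Yunnan University, Kunming, Yunnan 650504; School of Foreign Languages, National University of Defense Technology, Nanjing, Jiangsu 210000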

Abstract

Burmese word segmentation is an indispensable foundational task in Burmese natural language processing, and the segmentation scheme is a key issue for research on automatic word segmentation. Drawing on word segmentation experience from Chinese, Tibetan, and other written languages, and taking into account the characteristics of Burmese script, its encoding in computers, and Burmese grammar, this paper develops a relatively systematic word segmentation scheme suitable for modern Burmese. Based on this scheme, the open-source, manually annotated Burmese word segmentation corpus myPOS 0.9 was manually re-annotated. Experimental results show that the scheme yields better performance under six common word segmentation algorithms.

Key words

Myanmar / Natural Language Processing / Modern Burmese / Word Segmentation Scheme

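Funding

National Natural Science Foundation of China (61462009)

National Natural Science Foundation of China (61862007)

Guangxi Natural Science Foundation (2018GXNSFAA281269)

Innovation Project of Guangxi Graduate Education (YCSW2023268)

Guangxi Minzu University Teaching Reform Project (2021XJGY10)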

Publication year

2024

Journal of Guangxi Minzu University (Natural Science Edition)

Guangxi Minzu University

Impact factor: 0.245

ISSN: 1673-8462

Number of references: 15