A Mongolian Pre-trained Language Model Incorporating Agglutinative Characteristics

Pre-trained language models (PLMs) are widely used in natural language processing (NLP) tasks and perform remarkably well. Current PLMs are trained mainly on resource-rich languages such as English and Chinese; owing to the lack of large-scale data resources and the complexity of their linguistic features, PLMs have not yet been studied in depth for low-resource languages, especially agglutinative languages such as Mongolian. To address the data-scarcity problem, this study builds a large-scale Mongolian pre-training dataset and datasets for three downstream tasks: news classification, named entity recognition (NER), and part-of-speech (POS) tagging. On this basis, we propose IAMC-BERT, a Mongolian pre-trained language model that incorporates agglutinative characteristics. The model injects Mongolian agglutinative features into both the tokenization stage and the pre-training stage. Specifically, the tokenization stage converts a Mongolian word sequence into fine-grained sub-words consisting of a stem and several suffixes, while the training stage employs a morphology-based masking strategy to strengthen the model's ability to learn agglutinative features. Experimental results on the three downstream tasks show that the method outperforms the conventional BERT approach and successfully incorporates Mongolian agglutinative characteristics.
IAMC-BERT: A Mongolian Pre-training Language Model with Agglutinative Characteristics
Pre-trained language models (PLMs) have become crucial for achieving state-of-the-art results on various natural language processing (NLP) tasks. However, current progress, which mainly focuses on major languages such as English and Chinese, has yet to thoroughly investigate low-resource languages, particularly agglutinative languages like Mongolian, owing to the scarcity of large-scale data resources and the difficulty of modeling agglutinative knowledge. To address the issue of data scarcity, we create a large-scale Mongolian pre-training dataset and datasets for three downstream tasks: news classification, named entity recognition (NER), and part-of-speech (POS) tagging. This paper proposes a novel PLM for Mongolian, IAMC-BERT, which incorporates Mongolian agglutinative characteristics. We integrate these agglutinative features into both the tokenization stage and the pre-training stage. Specifically, the tokenization stage converts a Mongolian word sequence into fine-grained sub-word tokens comprising a stem and several suffixes, and the pre-training stage employs a morphology-based masking strategy to enhance the model's ability to learn agglutinative knowledge. Experimental results on the three downstream tasks demonstrate that our method surpasses the traditional BERT approach and successfully learns Mongolian agglutinative knowledge.
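The abstract describes two mechanisms: splitting words into a stem plus suffix tokens, and masking at the morpheme level during pre-training. A minimal sketch of how those two ideas could fit together is shown below. The toy suffix lexicon, the `##` continuation-token convention, and both function names are hypothetical illustrations for exposition, not the authors' published implementation:

```python
import random

# Hypothetical toy suffix lexicon; the paper's actual morphological
# analyzer and vocabulary are not reproduced here.
SUFFIXES = ["iin", "aas", "uud", "d"]

def morph_tokenize(word):
    """Split a word into a stem plus known suffixes, longest match first.
    Greedy right-to-left stripping; a real analyzer is more careful."""
    suffixes = []
    changed = True
    while changed:
        changed = False
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf):
                suffixes.insert(0, "##" + suf)  # mark continuation tokens
                word = word[: -len(suf)]
                changed = True
                break
    return [word] + suffixes

def whole_morpheme_mask(tokens, mask_prob=0.15, rng=None):
    """Morphology-based masking: when a stem is selected, mask it together
    with all of its trailing '##' suffix tokens as one unit."""
    rng = rng or random.Random(0)
    out, i = [], 0
    while i < len(tokens):
        j = i + 1
        while j < len(tokens) and tokens[j].startswith("##"):
            j += 1  # extend the span over the word's suffix tokens
        if rng.random() < mask_prob:
            out.extend(["[MASK]"] * (j - i))  # mask the whole morpheme group
        else:
            out.extend(tokens[i:j])
        i = j
    return out
```

Masking stem and suffixes jointly forces the model to reconstruct the full inflected form from context, which is the intuition behind strengthening agglutinative knowledge.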

pre-training language model; Mongolian; agglutinative knowledge

娜木汗、金筱霖、王炜华


College of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia 010010

National & Local Joint Engineering Research Center of Mongolian Intelligent Information Processing Technology, Hohhot, Inner Mongolia 010010

Inner Mongolia Autonomous Region Key Laboratory of Mongolian Information Processing Technology, Hohhot, Inner Mongolia 010010

School of Public Administration, Inner Mongolia University, Hohhot, Inner Mongolia 010010


pre-trained language model; Mongolian; agglutinative characteristics

2024

Journal of Minzu University of China (Natural Sciences Edition)
Minzu University of China
Impact factor: 0.462
ISSN: 1005-8036
Year, Volume (Issue): 2024, 33(3)