IAMC-BERT: A Mongolian Pre-training Language Model with Agglutinative Characteristics
Pre-training Language Models (PLMs) have become crucial for achieving the best results in various Natural Language Processing (NLP) tasks. However, current progress, which mainly focuses on high-resource languages such as English and Chinese, has yet to thoroughly investigate low-resource languages, particularly agglutinative languages like Mongolian, due to the scarcity of large-scale data resources and the difficulty of capturing agglutinative knowledge. To address the issue of data scarcity, we create a large-scale Mongolian PLM dataset and three datasets for three downstream tasks: News Classification, Named Entity Recognition (NER), and Part-of-Speech (POS) prediction. This paper proposes a novel PLM for the Mongolian language, IAMC-BERT, which incorporates Mongolian agglutinative characteristics. We integrate Mongolian agglutinative features into both the tokenization stage and the PLM training stage: the tokenization stage converts a Mongolian word sequence into fine-grained sub-word tokens, each comprising a stem and one or more suffixes, while the PLM training stage employs a morphological-knowledge-based masking strategy to enhance the model's ability to learn agglutinative knowledge. Experimental results on the three downstream tasks demonstrate that our method surpasses the traditional BERT approach and successfully learns agglutinative knowledge of Mongolian.
Keywords: pre-training language model; Mongolian; agglutinative knowledge
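To make the two morphology-aware ideas described above concrete, the following is a minimal, hypothetical sketch of stem-plus-suffix segmentation and whole-morpheme masking. The suffix inventory, the transliterated example words, and the function names are illustrative assumptions only; they do not reproduce the actual IAMC-BERT tokenizer or its masking implementation.

```python
import random

# Hypothetical suffix inventory (Latin transliterations); the real IAMC-BERT
# suffix lexicon is not specified in the abstract.
SUFFIXES = ["iin", "uud", "aas", "tai"]

def segment(word):
    """Greedily split a word into a stem plus trailing suffixes (longest first).
    A crude stand-in for a morphological tokenizer that yields fine-grained
    sub-word tokens made of a stem and suffixes."""
    parts, remaining, changed = [], word, True
    while changed:
        changed = False
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if remaining.endswith(suf) and len(remaining) > len(suf):
                parts.insert(0, "##" + suf)   # mark suffixes as continuation pieces
                remaining = remaining[: -len(suf)]
                changed = True
                break
    return [remaining] + parts

def morpheme_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Whole-morpheme masking: when a word is selected, mask its stem and all
    of its suffixes together, so the model must recover the full
    agglutinative structure rather than a single sub-word piece."""
    groups, out = [], []
    for tok in tokens:                       # regroup sub-words into words
        if tok.startswith("##") and groups:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    for group in groups:
        if random.random() < mask_prob:
            out.extend([mask_token] * len(group))
        else:
            out.extend(group)
    return out

if __name__ == "__main__":
    words = ["nomuud", "bagshiin"]           # toy transliterated examples
    tokens = [piece for w in words for piece in segment(w)]
    print(tokens)                            # e.g. ['nom', '##uud', 'bagsh', '##iin']
    print(morpheme_mask(tokens, mask_prob=0.5))
```

Under these assumptions, the masking step operates on morpheme groups rather than on independent sub-word positions, which is one plausible way to realize the morphological-knowledge-based masking strategy the abstract describes.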