Exploring Chinese Pre-Training with Mixed-Grained Encoding and IDF-Masking

Currently, most Chinese pre-trained language models adopt character-level encoding, which incurs a large computational overhead on long sequences. Although word-level encoding can alleviate this issue, it brings other problems such as out-of-vocabulary words and data sparsity. In this paper, we improve Chinese pre-trained language models with mixed-grained tokenization. The vocabulary for this encoding is obtained from large-scale pre-training corpora and thereby alleviates the out-of-vocabulary and data-sparsity issues. To further improve pre-training efficiency, we introduce a selective masked language modeling method, IDF-masking, based on inverse document frequency (IDF) statistics collected on the pre-training corpora. Extensive experiments show that, compared with previous Chinese pre-trained language models, the proposed model achieves better or comparable performance on various Chinese natural language processing tasks while encoding text more efficiently.
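The two ideas in the abstract can be sketched in a few lines of Python. Note that both the segmentation rule (greedy longest-match over a word vocabulary, with character fallback for out-of-vocabulary spans) and the mask-selection rule (masking the highest-IDF positions) are illustrative assumptions; the abstract does not specify the exact algorithms.

```python
import math
from collections import Counter

def mixed_grained_tokenize(text, vocab, max_word_len=4):
    """Greedy longest-match segmentation over a word vocabulary,
    falling back to single characters for out-of-vocabulary spans
    (one plausible reading of "mixed-grained encoding")."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if j - i > 1 and text[i:j] in vocab:
                tokens.append(text[i:j])  # word-level token
                i = j
                break
        else:
            tokens.append(text[i])  # character-level fallback
            i += 1
    return tokens

def compute_idf(documents):
    """idf(w) = log(N / df(w)), with df(w) the number of documents
    (token lists) that contain w, collected over the corpus."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}

def idf_mask(tokens, idf, mask_ratio=0.15, mask_token="[MASK]"):
    """Selectively mask the highest-IDF (rarest, most informative)
    positions in a sequence -- an assumed selection rule."""
    n_mask = max(1, int(len(tokens) * mask_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: idf.get(tokens[i], 0.0), reverse=True)
    masked = set(ranked[:n_mask])
    return [mask_token if i in masked else t for i, t in enumerate(tokens)]

# Toy corpus of pre-segmented documents.
corpus = [["模型", "编码", "中文"], ["中文", "数据"], ["模型", "遮蔽", "中文"]]
idf = compute_idf(corpus)

print(mixed_grained_tokenize("中文预训练语言模型", {"预训练", "语言模型", "中文"}))
# → ['中文', '预训练', '语言模型']
print(idf_mask(["中文", "模型", "遮蔽"], idf))
# → ['中文', '模型', '[MASK]']  ("遮蔽" has the highest IDF)
```

Because "中文" occurs in every toy document, its IDF is log(3/3) = 0, so it is the last candidate for masking; tokens appearing in a single document score log(3) and are masked first.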

Keywords: Chinese pre-training; mixed-grained encoding; IDF-masking

SHAO Yunfan, SUN Tianxiang, QIU Xipeng


School of Computer Science and Technology, Fudan University, Shanghai 200433, China


Funding: National Natural Science Foundation of China (Grant No. 62022027)

2024

Journal of Chinese Information Processing
Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences

Indexed in: CSTPCD; CHSSCD; Peking University Core Journals
Impact factor: 0.8
ISSN:1003-0077
Year, Volume (Issue): 2024, 38(1)