Exploring Chinese Pre-Training with Mixed-Grained Encoding and IDF-Masking

Currently, most Chinese pre-trained language models adopt character-level encoding, which incurs a large computational overhead on long sequences. Although word-level encoding can alleviate this issue, it brings other problems such as out-of-vocabulary words and data sparsity. In this paper, we improve Chinese pre-trained language models with mixed-grained tokenization. The vocabulary for this encoding is obtained from large-scale pre-training corpora and thereby alleviates the out-of-vocabulary and data-sparsity issues. To further improve pre-training efficiency, we introduce a selective masked language modeling method, IDF-masking, based on inverse document frequency (IDF) statistics collected on the pre-training corpora. Extensive experiments show that, compared with previous Chinese pre-trained language models, the proposed model achieves better or comparable performance on various Chinese natural language processing tasks while encoding text more efficiently.
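The two ideas in the abstract can be sketched in a few lines of Python. Note that both the segmentation rule (greedy longest-match over a word vocabulary, with character fallback for out-of-vocabulary spans) and the mask-selection rule (masking the highest-IDF positions) are illustrative assumptions; the abstract does not specify the exact algorithms.

```python
import math
from collections import Counter

def mixed_grained_tokenize(text, vocab, max_word_len=4):
    """Greedy longest-match segmentation over a word vocabulary,
    falling back to single characters for out-of-vocabulary spans
    (one plausible reading of "mixed-grained encoding")."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if j - i > 1 and text[i:j] in vocab:
                tokens.append(text[i:j])  # word-level token
                i = j
                break
        else:
            tokens.append(text[i])  # character-level fallback
            i += 1
    return tokens

def compute_idf(documents):
    """idf(w) = log(N / df(w)), with df(w) the number of documents
    (token lists) that contain w, collected over the corpus."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}

def idf_mask(tokens, idf, mask_ratio=0.15, mask_token="[MASK]"):
    """Selectively mask the highest-IDF (rarest, most informative)
    positions in a sequence -- an assumed selection rule."""
    n_mask = max(1, int(len(tokens) * mask_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: idf.get(tokens[i], 0.0), reverse=True)
    masked = set(ranked[:n_mask])
    return [mask_token if i in masked else t for i, t in enumerate(tokens)]

# Toy corpus of pre-segmented documents.
corpus = [["模型", "编码", "中文"], ["中文", "数据"], ["模型", "遮蔽", "中文"]]
idf = compute_idf(corpus)

print(mixed_grained_tokenize("中文预训练语言模型", {"预训练", "语言模型", "中文"}))
# → ['中文', '预训练', '语言模型']
print(idf_mask(["中文", "模型", "遮蔽"], idf))
# → ['中文', '模型', '[MASK]']  ("遮蔽" has the highest IDF)
```

Because "中文" occurs in every toy document, its IDF is log(3/3) = 0, so it is the last candidate for masking; tokens appearing in a single document score log(3) and are masked first.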

Keywords: Chinese pre-training; mixed-grained encoding; IDF-masking

SHAO Yunfan, SUN Tianxiang, QIU Xipeng


School of Computer Science and Technology, Fudan University, Shanghai 200433, China


Funding: National Natural Science Foundation of China (Grant No. 62022027)

2024

Journal of Chinese Information Processing
Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences

Indexed in: CSTPCD; CHSSCD; Peking University Core Journals
Impact factor: 0.8
ISSN:1003-0077
Year, Volume (Issue): 2024, 38(1)