Exploring Chinese Pre-Training with Mixed-Grained Encoding and IDF-Masking
Currently, most Chinese pre-trained language models adopt character-level encoding, which incurs a large computational overhead for long sequences. Although word-level encoding can alleviate this issue, it introduces other problems such as out-of-vocabulary words and data sparsity. In this paper, we improve Chinese pre-trained language models with mixed-grained tokenization. The vocabulary of our encoding is derived from large-scale corpora and can therefore alleviate the out-of-vocabulary and data-sparsity issues. To further improve pre-training efficiency, we introduce a selective masked language modeling method, IDF-masking, based on the inverse document frequency (IDF) statistics collected from the pre-training corpora. Extensive experiments show that, compared with previous Chinese pre-trained language models, the proposed model achieves better or comparable performance on various Chinese natural language processing tasks while encoding text more efficiently.
Chinese pre-training; mixed-grained encoding; IDF-masking
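The abstract only summarizes IDF-masking at a high level. As a rough illustration of the idea, the sketch below computes corpus-level IDF and biases the choice of MLM mask positions toward high-IDF (rarer, more informative) tokens. The function names (compute_idf, idf_masking), the smoothed IDF formula, and the 15% masking budget are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import Counter

def compute_idf(documents):
    """IDF over a tokenized corpus; `documents` is a list of token lists."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    # Smoothed IDF so every score is strictly positive (assumed formula).
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def idf_masking(tokens, idf, mask_token="[MASK]", mask_ratio=0.15):
    """Pick masking positions with probability weighted by IDF.

    Higher-IDF tokens are masked more often than under uniform random
    masking; weighted sampling without replacement uses the
    Efraimidis-Spirakis key trick.
    """
    n_mask = max(1, int(round(len(tokens) * mask_ratio)))
    default = max(idf.values()) if idf else 1.0  # unseen tokens treated as rare
    keys = [(random.random() ** (1.0 / idf.get(tok, default)), i)
            for i, tok in enumerate(tokens)]
    positions = {i for _, i in sorted(keys, reverse=True)[:n_mask]}

    masked, labels = list(tokens), [None] * len(tokens)
    for i in positions:
        labels[i] = masked[i]   # MLM target: the original token
        masked[i] = mask_token
    return masked, labels

# Example: content words with high IDF are masked more often than frequent ones.
corpus = [["我", "喜欢", "自然", "语言", "处理"], ["我", "喜欢", "机器", "学习"]]
idf = compute_idf(corpus)
print(idf_masking(corpus[0], idf))
```

In this sketch the masking budget and the way unseen tokens are scored are arbitrary choices; the actual model presumably tunes these against the mixed-grained vocabulary described above.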