General and Domain Adaptive Chinese Spelling Check with Error Consistent Pretraining
Original link: arXiv
The lack of labeled data is one of the major bottlenecks for Chinese Spelling Check (CSC). Existing work expands the supervised corpus by automatically generating errors from unlabeled data. However, there is a large gap between real input scenarios and such automatically generated corpora. We therefore develop a competitive general speller, ECSpell, which adopts an Error Consistent masking strategy to create pretraining data: the error types injected into automatically generated sentences are constrained to be consistent with those observed in real scenes. Experimental results show that our model outperforms previous state-of-the-art models on the general benchmark.
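To make the idea concrete, the snippet below is a minimal sketch of error-consistent corruption: clean text is corrupted so that the injected error types follow a distribution matching real spelling errors (e.g. mostly phonetically confusable characters, some visually confusable ones). The confusion sets, error-type probabilities, and the `corrupt` helper are hypothetical stand-ins for illustration, not the paper's actual resources or algorithm.

```python
import random

# Hypothetical error-type distribution and confusion sets; the paper's actual
# statistics and resources are not reproduced here.
ERROR_TYPE_PROBS = {"phonetic": 0.7, "visual": 0.2, "random": 0.1}
PHONETIC_CONFUSION = {"的": ["地", "得"], "在": ["再"]}   # same/similar pinyin
VISUAL_CONFUSION = {"末": ["未"], "己": ["已"]}            # similar glyphs
VOCAB = list("的地得在再末未己已天气很好")

def corrupt(sentence: str, error_rate: float = 0.15):
    """Build a (noisy, clean) pretraining pair whose injected error types
    follow the distribution observed in real spelling errors."""
    chars = list(sentence)
    for i, ch in enumerate(chars):
        if random.random() >= error_rate:
            continue
        error_type = random.choices(
            list(ERROR_TYPE_PROBS), weights=list(ERROR_TYPE_PROBS.values())
        )[0]
        if error_type == "phonetic" and ch in PHONETIC_CONFUSION:
            chars[i] = random.choice(PHONETIC_CONFUSION[ch])
        elif error_type == "visual" and ch in VISUAL_CONFUSION:
            chars[i] = random.choice(VISUAL_CONFUSION[ch])
        elif error_type == "random":
            chars[i] = random.choice(VOCAB)
    return "".join(chars), sentence

if __name__ == "__main__":
    print(corrupt("今天的天气很好"))
```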
Moreover, spellers often work within a particular domain in real life. Experiments on the domain-specific datasets we build show that general models perform poorly there, largely because of uncommon domain terms. Inspired by the common practice of input methods, we propose adding an alterable user dictionary to handle the zero-shot domain adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token-classification-based speller. Our experiments demonstrate that ECSpell$^{UD}$, i.e., ECSpell combined with UD, surpasses all other baselines by a large margin and even approaches its performance on the general benchmark.
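For intuition, here is a simplified, hypothetical sketch of what user-dictionary-guided inference can look like: per-position candidate distributions produced by a token-classification speller are re-scored so that characters completing a term from the user dictionary are preferred. The greedy decoding, the `ud_guided_decode` name, and the `boost` factor are illustrative assumptions, not the paper's actual UD module.

```python
from typing import Dict, List, Set

def ud_guided_decode(
    candidates: List[Dict[str, float]],  # per position: {candidate char: speller probability}
    user_dict: Set[str],
    boost: float = 2.0,
) -> str:
    """Greedy decoding where a candidate's score is boosted whenever choosing
    it makes a user-dictionary term end at the current position."""
    output: List[str] = []
    for i, dist in enumerate(candidates):
        scored = dict(dist)
        for char, prob in dist.items():
            for term in user_dict:
                k = len(term)
                if k <= i + 1 and "".join(output[i + 1 - k:i]) + char == term:
                    scored[char] = prob * boost
        output.append(max(scored, key=scored.get))
    return "".join(output)

# Example: the user dictionary contains the domain term "色谱柱"; without the
# boost, the speller's top choice at the last position would be the wrong "住".
cands = [{"色": 0.9, "涩": 0.1}, {"谱": 0.8, "普": 0.2}, {"柱": 0.45, "住": 0.55}]
print(ud_guided_decode(cands, {"色谱柱"}))  # -> 色谱柱
```

A real system would likely fold such scoring into the speller's decoding rather than apply it as a post hoc re-ranking, but the sketch captures why an alterable dictionary helps zero-shot domain adaptation: domain terms the general model has never seen can still win at inference time.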