Instead of directly providing the service of Chinese segmentation, some open-source software allows us to train segmentation models with segmented text. The resulting models can perform quite well when high-quality training data are available. Unfortunately, obtaining sufficient, high-quality training data is not easy in practice. We report an exploration of using parallel corpora and various lexicons, together with techniques for identifying unknown words and near synonyms, to automatically generate training data for such open-source software. We achieved promising segmentation results in our current experiments. Although the results fell short of outperforming well-known Chinese segmenters, we believe that the proposed approach offers a viable alternative for users of open-source software to generate their own training data.
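To illustrate the kind of training data such open-source segmenters consume, the following is a minimal sketch, not taken from the paper, that converts whitespace-segmented Chinese text into per-character BMES tags, a common input format for CRF-style segmenters; the file names are hypothetical.

```python
# Sketch (assumption, not the authors' pipeline): turn whitespace-segmented
# Chinese text into per-character BMES tags for a CRF-style segmenter.

def to_bmes(segmented_line: str):
    """Yield (character, tag) pairs for one whitespace-segmented line."""
    for word in segmented_line.split():
        if len(word) == 1:
            yield word, "S"              # single-character word
        else:
            yield word[0], "B"           # word-initial character
            for ch in word[1:-1]:
                yield ch, "M"            # word-medial characters
            yield word[-1], "E"          # word-final character

if __name__ == "__main__":
    # Hypothetical file names: "segmented.txt" holds segmented sentences,
    # "train.tsv" receives one character-tag pair per line.
    with open("segmented.txt", encoding="utf-8") as src, \
         open("train.tsv", "w", encoding="utf-8") as dst:
        for line in src:
            for ch, tag in to_bmes(line):
                dst.write(f"{ch}\t{tag}\n")
            dst.write("\n")              # blank line separates sentences
```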
王瑞平、劉昭麟
Department of Computer Science, National Chengchi University
Machine Learning, Corpus Annotation, Machine Translation
Chung-li, Taiwan
24th Conference on Computational Linguistics and Speech Processing