
Using Parallel Corpora to Construct Chinese Word Segmentation Components (應用平行語料建構中文斷詞組件)

Rather than directly providing a Chinese word segmentation service, open-source software available online lets users train Chinese segmentation models with their own training corpora and thereby implement their own segmentation functionality. If the training corpus can be built entirely by hand, models trained with current machine learning techniques can often achieve quite good segmentation results. In practice, however, fully manual annotation rarely yields enough training data. This paper uses Chinese-English parallel corpora and various lexicons, together with the detection of Chinese unknown words and near synonyms, to first build a rough segmenter that generates training data, and then uses open-source software to build a Chinese word segmentation service. In our current experiments, the segmentation service obtained with our procedure did not immediately outperform well-known Chinese segmentation services, but its performance was not far behind; the proposed procedure for generating training data offers an option that ordinary users can consider.
Instead of directly providing the service of Chinese segmentation, some open-source software allows us to train segmentation models with segmented text. The resulting models can perform quite well, if training data of high quality are available. In reality, it is not easy to obtain sufficient and excellent training data, unfortunately. We report an exploration of using parallel corpora and various lexicons with techniques of identifying unknown words and near synonyms to automatically generate training data for such open-source software. We achieved promising results of segmentation in current experiments. Although the results fell short of outperforming the well-known Chinese segmenters, we believe that the proposed approach offers a viable alternative for users of the open-source software to generate their own training data.
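As a concrete illustration of the kind of training data such open-source tools consume, the sketch below converts whitespace-segmented sentences into character-level BMES tags, the column format accepted by CRF-based segmenters such as CRF++. The example sentences, file name, and helper functions are illustrative assumptions, not taken from the paper, which does not specify the tool or tag scheme used.

```python
# Minimal sketch: turn automatically segmented sentences (e.g., output of a
# rough dictionary/parallel-corpus-based segmenter) into character-level
# BMES-tagged training data, a format accepted by CRF-based open-source
# segmenters such as CRF++.  Examples and names below are illustrative only.

def words_to_bmes(words):
    """Map each word to per-character B/M/E/S tags."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "M") for ch in word[1:-1])
            tagged.append((word[-1], "E"))
    return tagged

def write_crf_training_file(segmented_sentences, path):
    """Write one character-tag pair per line; blank lines separate sentences."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in segmented_sentences:
            for ch, tag in words_to_bmes(sentence.split()):
                out.write(f"{ch}\t{tag}\n")
            out.write("\n")

if __name__ == "__main__":
    # Hypothetical output of the rough segmenter described in the abstract.
    corpus = ["政治 大學 資訊 科學 系", "中文 斷詞 組件"]
    write_crf_training_file(corpus, "train.data")
```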


王瑞平、劉昭麟


Department of Computer Science, National Chengchi University

machine learning; corpus annotation; machine translation


Chung-li, Taiwan

24th Conference on Computational Linguistics and Speech Processing (ROCLING XXIV)

pp. 341-355

2012