English-Khmer Bilingual Parallel Sentences Extraction Based on Maximum Entropy Model
English-Khmer bilingual parallel corpora are a basic resource for Khmer information processing and are important for advancing the field. Once parallel bilingual websites have been obtained, the problem of extracting parallel sentence pairs can be treated as a classification of candidate sentence pairs. We construct a maximum entropy classifier to identify the parallel sentence pairs among the candidates. The English-Khmer sentence pair classifier is trained on features including sentence length, the ratio of characteristic vocabulary, sentence position, and other characteristics. We then apply this classifier to the candidate English-Khmer sentence pairs to determine which of them are parallel, yielding a resource of English-Khmer parallel sentence pairs. Experiments comparing classifiers trained with different feature sets show that the final classifier achieves precision and recall above 90 percent. This suggests that the approach performs well at identifying parallel sentences.
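As a rough illustration of the classification step, the sketch below trains a binary maximum entropy classifier (which, for two classes, is equivalent to logistic regression) on three numeric features per candidate pair. Everything concrete here is an assumption made for illustration rather than taken from the paper: the feature definitions (`len_ratio`, `vocab_ratio`, `pos_diff`), the toy training values, the learning rate, and the 0.5 decision threshold are all hypothetical.

```python
import math

def features(len_ratio, vocab_ratio, pos_diff):
    """Build a feature vector for one candidate sentence pair.

    Hypothetical feature definitions (assumptions, not from the paper):
      len_ratio   - shorter sentence length divided by longer, in [0, 1]
      vocab_ratio - share of characteristic tokens (e.g. numbers, names)
                    found on both sides, in [0, 1]
      pos_diff    - absolute difference of relative positions in the
                    two documents, in [0, 1]
    The leading 1.0 is a bias term.
    """
    return [1.0, len_ratio, vocab_ratio, pos_diff]

def predict_proba(weights, x):
    """Probability that the pair is parallel under the binary maxent model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient ascent on the log-likelihood."""
    weights = [0.0] * len(samples[0])
    n = len(samples)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(samples, labels):
            err = y - predict_proba(weights, x)  # observed minus expected
            for j, xj in enumerate(x):
                grad[j] += err * xj
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights

# Made-up toy training data, for illustration only (1 = parallel, 0 = not).
TOY_SAMPLES = [
    features(0.95, 0.8, 0.05), features(0.90, 0.7, 0.10),
    features(0.85, 0.9, 0.00), features(0.80, 0.6, 0.10),
    features(0.40, 0.1, 0.60), features(0.50, 0.2, 0.80),
    features(0.30, 0.0, 0.50), features(0.60, 0.1, 0.90),
]
TOY_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

weights = train(TOY_SAMPLES, TOY_LABELS)

def is_parallel(len_ratio, vocab_ratio, pos_diff, threshold=0.5):
    """Classify a candidate pair; the 0.5 threshold is an assumed default."""
    return predict_proba(weights, features(len_ratio, vocab_ratio, pos_diff)) >= threshold

print(is_parallel(0.9, 0.8, 0.05))
print(is_parallel(0.4, 0.1, 0.70))
```

In practice the features would be computed from the aligned web pages and the model trained on labeled English-Khmer pairs; comparing models trained on different feature subsets, as the experiments describe, amounts to dropping entries from the vector returned by `features`.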