Incrementally and Flexibly Extracting Parallel Corpus from Web
Extracting parallel corpora from the web is important for machine translation and other multilingual processing tasks. This paper proposes an incremental method for extracting parallel corpora from the web. The method incrementally updates per-domain language text length statistics by continuously downloading, scanning, and analyzing Common Crawl's web crawl archives. For any language pair of interest, the web sites to be crawled are selected from these per-domain statistics and crawled for the target language pair, while non-target domains and links are discarded. The paper also proposes a new intermediate sentence alignment method, which globally aligns sentences by semantic similarity within multilingual domains. Experiments show that: 1) the extraction method can continuously obtain new parallel corpora and can flexibly obtain a target language pair of interest by extracting only the specified language pairs; 2) the proposed intermediate method is significantly more efficient at alignment than the global method, and can complete alignments that local methods cannot; 3) across 6 language directions, the extracted parallel corpora are superior to existing open-source web parallel corpora for the 4 medium-low-resource languages and close to the best available open-source web parallel corpora for the 2 high-resource languages.
Parallel corpus extraction; Sentence alignment; Corpus construction; Machine translation; Web mining
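The domain selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record format `(domain, language, text_length)`, the function names `update_stats` and `select_domains`, and the `min_chars` threshold are all assumptions made for illustration. The sketch only shows the general idea of accumulating per-domain language text lengths incrementally and then picking domains that hold text in both languages of a requested pair.

```python
from collections import defaultdict

def update_stats(stats, records):
    """Fold one newly scanned archive into the running statistics.

    `records` is an iterable of (domain, lang, text_length) tuples
    (a hypothetical format, assumed for this sketch).
    """
    for domain, lang, length in records:
        stats[domain][lang] += length
    return stats

def select_domains(stats, src, tgt, min_chars=1000):
    """Return domains with at least `min_chars` of text in both languages.

    `min_chars` is an assumed threshold, not a value from the paper.
    """
    return sorted(
        domain
        for domain, langs in stats.items()
        if langs.get(src, 0) >= min_chars and langs.get(tgt, 0) >= min_chars
    )

# domain -> language -> accumulated text length in characters
stats = defaultdict(lambda: defaultdict(int))

# Two archives arrive one after another; statistics are updated in place.
update_stats(stats, [("example.org", "en", 5000), ("example.org", "de", 800)])
update_stats(stats, [("example.org", "de", 700), ("example.net", "en", 9000)])

print(select_domains(stats, "en", "de"))
```

Because the statistics are plain running sums, a new archive can be added without rescanning earlier ones, and a different language pair can be requested later from the same table.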