Incrementally and Flexibly Extracting Parallel Corpus from Web

Liu Xiaofeng (刘小峰)¹, Zheng Yucheng (郑禹铖)¹, Li Dongyang (李东阳)¹

Author information

1. School of Software, Huazhong University of Science and Technology, Wuhan 430074

Abstract

Extracting parallel corpora from the Web is important for machine translation and other multilingual language processing tasks. This paper proposes a flexible and efficient method for incrementally extracting parallel corpora from the Web. The method continuously downloads, scans, and analyzes Common Crawl's web crawl archives to incrementally update per-domain statistics of text length by language. For any given language pair of interest, it selects the websites to crawl based on these statistics and crawls them in a targeted manner, discarding links outside the multilingual domains and the target languages. The paper also proposes a new intermediate sentence alignment method that aligns sentences globally within each multilingual domain based on semantic similarity. Experiments show that: 1) incremental extraction continuously yields new parallel corpora, and extracting by specified language pair flexibly yields corpora for the language pairs of interest; 2) the proposed alignment method is significantly more efficient than the global method and can complete alignments that local methods cannot; 3) across 6 language directions, the extracted parallel corpora are of higher quality than existing open-source Web parallel corpora in the 4 medium- and low-resource directions, and close in quality to the best existing open-source Web parallel corpora in the 2 high-resource directions.
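The incremental statistics step described in the abstract can be illustrated with a minimal sketch. It assumes each scanned archive has already been reduced to `(domain, language, text_length)` records; the class name `DomainLanguageStats`, the method names, and the thresholds `min_chars` and `min_ratio` are hypothetical choices for illustration and are not taken from the paper, whose actual selection criteria are not stated in the abstract.

```python
from collections import defaultdict


class DomainLanguageStats:
    """Running per-domain totals of text length by language (illustrative only)."""

    def __init__(self):
        # domain -> language code -> total characters seen so far
        self.totals = defaultdict(lambda: defaultdict(int))

    def update(self, records):
        """Fold one batch of (domain, language, text_length) records into the totals.

        Calling this once per newly scanned archive makes the statistics incremental:
        earlier archives never need to be rescanned.
        """
        for domain, lang, length in records:
            self.totals[domain][lang] += length

    def candidate_domains(self, src, tgt, min_chars=1000, min_ratio=0.2):
        """Return domains holding enough text in both languages, largest first.

        min_chars and min_ratio are assumed thresholds, not values from the paper.
        """
        scored = []
        for domain, langs in self.totals.items():
            a, b = langs.get(src, 0), langs.get(tgt, 0)
            smaller, larger = min(a, b), max(a, b)
            if larger == 0 or smaller < min_chars:
                continue  # too little text in at least one language
            if smaller / larger >= min_ratio:
                scored.append((smaller, domain))
        return [domain for _, domain in sorted(scored, reverse=True)]


stats = DomainLanguageStats()
# First archive scan.
stats.update([("a.example", "en", 5000), ("a.example", "is", 4000),
              ("b.example", "en", 9000)])
# A later archive scan only adds to the existing totals.
stats.update([("b.example", "is", 100), ("c.example", "en", 2000),
              ("c.example", "is", 2500)])
first_pass = stats.candidate_domains("en", "is")
```

Because the totals only ever grow, a domain that is rejected for a language pair after one archive can qualify after a later one, which is one way to read the abstract's claim that incremental extraction keeps yielding new crawl entry points.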

Key words

Parallel corpus extraction/Sentence alignment/Corpus construction/Machine translation/Web mining

Publication year

2024

Computer Science (计算机科学)

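Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)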

Indexed in: CSTPCD, CSCD, Peking University Core (北大核心)

Impact factor: 0.944

ISSN: 1002-137X

Number of references: 21
