Abstract
Extracting Chinese-Vietnamese parallel sentence pairs is an important way to alleviate the scarcity of Chinese-Vietnamese parallel corpora. Parallel sentence pair extraction can be cast as a sentence similarity classification task in a shared semantic space, and its core is the alignment of the bilingual semantic spaces. Traditional semantic space alignment methods rely on large-scale bilingual parallel corpora, which are relatively difficult to obtain for Vietnamese, a low-resource language. To address this problem, this paper proposes a Chinese-Vietnamese parallel sentence pair extraction method that combines cross-lingual bilingual pre-training based on a seed dictionary with a Bi-LSTM (Bi-directional Long Short-Term Memory) network. Pre-training requires only large amounts of Chinese and Vietnamese monolingual data and a Chinese-Vietnamese seed dictionary, which is used to map the two languages into a common semantic space for word alignment. Bi-LSTM and CNN (Convolutional Neural Network) components then extract the global and local features of sentences, respectively, to maximally represent the semantic relevance between Chinese-Vietnamese sentence pairs. Experimental results show that the proposed model improves the F1 score by 7.1%, outperforming the baseline model.
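The framing of extraction as similarity classification in a shared space can be illustrated with a minimal sketch. It is not the paper's model: the Bi-LSTM and CNN encoders are replaced here by simple averaging of word vectors, and the toy embeddings, the 0.8 threshold, and all function names are assumptions made up for illustration. The sketch assumes the word vectors of both languages have already been mapped into a common space, scores a candidate pair by cosine similarity, and evaluates decisions with F1.

```python
import math

# Toy word vectors, assumed to already live in a shared Chinese-Vietnamese
# space. All values are invented for illustration only.
EMB = {
    "我": [0.90, 0.10, 0.00],
    "爱": [0.10, 0.90, 0.10],
    "猫": [0.00, 0.20, 0.90],
    "tôi": [0.88, 0.12, 0.02],
    "yêu": [0.12, 0.85, 0.10],
    "mèo": [0.05, 0.15, 0.92],
    "chó": [0.70, 0.00, -0.60],
}
DIM = 3


def sentence_vector(tokens):
    """Average the vectors of known tokens (stand-in for a learned encoder)."""
    known = [EMB[t] for t in tokens if t in EMB]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def is_parallel(zh_tokens, vi_tokens, threshold=0.8):
    """Classify a sentence pair as parallel if similarity meets the threshold."""
    score = cosine(sentence_vector(zh_tokens), sentence_vector(vi_tokens))
    return score >= threshold


def f1_score(predictions, golds):
    """F1 over boolean decisions: harmonic mean of precision and recall."""
    tp = sum(1 for p, g in zip(predictions, golds) if p and g)
    fp = sum(1 for p, g in zip(predictions, golds) if p and not g)
    fn = sum(1 for p, g in zip(predictions, golds) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, `is_parallel(["我", "爱", "猫"], ["tôi", "yêu", "mèo"])` compares the two averaged sentence vectors, while `f1_score` takes the predicted and gold decisions for a batch of candidate pairs. In a fuller system, the averaging step would be replaced by the learned sentence encoder and the fixed threshold by a trained classifier.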
Key words
Chinese-Vietnamese/parallel sentence pair extraction/cross-lingual pre-training/common semantic space/Bi-LSTM
Conference name
Chinese National Conference on Computational Linguistics
Conference location
Haikou(CN)
Conference proceedings
19th Chinese National Conference on Computational Linguistics