Chinese-Vietnamese Parallel Sentence Pair Extraction Method Based on Cross-lingual Bilingual Pre-training and Bi-LSTM

基于跨语言双语预训练及Bi-LSTM的汉-越平行句对抽取方法

刘畅¹, 高盛祥¹, 余正涛¹, 黄于欣¹, 尤丛丛¹


Author information

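1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500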

摘要

汉越平行句对抽取是缓解汉越平行语料库数据稀缺的重要方法.平行句对抽取可转换为同一语义空间下的句子相似性分类任务,其核心在于双语语义空间对齐.传统语义空间对齐方法依赖于大规模的双语平行语料,越南语作为低资源语言获取大规模平行语料相对困难.针对这个问题本文提出一种利用种子词典进行跨语言双语预训练及Bi-LSTM(Bi-directional Long Short-Term Memory)的汉-越平行句对抽取方法.预训练中仅需要大量的汉越单语和一个汉越种子词典,通过利用汉越种子词典将汉越双语映射到公共语义空间进行词对齐.再利用Bi-LSTM和CNN(Convolutional Neural Networks)分别提取句子的全局特征和局部特征从而最大化表示汉-越句对之间的语义相关性.实验结果表明,本文模型在F1得分上提升7.1%,优于基线模型.

Abstract

Extracting Chinese-Vietnamese parallel sentence pairs is an important way to alleviate the scarcity of Chinese-Vietnamese parallel corpora. Parallel sentence pair extraction can be cast as a sentence similarity classification task in a shared semantic space, and its core problem is aligning the semantic spaces of the two languages. Traditional semantic space alignment methods rely on large-scale bilingual parallel corpora, which are relatively difficult to obtain for Vietnamese, a low-resource language. To address this problem, this paper proposes a Chinese-Vietnamese parallel sentence pair extraction method that combines seed-dictionary-based cross-lingual bilingual pre-training with Bi-LSTM (Bi-directional Long Short-Term Memory). Pre-training requires only a large amount of Chinese and Vietnamese monolingual data and a Chinese-Vietnamese seed dictionary; the seed dictionary is used to map both languages into a common semantic space for word alignment. Bi-LSTM and CNN (Convolutional Neural Networks) are then used to extract the global and local features of sentences, respectively, so as to maximize the representation of the semantic relevance between Chinese-Vietnamese sentence pairs. Experimental results show that the proposed model improves the F1 score by 7.1% and outperforms the baseline model.
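The framing of extraction as similarity classification in a shared semantic space can be illustrated with a minimal sketch. This is not the paper's model: the word vectors below are invented toy values assumed to already lie in a common Chinese-Vietnamese space, simple averaging stands in for the Bi-LSTM/CNN sentence encoder, and the names `SHARED_SPACE`, `encode`, `is_parallel` and the 0.9 threshold are hypothetical.

```python
import math

# Toy word vectors, assumed to already be aligned in one shared space
# (in the paper this alignment comes from seed-dictionary pre-training).
# All numbers are invented for illustration only.
SHARED_SPACE = {
    "我": [0.90, 0.10, 0.00], "tôi": [0.88, 0.12, 0.02],
    "喜欢": [0.10, 0.90, 0.10], "thích": [0.12, 0.85, 0.08],
    "茶": [0.00, 0.20, 0.90], "trà": [0.05, 0.18, 0.92],
}


def encode(tokens):
    """Average the vectors of known tokens (a stand-in for a learned encoder)."""
    vectors = [SHARED_SPACE[t] for t in tokens if t in SHARED_SPACE]
    if not vectors:
        raise ValueError("no token of the sentence is in the vocabulary")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_parallel(zh_tokens, vi_tokens, threshold=0.9):
    """Classify a sentence pair as parallel if its similarity exceeds the threshold."""
    return cosine(encode(zh_tokens), encode(vi_tokens)) > threshold


zh = ["我", "喜欢", "茶"]
print(is_parallel(zh, ["tôi", "thích", "trà"]))  # candidate translation
print(is_parallel(zh, ["trà"]))                  # unrelated fragment
```

In a real system the threshold would be replaced by a trained classifier over the sentence-pair representation, and the encoder would be learned rather than a plain average.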

关键词

汉-越/平行句对抽取/跨语言预训练/公共语义空间/Bi-LSTM

Key words

Chinese-Vietnamese/parallel sentence pair extraction/cross-lingual pre-training/common semantic space/Bi-LSTM


Conference name

Chinese National Conference on Computational Linguistics

Conference location

Haikou(CN)

Proceedings

19th Chinese National Conference on Computational Linguistics

Pages

457-466

Publication year

2020
