In low-resource neural machine translation, the translation quality of long sentences is generally poor. Chinese and Vietnamese differ considerably, and Chinese-Vietnamese is a typical low-resource language pair. The processing of long sentences should preserve their semantic information as far as possible. Therefore, a method for processing long sentences based on syntactic structure features is proposed. First, syntactic parsing is performed on the long sentences in the original corpus; short sentences are then extracted according to the syntactic parse tree, and leaf-node words far from the root node are marked. Finally, back-translation of the extracted short sentences is used to generate pseudo-parallel data that extends the corpus, and the marked words are trained with a weighted-combination replacement using semantically similar words from the original long sentence. Experiments show that this method improves model performance and significantly improves the translation quality of long sentences.
Key words
low-resource neural machine translation/long-sentence translation/Chinese-Vietnamese language/semantic information/syntactic structure features
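
The extraction and marking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tuple-based tree representation, the function names, the clause labels ("S", "IP"), and the thresholds min_words and min_depth are all assumptions chosen for illustration, and the toy English tree stands in for a real parser's output. Back-translation and the weighted replacement training are not shown.

```python
from typing import List, Tuple, Union

# Assumed representation: a node is (label, [children]); a leaf is a word string.
Tree = Union[str, Tuple[str, list]]


def leaves(tree: Tree) -> List[str]:
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1]:
        words.extend(leaves(child))
    return words


def extract_short_sentences(tree: Tree, clause_labels=("S", "IP"), min_words=3):
    """Collect the word sequences of clause-level subtrees below the root.

    clause_labels and min_words are hypothetical settings, not values
    taken from the paper.
    """
    found: List[List[str]] = []

    def visit(node: Tree, is_root: bool) -> None:
        if isinstance(node, str):
            return
        label, children = node
        if not is_root and label in clause_labels:
            words = leaves(node)
            if len(words) >= min_words:
                found.append(words)
        for child in children:
            visit(child, False)

    visit(tree, True)
    return found


def mark_deep_leaves(tree: Tree, min_depth=6):
    """Return (word, depth) for leaves at least min_depth edges from the root."""
    marked: List[Tuple[str, int]] = []

    def visit(node: Tree, depth: int) -> None:
        if isinstance(node, str):
            if depth >= min_depth:
                marked.append((node, depth))
            return
        for child in node[1]:
            visit(child, depth + 1)

    visit(tree, 0)
    return marked


# Toy parse of "she said that the model translates long sentences".
toy = ("S", [
    ("NP", [("PRP", ["she"])]),
    ("VP", [
        ("VBD", ["said"]),
        ("SBAR", [
            ("IN", ["that"]),
            ("S", [
                ("NP", [("DT", ["the"]), ("NN", ["model"])]),
                ("VP", [
                    ("VBZ", ["translates"]),
                    ("NP", [("JJ", ["long"]), ("NNS", ["sentences"])]),
                ]),
            ]),
        ]),
    ]),
])

short_sentences = extract_short_sentences(toy)
deep_words = mark_deep_leaves(toy, min_depth=7)
print(short_sentences)
print(deep_words)
```

In this sketch the embedded clause is extracted as a candidate short sentence, and only the most deeply nested words are marked; in practice the depth threshold would need tuning against the parser and corpus in use.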