Context Gate Based Multimodal Information Fusion for Image Description Translation

Li Zhifeng (李志峰)¹, Xu Minhan (徐旻涵)¹, Hong Yu (洪宇)¹, Yao Jianmin (姚建民)¹, Zhou Guodong (周国栋)¹


Author information

1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China

Abstract

Image description translation takes an image and its description in a source language and generates the description in a target language, using a neural network that fuses the image and text modalities end to end. Conventional approaches use salient image features to improve the translation from the source language into the target language. During translation, each target word is generated from both the source-language context and the target-language context. We observe that the source-language context mainly affects the adequacy and faithfulness of the translation, whereas the target-language context mainly affects its fluency and cohesion. Without an effective mechanism for balancing the contributions of the two contexts, a translation model tends to produce sentences that are fluent but inadequate, or adequate but not fluent. To address this problem, this paper proposes a decoding method based on gated multimodal information fusion to improve existing image description translation models. A source context gate adjusts the importance of the image features and of each source-language word, filtering out irrelevant image features; a target context gate dynamically adjusts the contributions of the source-language and target-language contexts to the translation, improving both adequacy and fluency. Experiments on the Multi30k dataset confirm the effectiveness of the method: on the Multi30k-16 English-German and English-French test sets and the Multi30k-17 English-German and English-French test sets, BLEU-4 improves over the baseline system by 1.3, 1.0, 1.5, and 1.4 points, respectively.
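The abstract describes a context gate that balances the source-language and target-language contexts when a target word is generated. The following is a minimal sketch of that general idea in plain Python, not the paper's actual formulation: the function names (`context_gate`, `fuse`), the parameters (`w_src`, `w_tgt`, `bias`), and the choice of an element-wise sigmoid gate with a convex combination are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def context_gate(src_ctx: Vector, tgt_ctx: Vector,
                 w_src: Matrix, w_tgt: Matrix, bias: Vector) -> Vector:
    """Hypothetical gate: z = sigmoid(W_src * src + W_tgt * tgt + b).

    Each component of z lies in (0, 1) and says how much weight the
    source context receives in the corresponding dimension.
    """
    a = matvec(w_src, src_ctx)
    b = matvec(w_tgt, tgt_ctx)
    return [sigmoid(x + y + c) for x, y, c in zip(a, b, bias)]


def fuse(src_ctx: Vector, tgt_ctx: Vector, gate: Vector) -> Vector:
    """Element-wise convex combination: z * src + (1 - z) * tgt."""
    return [g * s + (1.0 - g) * t for g, s, t in zip(gate, src_ctx, tgt_ctx)]
```

In this sketch, a gate value close to 1 lets the source context dominate (favoring adequacy) and a value close to 0 lets the target context dominate (favoring fluency). In a trained model the weights would be learned jointly with the decoder rather than set by hand.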

Key words

image description translation/multimodal machine translation/context gates/adequacy and fluency


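Funding

National Natural Science Foundation of China (62076174)

National Natural Science Foundation of China (61773276)

National Natural Science Foundation of China (61836007)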

Publication year

2024

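Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences

Indexed in: CSTPCD, CSCD, CHSSCD, Peking University Core Journals (北大核心)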

Impact factor: 0.8

ISSN: 1003-0077

Number of references: 37
