Computer Science, 2025, Vol. 52, Issue (1): 315-322. DOI: 10.11896/jsjkx.231100107

Generation of Enrich Semantic Video Dialogue Based on Hierarchical Visual Attention

ZHAO Qian, GUO Bin, LIU Yubo, SUN Zhuo, WANG Hao, CHEN Mengqi

Author information

  • 1. School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China

Abstract

Video dialogue is an important research topic in multimodal human-computer interaction. Videos contain a large amount of spatio-temporal visual information and complex multimodal relationships, which makes designing efficient video dialogue systems challenging. Existing video dialogue models use cross-modal attention mechanisms or graph structures to capture the correlation between video semantics and dialogue context. However, they process all visual information at a single coarse granularity, so they tend to miss fine-grained spatio-temporal information, such as the continuous motion of the same object over time or objects located in non-salient regions of an image, which degrades dialogue performance. At the same time, processing all visual information at fine granularity increases latency and reduces dialogue fluency. This paper therefore proposes a semantic-rich video dialogue generation method based on hierarchical visual attention. First, conditioned on the dialogue context, global visual attention captures global visual semantics and localizes the time span/spatial region of the video that the dialogue input refers to. Second, a local attention mechanism captures fine-grained visual information within the localized region, and the dialogue response is generated with the help of multi-task learning. Experimental results on the DSTC7 AVSD dataset show that, compared with existing baseline methods, the proposed method generates dialogue with higher accuracy and diversity, improving the METEOR score by 23.24%.
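The two-stage idea described in the abstract (coarse global attention to localize, then fine-grained local attention inside the localized part) can be illustrated with a minimal sketch. This is not the authors' model: the function names, the use of scaled dot-product attention, the top-k frame selection, and the toy feature vectors are all assumptions made for illustration, and the multi-task learning objective is left out entirely.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Scaled dot-product attention weights of one query over key vectors."""
    scale = math.sqrt(len(query))
    return softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])

def weighted_sum(weights, vectors):
    """Attention-weighted average of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def hierarchical_attention(query, frame_feats, region_feats, top_k=2):
    """Hypothetical two-stage attention (an assumption, not the paper's code).

    query        -- dialogue-context vector
    frame_feats  -- one coarse feature vector per frame
    region_feats -- per frame, a list of fine-grained region vectors
    """
    # Stage 1: global attention over coarse per-frame features.
    global_w = attend(query, frame_feats)
    global_ctx = weighted_sum(global_w, frame_feats)

    # Localize: keep the top_k most attended frames, in temporal order.
    ranked = sorted(range(len(frame_feats)), key=lambda t: global_w[t], reverse=True)
    selected = sorted(ranked[:top_k])

    # Stage 2: local attention restricted to regions of the selected frames,
    # so fine-grained features are only processed where they matter.
    local_regions = [r for t in selected for r in region_feats[t]]
    local_w = attend(query, local_regions)
    local_ctx = weighted_sum(local_w, local_regions)
    return global_ctx, local_ctx, selected

# Toy data: 4 frames, 2 regions per frame, 2-dimensional features.
query = [1.0, 0.0]
frame_feats = [[0.1, 0.9], [2.0, 0.1], [1.5, 0.2], [0.0, 1.0]]
region_feats = [
    [[0.0, 1.0], [0.2, 0.8]],
    [[3.0, 0.0], [0.5, 0.5]],
    [[1.0, 0.0], [2.0, 0.3]],
    [[0.0, 0.5], [0.1, 0.9]],
]
global_ctx, local_ctx, selected = hierarchical_attention(
    query, frame_feats, region_feats, top_k=2
)
```

In this sketch the fine-grained region features of unselected frames are never scored, which is one simple way to reflect the latency concern raised in the abstract; a real system would feed both context vectors to a response decoder.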

Key words

Multi-modal human-computer interaction/Hierarchical attention mechanism/Multi-task learning/Scene perception

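Publication year

2025
Computer Science
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)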

Computer Science

Indexing: Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X