Generation of Semantically Rich Video Dialogue Based on Hierarchical Visual Attention
Video dialogue has emerged as an important research direction in multimodal human-computer interaction. The large volume of temporal and spatial visual information and the complex multimodal relationships make it challenging to design efficient video dialogue systems. Existing video dialogue systems use cross-modal attention mechanisms or graph structures to capture the correlation between video semantics and the dialogue context, but they process all visual information at a single coarse granularity. This loses fine-grained temporal and spatial information, such as the continuous motion of the same object or subtle positional cues in an image; conversely, processing all visual information at a fine granularity increases latency and degrades dialogue fluency. We therefore propose a hierarchical visual attention-based method for generating semantically rich video dialogue. First, guided by the dialogue context, global visual attention captures global visual semantic information and locates the temporal and spatial regions of the video associated with the dialogue input. Second, a local attention mechanism further captures fine-grained visual information within the located regions, and the dialogue response is generated with a multi-task learning method. Experimental results on the DSTC7 AVSD dataset show that the dialogue generated by the proposed method is more accurate and diverse, improving the METEOR score by 23.24%.
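The two-stage mechanism described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature shapes, the top-k localization step, and the dot-product scoring are all assumptions made for clarity, and the multi-task decoding stage is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_attention(query, frame_feats, region_feats, top_k=2):
    """Two-stage visual attention (illustrative sketch).

    query:        (d,)       dialogue-context vector
    frame_feats:  (T, d)     one global feature per video frame
    region_feats: (T, R, d)  R spatial region features per frame

    Stage 1 (global): attend over frames to locate the temporal
    scope relevant to the dialogue input.
    Stage 2 (local):  attend over spatial regions of only the
    located frames to extract fine-grained visual context.
    """
    # Stage 1: global attention over frames, then pick the top-k
    # frames as the "located" temporal scope.
    global_attn = softmax(frame_feats @ query)        # (T,)
    located = np.argsort(global_attn)[-top_k:]        # frame indices

    # Stage 2: local attention over regions of located frames only,
    # avoiding fine-grained processing of the whole video.
    local = region_feats[located].reshape(-1, query.shape[0])
    local_attn = softmax(local @ query)               # (top_k * R,)
    context = local_attn @ local                      # (d,) fine-grained context
    return context, global_attn, located
```

Restricting the second stage to the located frames is what keeps the fine-grained pass cheap: only `top_k * R` regions are scored instead of `T * R`.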