Neural Networks, 2022, Vol. 146. DOI: 10.1016/j.neunet.2021.11.017

Event-centric multi-modal fusion method for dense video captioning

Chang Z.¹, Zhao D.¹, Chen H.¹, Li J.¹, Liu P.¹

Author information

  • 1. Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University

Abstract

© 2021 Elsevier Ltd. Dense video captioning aims to automatically describe several events that occur in a given video; most state-of-the-art models accomplish this by locating and describing multiple events in an untrimmed video. Despite much progress in this area, most current approaches encode only visual features in the event localization phase and neglect the relationships between events, which may degrade the consistency of the descriptions within the same video. Thus, in the present study, we exploit visual–audio cues to generate event proposals and enhance event-level representations by capturing their temporal and semantic relationships. Furthermore, to compensate for the major limitation of not fully utilizing multi-modal information in the description process, we developed an attention-gating mechanism that dynamically fuses and regulates the multi-modal information. In summary, we propose an event-centric multi-modal fusion approach for dense video captioning (EMVC) to capture the relationships between events and effectively fuse multi-modal information. We conducted comprehensive experiments to evaluate the performance of EMVC on the benchmark ActivityNet Caption and YouCook2 data sets. The experimental results show that our model achieves impressive performance compared with state-of-the-art methods.
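The attention-gating fusion mentioned in the abstract can be illustrated with a minimal PyTorch sketch. The module name (GatedMultiModalFusion), feature dimensions, and the convex-combination gating form below are assumptions for illustration only, not the paper's exact implementation: a sigmoid gate computed from both modalities decides, per dimension, how much visual versus audio information enters the fused event representation.

import torch
import torch.nn as nn

class GatedMultiModalFusion(nn.Module):
    """Hypothetical sketch of an attention-gating fusion step (not the
    authors' exact architecture): a sigmoid gate, conditioned on both
    modalities, balances visual and audio features per dimension."""

    def __init__(self, visual_dim: int, audio_dim: int, hidden_dim: int):
        super().__init__()
        # Project each modality into a common hidden space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # The gate sees both modalities jointly.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        v = torch.tanh(self.visual_proj(visual_feat))   # (batch, hidden_dim)
        a = torch.tanh(self.audio_proj(audio_feat))     # (batch, hidden_dim)
        # Per-dimension gate in [0, 1], computed from the concatenated features.
        g = torch.sigmoid(self.gate(torch.cat([v, a], dim=-1)))
        # Convex combination: g regulates the visual/audio balance.
        return g * v + (1.0 - g) * a

# Usage (assumed dimensions): fuse per-event visual and audio features
# before they are passed to a caption decoder.
fusion = GatedMultiModalFusion(visual_dim=2048, audio_dim=128, hidden_dim=512)
fused = fusion(torch.randn(4, 2048), torch.randn(4, 128))  # shape (4, 512)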

Keywords

Dense video captioning; Event-centric; Multi-modal fusion


Year of publication: 2022
Journal: Neural Networks (ISSN: 0893-6080; indexed in EI, SCI)
Cited by: 7
References: 57