CAAI Transactions on Intelligent Systems, 2024, Vol. 19, Issue 2: 446-454. DOI: 10.11992/tis.202209048


Video summarization model based on the multiscale attention mechanism and bidirectional gated recurrent network

闫河¹ 刘灵坤¹ 黄俊滨¹ 张烨¹ 段思宇¹

Author information

• 1. Liangjiang College of Artificial Intelligence, Chongqing University of Technology, Chongqing 401135, China

Abstract

In the video summarization task, the distribution of global attention values over long video sequences has a large variance, so the importance scores of the generated key frames deviate considerably, and the boundary values of time-series nodes lack long-range dependence, which leads to poor semantic coherence between segments. To address these problems, the attention module is improved by combining segmented local self-attention with a global self-attention mechanism, capturing the key features of local and global video sequences and reducing the variance of the attention values. In parallel, a bidirectional gated recurrent network (BiGRU) is introduced; the outputs of the two branches are fed into improved classification-regression modules, and the results are then fused additively. Finally, non-maximum suppression (NMS) and kernel temporal segmentation (KTS) are used to filter the segments and divide them into high-quality representative shots, and the final summary is generated by a knapsack combinatorial optimization algorithm. On this basis, a video summarization model combining a multiscale attention mechanism and a bidirectional gated recurrent network (local and global attention combined with BiGRU, LG-RU) is proposed. Comparative experiments on the standard and augmented versions of the TvSum and SumMe datasets show that the model achieves a higher F-score, confirming that it can summarize videos robustly while maintaining high accuracy.
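As a concrete illustration of the scoring pipeline described above, the following is a minimal sketch, not the authors' implementation, written in PyTorch with assumed details (1024-dimensional frame features, 8 attention heads, 64-frame local segments, and simple linear scoring heads standing in for the improved classification-regression modules): frame features are scored by a combined segmented local/global self-attention branch and a parallel BiGRU branch, and the two branch scores are fused additively.

```python
# Minimal sketch (not the authors' code) of the two parallel branches described in the
# abstract, assuming PyTorch and hypothetical dimensions: frame features are scored by
# (a) a combination of segmented local self-attention and global self-attention and
# (b) a BiGRU, and the two branch scores are fused additively into frame importances.
import torch
import torch.nn as nn


class LocalGlobalAttention(nn.Module):
    """Global self-attention over the whole sequence plus self-attention inside
    fixed-length local segments; outputs are summed to keep attention variance low."""

    def __init__(self, dim=1024, heads=8, segment=64):
        super().__init__()
        self.segment = segment
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (1, T, dim)
        g, _ = self.global_attn(x, x, x)           # long-range context
        locals_ = []
        for s in range(0, x.size(1), self.segment):
            seg = x[:, s:s + self.segment]
            out, _ = self.local_attn(seg, seg, seg)  # short-range context
            locals_.append(out)
        return g + torch.cat(locals_, dim=1)       # additive combination of the two scales


class LGRUScorer(nn.Module):
    """Attention branch and BiGRU branch in parallel; each feeds a small scoring head
    (a stand-in for the improved classification-regression module) and the per-frame
    scores are fused additively."""

    def __init__(self, dim=1024, hidden=512):
        super().__init__()
        self.attn = LocalGlobalAttention(dim)
        self.bigru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.head_attn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))
        self.head_gru = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                      # feats: (1, T, dim)
        s_attn = self.head_attn(self.attn(feats)).squeeze(-1)
        gru_out, _ = self.bigru(feats)
        s_gru = self.head_gru(gru_out).squeeze(-1)
        return torch.sigmoid(s_attn + s_gru)       # fused frame-level importance scores


if __name__ == "__main__":
    frames = torch.randn(1, 300, 1024)             # 300 frames of 1024-d features
    print(LGRUScorer()(frames).shape)              # torch.Size([1, 300])
```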

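The post-processing stage can be sketched in the same spirit. The example below is an assumption-laden illustration rather than the paper's code: shot boundaries are taken as given (standing in for KTS output), frame scores are averaged within each shot, and a 0/1 knapsack over shot lengths selects shots under a summary budget, here assumed to be 15% of the frames as is conventional on TvSum and SumMe; NMS-based segment filtering is omitted for brevity.

```python
# Minimal sketch (assumptions, not the paper's implementation) of the post-processing
# stage: frame scores are pooled over shots whose boundaries are assumed to come from
# kernel temporal segmentation (KTS), and a 0/1 knapsack over shot lengths picks the
# shots kept in the summary (budget here: 15% of the frames).
from typing import List, Tuple


def knapsack_summary(frame_scores: List[float],
                     shots: List[Tuple[int, int]],
                     budget_ratio: float = 0.15) -> List[int]:
    """Return indices of the shots selected for the summary."""
    lengths = [e - s for s, e in shots]
    values = [sum(frame_scores[s:e]) / max(e - s, 1) for s, e in shots]
    capacity = int(budget_ratio * len(frame_scores))

    # Classic 0/1 knapsack DP over (shot index, remaining frame budget).
    dp = [[0.0] * (capacity + 1) for _ in range(len(shots) + 1)]
    for i in range(1, len(shots) + 1):
        w, v = lengths[i - 1], values[i - 1]
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c and dp[i - 1][c - w] + v > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + v

    # Backtrack to recover which shots were taken.
    chosen, c = [], capacity
    for i in range(len(shots), 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.6, 0.1, 0.1, 0.2] * 10   # 100 frames
    shots = [(i, i + 10) for i in range(0, 100, 10)]  # 10 equal shots (stand-in for KTS)
    print(knapsack_summary(scores, shots))
```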

Keywords

video summarization / self-attention mechanism / importance score / long-range dependence / computer vision / bidirectional gated recurrent network (BiGRU) / non-maximum suppression (NMS) / kernel temporal segmentation (KTS)



Funding

National Key Research and Development Program of China, "Intelligent Robot" key special project (2018YFB1308602)

General program of the National Natural Science Foundation of China (61173184)

Natural Science Foundation of Chongqing (cstc2018jcyjAX0694)

Publication year

2024
CAAI Transactions on Intelligent Systems
Sponsored by the Chinese Association for Artificial Intelligence and Harbin Engineering University

CAAI Transactions on Intelligent Systems
Indexed in CSTPCD and the Peking University core journal list
Impact factor: 0.672
ISSN: 1673-4785
Number of references: 27