To address the problem that set-prediction-based dense video captioning methods repeatedly predict the same event or generate near-identical sentences, owing to the lack of explicit inter-event feature interaction and the absence of training signals targeting inter-event differences, a dense video captioning method based on the event maximal margin (EMM-DVC) is proposed. The event margin is a score that combines the feature similarity between events, the temporal distance between events within the video, and the diversity of the generated descriptions. By maximizing the event margin, EMM-DVC pushes similar predictions apart while pulling predictions closer to the ground-truth events. In addition, EMM-DVC introduces an event margin distance loss that enlarges the event margin distance and thereby guides the model to attend to different events. Experiments on the ActivityNet Captions dataset show that EMM-DVC generates more diverse captions than comparable dense video captioning models and achieves the best results on several metrics compared with mainstream dense video captioning methods.
Dense video captioning via maximal event margin
To solve the problem of repeated event predictions and near-identical generated sentences, caused by the lack of explicit inter-event feature interaction and the difficulty of capturing differences between events in set-prediction-based dense video captioning, a dense video captioning method based on the event maximal margin (EMM-DVC) was proposed. The event margin is a score that combines the feature similarity between events, the temporal distance between events in the video, and the diversity of the generated texts. By maximizing the event margin, EMM-DVC enlarges the distance between similar predictions and reduces the distance between predictions and ground-truth events. In addition, EMM-DVC employs an event margin distance loss function that guides the model to focus on different events by expanding the event margin distance. Experiments on the ActivityNet Captions benchmark dataset show that EMM-DVC generates more diverse texts than other dense video captioning models and yields superior performance to mainstream dense video captioning models on several evaluation metrics.
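Although the abstract does not give the exact formulation of the event margin or its loss, the following minimal Python (PyTorch) sketch illustrates what such a margin-based repulsion term could look like. The function name event_margin_penalty, the weights alpha and beta, the margin value, and the choice of cosine feature similarity plus temporal-center distance are all illustrative assumptions, not the authors' implementation; the caption-diversity term described in the abstract is omitted for brevity.

# Hypothetical sketch of an event-margin-style penalty (not the authors' code).
# Assumes PyTorch; alpha, beta, margin, and the similarity/distance terms are
# illustrative assumptions based only on the abstract's description.
import torch
import torch.nn.functional as F


def event_margin_penalty(event_feats: torch.Tensor,
                         event_spans: torch.Tensor,
                         margin: float = 0.5,
                         alpha: float = 1.0,
                         beta: float = 1.0) -> torch.Tensor:
    """Penalize pairs of predicted events that are too similar.

    event_feats: (N, D) features of N predicted events.
    event_spans: (N, 2) normalized (start, end) times of the predictions.
    """
    # Pairwise cosine similarity between event features: a high value means
    # two predictions likely describe the same underlying event.
    feats = F.normalize(event_feats, dim=-1)
    feat_sim = feats @ feats.t()                                     # (N, N)

    # Pairwise distance between temporal centers: a small value again
    # suggests duplicated predictions covering the same time region.
    centers = event_spans.mean(dim=-1)                               # (N,)
    time_dist = (centers.unsqueeze(0) - centers.unsqueeze(1)).abs()  # (N, N)

    # Combined "closeness" of a pair; pairs closer than `margin` are pushed
    # apart by a hinge penalty, which encourages a larger event margin.
    closeness = alpha * feat_sim - beta * time_dist
    n = event_feats.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=event_feats.device)
    return F.relu(closeness[off_diag] - margin).mean()

In a full model, a term of this kind would typically be added to the standard event localization and captioning losses, so that pushing duplicated predictions apart complements matching predictions to ground-truth events.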
dense video captioning; multi-task learning; end-to-end model; set prediction