To address the problem that set-prediction-based dense video captioning methods repeatedly predict the same event or generate near-identical sentences, owing to the lack of explicit inter-event feature interaction and the absence of training signals targeting inter-event differences, a dense video captioning method based on the event maximal margin (EMM-DVC) is proposed. The event margin is a score that combines the feature similarity between events, the temporal distance between events within the video, and the diversity of the generated descriptions. By maximizing the event margin, EMM-DVC pushes similar predictions apart while pulling predictions closer to the ground-truth events. In addition, EMM-DVC introduces an event margin distance loss that enlarges the event margin distance and thereby guides the model to attend to different events. Experiments on the ActivityNet Captions dataset show that EMM-DVC generates more diverse captions than comparable dense video captioning models and achieves the best results on several metrics compared with mainstream dense video captioning methods.
Dense video captioning via maximal event margin
To solve the problem of repeated event predictions and near-identical generated sentences, caused by the lack of explicit inter-event feature interaction and the difficulty of capturing differences between events in set-prediction-based dense video captioning, a dense video captioning method based on the event maximal margin (EMM-DVC) was proposed. The event margin is a score that combines the feature similarity between events, the temporal distance between events in the video, and the diversity of the generated texts. By maximizing the event margin, EMM-DVC enlarges the distance between similar predictions and reduces the distance between predictions and ground-truth events. In addition, EMM-DVC employs an event margin distance loss function that guides the model to focus on different events by expanding the event margin distance. Experiments on the ActivityNet Captions benchmark dataset show that EMM-DVC generates more diverse texts than other dense video captioning models and yields superior performance to mainstream dense video captioning models on several evaluation metrics.
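Although the abstract does not give the exact formulation of the event margin or its loss, the following minimal Python (PyTorch) sketch illustrates what such a margin-based repulsion term could look like. The function name event_margin_penalty, the weights alpha and beta, the margin value, and the choice of cosine feature similarity plus temporal-center distance are all illustrative assumptions, not the authors' implementation; the caption-diversity term described in the abstract is omitted for brevity.

# Hypothetical sketch of an event-margin-style penalty (not the authors' code).
# Assumes PyTorch; alpha, beta, margin, and the similarity/distance terms are
# illustrative assumptions based only on the abstract's description.
import torch
import torch.nn.functional as F


def event_margin_penalty(event_feats: torch.Tensor,
                         event_spans: torch.Tensor,
                         margin: float = 0.5,
                         alpha: float = 1.0,
                         beta: float = 1.0) -> torch.Tensor:
    """Penalize pairs of predicted events that are too similar.

    event_feats: (N, D) features of N predicted events.
    event_spans: (N, 2) normalized (start, end) times of the predictions.
    """
    # Pairwise cosine similarity between event features: a high value means
    # two predictions likely describe the same underlying event.
    feats = F.normalize(event_feats, dim=-1)
    feat_sim = feats @ feats.t()                                     # (N, N)

    # Pairwise distance between temporal centers: a small value again
    # suggests duplicated predictions covering the same time region.
    centers = event_spans.mean(dim=-1)                               # (N,)
    time_dist = (centers.unsqueeze(0) - centers.unsqueeze(1)).abs()  # (N, N)

    # Combined "closeness" of a pair; pairs closer than `margin` are pushed
    # apart by a hinge penalty, which encourages a larger event margin.
    closeness = alpha * feat_sim - beta * time_dist
    n = event_feats.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=event_feats.device)
    return F.relu(closeness[off_diag] - margin).mean()

In a full model, a term of this kind would typically be added to the standard event localization and captioning losses, so that pushing duplicated predictions apart complements matching predictions to ground-truth events.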
dense video captioning; multi-task learning; end-to-end model; set prediction