Frontiers of Information Technology & Electronic Engineering (信息与电子工程前沿), 2024, Vol. 25, Issue 6: 809-823. DOI: 10.1631/FITEE.2300024

Enhancing action discrimination via category-specific frame clustering for weakly-supervised temporal action localization

夏惠芬 1, 詹永照 2, 刘洪麟 3, 任晓鹏 3

Author information

  • 1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China; Changzhou Vocational Institute of Mechatronic Technology, Changzhou 213164, China
  • 2. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China; Engineering Research Center of Big Data Ubiquitous Perception and Intelligent Agriculture Applications, Zhenjiang 212013, China
  • 3. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China

Abstract

Temporal action localization (TAL) is the task of detecting the start and end timestamps of action instances in an untrimmed video and classifying them. As the number of action categories per video increases, existing weakly-supervised TAL (W-TAL) methods with only video-level labels cannot provide sufficient supervision, and single-frame supervision has therefore attracted the interest of researchers. However, existing paradigms model single-frame annotations only from the perspective of video snippet sequences, neglecting the action discrimination of the annotated frames and paying insufficient attention to their correlations within the same category. Within a given category, the annotated frames exhibit distinctive appearance characteristics or clear action patterns. Thus, a novel method that enhances action discrimination via category-specific frame clustering for W-TAL is proposed. Specifically, the K-means clustering algorithm is employed to aggregate the annotated discriminative frames of the same category, which are regarded as exemplars that exhibit the characteristics of that action category. The class activation scores are then obtained by calculating the similarities between a frame and the exemplars of the various categories. This category-specific representation modeling provides complementary guidance to the snippet-sequence modeling in the mainline. Accordingly, a convex combination fusion mechanism is presented for annotated frames and snippet sequences to enhance the consistency of action discrimination, generating a robust class activation sequence for precise action classification and localization. Owing to this supplementary guidance of action-discrimination enhancement for video snippet sequences, our method outperforms existing single-frame-annotation-based methods. Experiments conducted on three datasets (THUMOS14, GTEA, and BEOID) show that our method achieves high localization performance compared with state-of-the-art methods.
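
The abstract describes a three-stage pipeline: cluster the annotated single frames of each category with K-means to obtain category exemplars, score every frame by its similarity to each category's exemplars, and fuse these frame-level scores with the mainline snippet-sequence class activation sequence (CAS) through a convex combination. The following is a minimal sketch of that pipeline, not the authors' implementation: it assumes pre-extracted feature vectors for the annotated frames and for all frames of a video, and names such as `build_exemplars`, `n_clusters`, and `alpha`, as well as the choice of cosine similarity, are illustrative assumptions rather than details taken from the paper.

```python
# Hedged sketch of category-specific frame clustering and convex-combination
# fusion as outlined in the abstract; hyperparameters and the similarity
# measure are assumptions, not the paper's reported settings.
import numpy as np
from sklearn.cluster import KMeans


def build_exemplars(annotated_feats, labels, n_categories, n_clusters=4):
    """Cluster the annotated single-frame features of each action category
    with K-means; the cluster centers act as exemplars of that category."""
    exemplars = []
    for c in range(n_categories):
        feats_c = annotated_feats[labels == c]      # frames annotated with category c
        k = min(n_clusters, len(feats_c))           # guard against very small categories
        centers = KMeans(n_clusters=k, n_init=10).fit(feats_c).cluster_centers_
        exemplars.append(centers)                   # (k, D) centers per category
    return exemplars


def frame_class_scores(frame_feats, exemplars):
    """Class activation scores from frame-to-exemplar similarity: for each
    category, take the best cosine similarity over its exemplars."""
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    scores = []
    for centers in exemplars:
        e = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
        scores.append((f @ e.T).max(axis=1))        # (T,) best match per frame
    return np.stack(scores, axis=1)                 # (T, C) frame-level scores


def fuse_cas(snippet_cas, frame_cas, alpha=0.5):
    """Convex combination of the mainline snippet-sequence CAS and the
    exemplar-based frame-level scores (both of shape (T, C))."""
    return alpha * snippet_cas + (1.0 - alpha) * frame_cas
```

Under this reading, `alpha` controls how strongly the exemplar-based frame scores adjust the mainline snippet CAS before thresholding for classification and localization; the paper's actual fusion weights and similarity measure may differ.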

Key words

Weakly supervised; Temporal action localization; Single-frame annotation; Category-specific; Action discrimination

Funding

National Natural Science Foundation of China (61672268)

Publication year

2024

Journal: Frontiers of Information Technology & Electronic Engineering (信息与电子工程前沿)
Publisher: Zhejiang University
Indexed in: CSTPCD
Impact factor: 0.371
ISSN: 2095-9184