Context Modeling and Reasoning for Video Abnormal Event Detection

Che Sun¹, Yuwei Wu², Yunde Jia²

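Author information

1. Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081
2. Guangdong Provincial Key Laboratory of Intelligent Perception and Computing, Shenzhen MSU-BIT University, Shenzhen, Guangdong 518172; Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081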

Abstract

Video abnormal event detection aims to automatically detect events in videos that do not conform to the regularities of normal events. Many normal and abnormal events arise from interactions between an object and the scene or other objects, so they are object-centric and highly context-dependent. Extracting high-level semantic context from low-level video features and using it to detect abnormal events remains an open problem. To this end, we propose a novel context modeling and reasoning method for video abnormal event detection. The method builds a context graph for each video and automatically infers event-related semantic context from it, narrowing the semantic gap between low-level visual features and the high-level semantics of abnormal events. Specifically, we first use a pre-trained object detection network to extract the initial appearance features of the objects, the spatio-temporal relationship features between objects, and the scene features. We then design a context graph inference module that builds a spatio-temporal context graph and explicitly models the extracted features as three types of semantic context: the individual behaviors of event objects, the spatio-temporal relationships between different objects, and the interactions between objects and the scene. The nodes of the graph represent objects and the scene, and the edges represent spatio-temporal relationships. Finally, we build an anomaly prediction module that detects abnormal events from the inferred semantic context. The context graph inference module is based on mean-field theory: multiple recurrent neural networks with message-passing modules iteratively update the states of the nodes and edges to infer high-level semantic context from low-level visual features. The anomaly prediction module consists of two attention-pooling layers and one fully connected layer; it takes the semantic context as input and computes an anomaly score for each video frame. In the experiments, we design a self-training strategy and train the spatio-temporal context graph inference module and the anomaly prediction module end to end in four settings (unsupervised, semi-supervised, weakly supervised, and supervised), so that the two modules reinforce each other. The method is evaluated on four public datasets: three semi-supervised datasets, Subway (Entrance/Exit), Avenue, and ShanghaiTech, and one supervised dataset, UCF-Crime. Compared with methods that do not use context, our method improves the unsupervised AUC on Subway (Entrance/Exit), Avenue, and ShanghaiTech by 2.7%/3.1%, 2.0%, and 2.9%, respectively, and the semi-supervised AUC by 3.5%/3.3%, 4.0%, and 4.3%. On UCF-Crime, compared with methods that do not use context, it improves the semi-supervised, weakly supervised, and supervised AUC by 2.1%, 0.4%, and 9.2%, respectively, achieving competitive performance.
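
The pipeline summarized in the abstract (iterative node and edge updates on a context graph, followed by attention pooling and a fully connected layer that outputs a frame-level anomaly score) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: a plain tanh update stands in for the recurrent networks, random vectors replace the detector features, and every name, shape, and hyperparameter below (`message_passing`, `attention_pool`, `frame_anomaly_score`, `steps=3`, the split into one pooling layer for nodes and one for edges) is an assumption made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def message_passing(nodes, edges, w_node, w_edge, steps=3):
    """Iteratively refine node states (N, D) and edge states (N, N, D).

    Assumption: a tanh update replaces the recurrent networks of the paper.
    """
    n = nodes.shape[0]
    mask = 1.0 - np.eye(n)[:, :, None]  # ignore self-pairs (i, i)
    for _ in range(steps):
        # Node update: each node receives the mean state of its edges.
        node_msg = (edges * mask).sum(axis=1) / max(n - 1, 1)
        new_nodes = np.tanh(np.concatenate([nodes, node_msg], axis=1) @ w_node)
        # Edge update: each edge receives the states of its two end nodes.
        src = np.broadcast_to(nodes[:, None, :], edges.shape)
        dst = np.broadcast_to(nodes[None, :, :], edges.shape)
        edges = np.tanh(np.concatenate([edges, src + dst], axis=2) @ w_edge)
        nodes = new_nodes
    return nodes, edges

def attention_pool(x, query):
    """Weighted sum of the rows of x (M, D) with softmax attention weights."""
    return softmax(x @ query) @ x

def frame_anomaly_score(nodes, edges, params):
    """Two attention-pooling layers plus one linear layer -> score in (0, 1)."""
    n = nodes.shape[0]
    pair_states = edges[~np.eye(n, dtype=bool)]  # (N*(N-1), D), needs N >= 2
    node_ctx = attention_pool(nodes, params["q_node"])
    edge_ctx = attention_pool(pair_states, params["q_edge"])
    logit = np.concatenate([node_ctx, edge_ctx]) @ params["w_out"] + params["b_out"]
    return float(sigmoid(logit))

# Toy frame: 3 objects + 1 scene node, 8-dimensional features (random stand-ins).
rng = np.random.default_rng(0)
N, D = 4, 8
nodes = rng.normal(size=(N, D))
edges = rng.normal(size=(N, N, D))
w_node = rng.normal(scale=0.3, size=(2 * D, D))
w_edge = rng.normal(scale=0.3, size=(2 * D, D))
params = {
    "q_node": rng.normal(size=D),
    "q_edge": rng.normal(size=D),
    "w_out": rng.normal(size=2 * D),
    "b_out": 0.0,
}

nodes, edges = message_passing(nodes, edges, w_node, w_edge)
score = frame_anomaly_score(nodes, edges, params)
print(f"anomaly score: {score:.3f}")
```

With untrained random weights the score carries no meaning; in the described method the parameters would be learned end to end under one of the four training settings.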

Key words

abnormal event detection; context modeling and reasoning; context graph; self-training strategy; deep learning

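Funding

Shenzhen Natural Science Foundation, General Program (JCYJ20230807142703006)

Key Scientific Research Platforms and Projects of General Universities, Department of Education of Guangdong Province (2023ZDZX1034)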

Publication year

2024

Chinese Journal of Computers

China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences

Indexed in: CSTPCD, CSCD, PKU Core Journals

Impact factor: 3.18

ISSN: 0254-4164

Number of references: 48
