
Dual Semantic Reconstruction Network for Weakly Supervised Temporal Sentence Grounding

Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder forming meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.
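The consistency constraint described above can be illustrated with a minimal sketch. This is a hypothetical formulation, not the paper's actual implementation: it assumes each branch emits proposals as normalized (start, end) pairs and that the i-th object proposal is matched to the i-th motion proposal, penalizing boundary disagreement with a mean absolute difference.

```python
# Hypothetical sketch of the boundary-consistency constraint: encourage the
# object branch and the motion branch to agree on proposal boundaries.
# The function name, the (start, end) representation, and the L1 penalty
# are illustrative assumptions, not taken from the paper.

def boundary_consistency_loss(object_props, motion_props):
    """Mean absolute boundary gap between matched proposal pairs.

    Each proposal is a (start, end) pair of timestamps normalized to [0, 1];
    the i-th object proposal is assumed matched to the i-th motion proposal.
    """
    assert len(object_props) == len(motion_props) and object_props
    total = 0.0
    for (s_obj, e_obj), (s_mot, e_mot) in zip(object_props, motion_props):
        total += abs(s_obj - s_mot) + abs(e_obj - e_mot)
    return total / len(object_props)
```

When both branches produce identical boundaries the loss is zero, so minimizing it pushes the two sets of proposals toward the high level of agreement the abstract describes.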

Keywords: Proposals; Grounding; Feature extraction; Image reconstruction; Annotations; Semantics; Training; Information processing; Decoding; Accuracy

Kefan Tang, Lihuo He, Nannan Wang, Xinbo Gao


Visual Information Processing Laboratory, School of Electronic Engineering, Xidian University, Xi'an, China

State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi'an, China

Visual Information Processing Laboratory, School of Electronic Engineering, Xidian University, Xi'an, China; Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China

2025

IEEE Transactions on Multimedia


ISSN:
Year, Volume (Issue): 2025, 27(1)