
Dual Semantic Reconstruction Network for Weakly Supervised Temporal Sentence Grounding

Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder forming meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.
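The consistency constraint described above can be illustrated with a minimal sketch. This is a hypothetical formulation, not the paper's actual implementation: it assumes each branch emits proposals as normalized (start, end) pairs and that the i-th object proposal is matched to the i-th motion proposal, penalizing boundary disagreement with a mean absolute difference.

```python
# Hypothetical sketch of the boundary-consistency constraint: encourage the
# object branch and the motion branch to agree on proposal boundaries.
# The function name, the (start, end) representation, and the L1 penalty
# are illustrative assumptions, not taken from the paper.

def boundary_consistency_loss(object_props, motion_props):
    """Mean absolute boundary gap between matched proposal pairs.

    Each proposal is a (start, end) pair of timestamps normalized to [0, 1];
    the i-th object proposal is assumed matched to the i-th motion proposal.
    """
    assert len(object_props) == len(motion_props) and object_props
    total = 0.0
    for (s_obj, e_obj), (s_mot, e_mot) in zip(object_props, motion_props):
        total += abs(s_obj - s_mot) + abs(e_obj - e_mot)
    return total / len(object_props)
```

When both branches produce identical boundaries the loss is zero, so minimizing it pushes the two sets of proposals toward the high level of agreement the abstract describes.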

Keywords: Proposals; Grounding; Feature extraction; Image reconstruction; Annotations; Semantics; Training; Information processing; Decoding; Accuracy

Kefan Tang, Lihuo He, Nannan Wang, Xinbo Gao


Visual Information Processing Laboratory, School of Electronic Engineering, Xidian University, Xi'an, China

State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi'an, China

Visual Information Processing Laboratory, School of Electronic Engineering, Xidian University, Xi'an, China; Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China

2025

IEEE Transactions on Multimedia


ISSN:
Year, Volume (Issue): 2025, 27(1)