Spatio-temporal hierarchical query for referring video object segmentation
In this paper, we propose a spatio-temporal hierarchical query-based referring video object segmentation (RVOS) method, called STHQ, to address two shortcomings of existing RVOS methods: the lack of spatio-temporal consistency modeling and the insufficient learning of spatio-temporal representations of the target. We view the RVOS task as a query-based sequence prediction problem and propose a two-level query mechanism for spatio-temporal consistency modeling and feature learning of the target. In the first stage, we devise a frame-level spatial information extraction module, which adopts language features as the query to interact independently with each frame of the video sequence in the spatial dimension and generates instance embeddings containing spatial information about the target. In the second stage, we propose a spatio-temporal information aggregation module. This module uses video-level learnable queries to interact, in the spatio-temporal dimension, with the instance embeddings generated in the first stage and produces video-level instance embeddings that carry spatio-temporal representation information. Finally, the video-level instance embeddings are linearly transformed into the parameters of a conditional convolution, which is applied to the features of each frame in the video sequence to generate the mask prediction sequence of the target. Experimental results on three benchmarks show that the proposed STHQ outperforms existing approaches and achieves state-of-the-art performance.
referring video object segmentation; spatio-temporal consistency modeling; spatio-temporal feature learning; cross-modal feature interaction; Transformer
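The two-level query flow summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the STHQ implementation: the attention here is single-head and has no learned projections, the conditional convolution is reduced to a single 1x1 kernel plus bias per query, and all names (`cross_attention`, `sthq_sketch`, `w_dyn`, `b_dyn`) and tensor sizes are assumptions chosen for illustration. How the final mask is selected among the video-level queries is not stated in the abstract, so the sketch returns logits for every query.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, memory):
    # Simplified scaled dot-product attention (single head, no learned projections).
    # query: (..., Nq, C), memory: (..., Nk, C) -> output: (..., Nq, C)
    scale = query.shape[-1] ** -0.5
    attn = softmax(query @ np.swapaxes(memory, -1, -2) * scale, axis=-1)
    return attn @ memory

def sthq_sketch(frame_feats, lang_feat, video_queries, w_dyn, b_dyn):
    # frame_feats: (T, H, W, C) visual features, lang_feat: (C,) sentence feature,
    # video_queries: (Nq, C), w_dyn: (C, C + 1), b_dyn: (C + 1,)
    T, H, W, C = frame_feats.shape
    flat = frame_feats.reshape(T, H * W, C)

    # Stage 1: the language feature queries each frame independently (spatial only).
    lang_q = np.broadcast_to(lang_feat, (T, 1, C))
    inst = cross_attention(lang_q, flat).reshape(T, C)   # frame-level instance embeddings

    # Stage 2: video-level queries attend over the instance embeddings of all frames.
    video_emb = cross_attention(video_queries, inst)     # (Nq, C)

    # Linear map to conditional-convolution parameters: C kernel weights + 1 bias.
    params = video_emb @ w_dyn + b_dyn                   # (Nq, C + 1)
    kernel, bias = params[:, :C], params[:, C]

    # Apply the same dynamic 1x1 kernel to every frame to get a mask-logit sequence.
    logits = np.einsum("qc,thwc->qthw", kernel, frame_feats)
    return logits + bias[:, None, None, None]            # (Nq, T, H, W)

# Usage with random placeholder inputs (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
T, H, W, C, Nq = 4, 8, 8, 16, 5
logits = sthq_sketch(
    rng.normal(size=(T, H, W, C)),
    rng.normal(size=C),
    rng.normal(size=(Nq, C)),
    rng.normal(size=(C, C + 1)),
    np.zeros(C + 1),
)
masks = logits > 0   # binary mask sequence per query
print(logits.shape)  # (5, 4, 8, 8)
```

Because each video-level embedding yields one kernel that is shared across all frames, the predicted masks for a given query are tied together over time, which is one way to read the spatio-temporal consistency claim in the abstract.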