
Vision-Language Tracking Method Combining Bi-level Routing Perception and Scattered Vision Transformation

To address the limited receptive field and insufficient feature interaction in vision-language relationship modeling, this paper proposes a vision-language tracking framework combining Bi-level routing Perception and Scattering Visual Transformation (BPSVTrack). First, a Bi-level Routing Perception Module (BRPM) is designed. It combines Efficient Additive Attention (EAA) and a Dual Dynamic Adaptive Module (DDAM) in parallel and lets them interact bidirectionally to enlarge the receptive field. This allows the model to integrate features across different windows and sizes more efficiently, which improves its ability to perceive targets in complex scenes. Second, a Scattering Vision Transform Module (SVTM) based on the Dual-Tree Complex Wavelet Transform (DTCWT) is introduced to decompose the image into low-frequency and high-frequency information. This captures both the target structure and fine-grained details in the image, improving the robustness and accuracy of the model in complex environments. The proposed framework achieves precision scores of 86.1%, 64.4%, and 63.2% on the OTB99, LaSOT, and TNL2K tracking datasets, respectively, and an accuracy of 70.21% on the RefCOCOg dataset. In both tracking and grounding, it outperforms the baseline model.
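The low-frequency/high-frequency split mentioned in the abstract can be illustrated with a much simpler transform. The sketch below is not the paper's SVTM and does not implement DTCWT; it swaps in a one-level 2-D Haar decomposition purely because that fits in a few lines of dependency-free Python. The function names `haar_split` and `haar_merge` are hypothetical and chosen only for this illustration.

```python
# Illustrative stand-in only: a one-level 2-D Haar split, NOT the DTCWT
# used by the paper's SVTM. All names here are hypothetical.

def haar_split(image):
    """Split an even-sized 2-D list into one low-frequency band and three
    high-frequency (detail) bands, each half the input size per axis."""
    low, dx, dy, dxy = [], [], [], []
    for r in range(0, len(image), 2):
        low_row, dx_row, dy_row, dxy_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]          # top-left, top-right
            p, q = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            low_row.append((a + b + p + q) / 4)   # local average: coarse structure
            dx_row.append((a - b + p - q) / 4)    # left-minus-right difference
            dy_row.append((a + b - p - q) / 4)    # top-minus-bottom difference
            dxy_row.append((a - b - p + q) / 4)   # diagonal difference
        low.append(low_row)
        dx.append(dx_row)
        dy.append(dy_row)
        dxy.append(dxy_row)
    return low, (dx, dy, dxy)


def haar_merge(low, details):
    """Invert haar_split, rebuilding the original image from its bands."""
    dx, dy, dxy = details
    image = []
    for i in range(len(low)):
        top, bottom = [], []
        for j in range(len(low[0])):
            l, h, v, d = low[i][j], dx[i][j], dy[i][j], dxy[i][j]
            top += [l + h + v + d, l - h + v - d]
            bottom += [l + h - v - d, l - h - v + d]
        image += [top, bottom]
    return image


image = [
    [1, 1, 5, 9],
    [1, 1, 5, 9],
    [2, 2, 4, 4],
    [2, 2, 4, 4],
]
low, details = haar_split(image)
restored = haar_merge(low, details)
```

Here `low` holds the block averages (coarse structure), while the detail bands are non-zero only where intensity changes inside a 2×2 block, which is the kind of fine-grained information a high-frequency branch would see. DTCWT additionally provides directional selectivity and approximate shift invariance, which this Haar sketch does not.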

Vision-Language Tracking (VLT); Bi-level routing perception; Scattering vision transform; Efficient Additive Attention (EAA); Dual dynamic adaptation

Liu Zhongmin (刘仲民), Li Zhenhua (李振华), Hu Wenjin (胡文瑾)


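College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou 730050, China

Key Laboratory of Gansu Advanced Control for Industrial Processes, Lanzhou 730050, China

College of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730030, China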


2024

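Journal of Electronics & Information Technology
Institute of Electronics, Chinese Academy of Sciences; Department of Information Sciences, National Natural Science Foundation of China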


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 1.302
ISSN: 1009-5896
Year, volume (issue): 2024, 46(11)