Few-Shot Counting with Multi-Scale Vision Transformers and Attention Mechanisms
Object counting is a fundamental task in computer vision, with critical applications in areas such as crowd monitoring and ecological conservation. Traditional methods typically rely on large-scale annotated datasets, which are costly and time-consuming to obtain. Few-shot object counting has emerged as a promising alternative, enabling accurate counting from minimal annotated samples. In real-world scenarios, however, objects often exhibit significant scale variations due to factors such as view distortion, varying shooting distances, and inherent size differences, a challenge that existing few-shot methods struggle to handle effectively. To overcome this limitation, we propose a Scale-Aware Vision Transformer (SAViT) framework. Specifically, we design a multi-scale dilated convolution module in SAViT that adaptively adjusts convolution kernel sampling rates to handle objects of varying sizes. In addition, we incorporate a global channel attention mechanism to strengthen the model’s ability to capture robust feature representations, thereby improving detection accuracy. For practical usability, we integrate the Segment Anything Model (SAM) into an exemplar box selection module, allowing users to generate precise exemplar boxes by drawing a single line on the target object. Extensive experiments on the FSC-147 dataset demonstrate the effectiveness of our approach, which achieves a Mean Absolute Error (MAE) of 8.92 and a Root Mean Squared Error (RMSE) of 31.26. Compared with the state-of-the-art method CACViT, our model reduces MAE by 0.21 (a 2.30% improvement) and RMSE by 17.7 (a 36.15% improvement). Our approach not only provides an effective solution for few-shot object counting but also offers a practical paradigm for extending few-shot learning to complex vision tasks that require multi-scale reasoning. The code of our paper is available at https://github.com/BlouseDong/SAViT.
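To make the two architectural ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released code: it pairs parallel dilated convolutions (different sampling rates for different object scales) with a squeeze-and-excitation style global channel attention block. The module names, channel sizes, and dilation rates (1, 2, 4) are illustrative assumptions; the actual SAViT implementation is in the repository linked above.

```python
# Illustrative sketch only; dilation rates, channel sizes, and module names are
# assumptions, not SAViT's actual implementation (see the GitHub repo for that).
import torch
import torch.nn as nn


class MultiScaleDilatedConv(nn.Module):
    """Parallel 3x3 convolutions with different dilation (sampling) rates, fused
    with learnable branch weights so the effective receptive field can adapt to
    objects of different sizes."""

    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.branch_weights = nn.Parameter(torch.ones(len(dilations)))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = torch.softmax(self.branch_weights, dim=0)
        out = sum(w[i] * branch(x) for i, branch in enumerate(self.branches))
        return self.fuse(out)


class GlobalChannelAttention(nn.Module):
    """Global average pooling followed by a small MLP that re-weights feature
    channels (squeeze-and-excitation style attention)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        scale = self.mlp(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x * scale


if __name__ == "__main__":
    feats = torch.randn(2, 256, 48, 48)          # e.g. a ViT feature map reshaped to 2D
    feats = MultiScaleDilatedConv(256)(feats)    # scale-aware refinement
    feats = GlobalChannelAttention(256)(feats)   # channel re-weighting
    print(feats.shape)                           # torch.Size([2, 256, 48, 48])
```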