
Few-Shot Counting with Multi-Scale Vision Transformers and Attention Mechanisms

Object counting is a fundamental task in computer vision, with critical applications in areas such as crowd monitoring and ecological conservation. Traditional methods typically rely on large-scale annotated datasets, which are costly and time-consuming to obtain. Few-shot object counting has emerged as a promising solution, enabling accurate counting with minimal annotated samples. However, in real-world scenarios, objects often exhibit significant scale variations due to factors such as view distortion, varying shooting distances, and inherent size differences, a challenge that existing few-shot methods usually struggle to address effectively. To address this, we propose a Scale-Aware Vision Transformer (SAViT) framework. Specifically, we design a multi-scale dilated convolution module in SAViT, which can adaptively adjust convolution kernel sampling rates to handle objects of varying sizes. Additionally, we incorporate a global channel attention mechanism to strengthen the model's ability to capture robust feature representations, thereby improving counting accuracy. For practical usability, we integrate the Segment Anything Model (SAM) into an exemplar box selection module, simplifying the process by allowing users to generate precise exemplar boxes with a single line drawn on the target object. Extensive experiments on the FSC-147 dataset demonstrate the effectiveness of our approach, achieving a Mean Absolute Error (MAE) of 8.92 and a Root Mean Squared Error (RMSE) of 31.26. Compared to the state-of-the-art method, CACViT, our model reduces MAE by 0.21 (a 2.30% improvement) and RMSE by 17.7 (a 36.15% improvement). Our approach not only provides an effective solution for few-shot object counting but also offers a new practical paradigm for extending few-shot learning to complex vision tasks requiring multi-scale reasoning. The code of our paper is available at https://github.com/BlouseDong/SAViT.
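The two core ideas of the abstract — convolving at multiple dilation rates to cover different object scales, then reweighting the resulting feature channels by a globally pooled attention signal — can be sketched in simplified form. The snippet below is an illustrative single-channel NumPy version; the function names, the fixed rates (1, 2, 3), and the parameter-free sigmoid excitation are our own assumptions for exposition, not the paper's actual SAViT implementation, which uses learned kernels and adaptively adjusted sampling rates.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """Single-channel 2D convolution with dilation `rate` (valid padding)."""
    kh, kw = kernel.shape
    eh, ew = (kh - 1) * rate + 1, (kw - 1) * rate + 1  # effective receptive field
    H, W = x.shape
    out = np.empty((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the input with stride `rate` inside the effective window,
            # enlarging the receptive field without adding kernel weights
            out[i, j] = np.sum(x[i:i + eh:rate, j:j + ew:rate] * kernel)
    return out

def multi_scale_branch(x, kernel, rates=(1, 2, 3)):
    """Apply the same kernel at several dilation rates and crop the branch
    outputs to a common spatial size so they can be fused."""
    feats = [dilated_conv2d(x, kernel, r) for r in rates]
    h = min(f.shape[0] for f in feats)
    w = min(f.shape[1] for f in feats)
    return [f[:h, :w] for f in feats]

def global_channel_attention(feats):
    """SE-style global channel attention: weight each branch (treated as a
    channel) by a sigmoid of its globally average-pooled response, then fuse."""
    gap = np.array([f.mean() for f in feats])  # squeeze: global average pool
    w = 1.0 / (1.0 + np.exp(-gap))             # excitation (no learned FC here)
    return sum(wi * f for wi, f in zip(w, feats))

# Toy input: the rate-3 branch has a 7x7 effective kernel, so it bounds the
# common output size that all branches are cropped to.
x = np.random.default_rng(0).standard_normal((16, 16))
k = np.ones((3, 3)) / 9.0
fused = global_channel_attention(multi_scale_branch(x, k))
print(fused.shape)  # (10, 10)
```

In the real module, each dilation branch sees the same number of kernel weights but a progressively larger receptive field, which is what lets one module respond to both small and large instances of the counted object.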

Few-shot learning, vision transformer, object counting

Xiaopan Chen, Zhiwei Dong, Xiaoke Zhu, Fan Zhang, Caihong Yuan


School of Computer and Information Engineering, Henan University, Kaifeng, P. R. China

Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, P. R. China

Henan Engineering Research Center of Intelligent Technology and Application, Henan University, Kaifeng, P. R. China

2025

International Journal of Pattern Recognition and Artificial Intelligence • 29