Multi-scale local regional attention fusion using visual transformers for fine-grained image classification

Springer Nature

Abstract: Fine-grained visual classification (FGVC) poses a significant challenge due to the minute differences among visually similar categories. Because these differences are so small, the objects are often difficult to tell apart, even for human observers. Traditional methods struggle with this task, so we develop a multi-scale local regional attention fusion scheme based on Visual Transformers. We use Swin Transformer as the backbone to extract fine-grained features, and enhance the feature representations with a multi-headed attention mechanism over the relevant portions. Furthermore, a convolutional forward propagation network module refines global spatial and channel features. Our approach achieves state-of-the-art performance on benchmarks such as CUB-200-2011, NABirds, and Oxford 102 Flowers, demonstrating the effectiveness of our multi-scale fusion strategy for FGVC. Our code will be available at https://github.com/LYSongs/RRSA.

Authors:

Yusong Li, Bin Xie, Yuling Li, Jiahao Zhang

Author affiliations:

Hebei Normal University
Hebei Key Laboratory of Computational Mathematics and Applications

Publication year:

2025

DOI:

10.1007/s00371-024-03721-8

The Visual Computer

ISSN: 0178-2789

Year, volume (issue): 2025, 41(8)

Number of references: 48