Multi-scale local regional attention fusion using visual transformers for fine-grained image classification

Abstract Fine-grained visual classification (FGVC) is challenging because visually similar categories differ only in minute details, and these differences are often hard to recognize even for human observers. Traditional methods struggle with this task, which motivates our multi-scale local regional attention fusion scheme based on Visual Transformers. We use Swin Transformer as the backbone to extract fine-grained features and enhance the feature representations through a relevant portions multi-headed attention mechanism. Furthermore, a convolutional forward propagation network module refines global spatial and channel features. Our approach achieves state-of-the-art performance on benchmarks such as CUB-200-2011, NABirds, and Oxford 102 Flowers, demonstrating the effectiveness of our multi-scale fusion strategy for FGVC. Our code will be available at https://github.com/LYSongs/RRSA.

Yusong Li, Bin Xie, Yuling Li, Jiahao Zhang

Hebei Normal University; Hebei Key Laboratory of Computational Mathematics and Applications

2025

The Visual Computer

ISSN: 0178-2789
Year, Volume(Issue): 2025, 41(8)