Fine-grained Image Classification Based on Bi-level Routing Attention and Feature Fusion
In recent years, the Vision Transformer (ViT) has achieved breakthroughs in image recognition. Its self-attention module can extract discriminative information from different image patches, thereby improving classification accuracy. Within image classification, fine-grained image classification is particularly difficult because differences between classes are small while differences within a class are large. To address the small, non-uniform, and hard-to-perceive inter-class differences in the data distribution of fine-grained images, a fine-grained image classification model based on Bi-level Routing Attention (BRA) is proposed. The backbone network is a recent vision Transformer with a multi-stage hierarchical architecture, used as the visual feature extractor to capture local and global information as well as multi-scale features. In addition, feature boosting and fusion modules are introduced to improve the network's ability to learn key features. Experimental results show that the model reaches classification accuracies of 91.7% and 92.2% on two fine-grained image datasets, CUB-200-2011 and Stanford Dogs, respectively. Compared with multiple mainstream fine-grained image classification models, the proposed model achieves better classification results.
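For readers unfamiliar with Bi-level Routing Attention, the following is a minimal NumPy sketch of the general idea as it is commonly described: each region first selects its top-k most relevant regions using coarse region-level affinities, and token-level attention is then computed only over the tokens in those routed regions. It is an illustration under stated assumptions, not the model proposed here: the function name `bi_level_routing_attention`, the tensor layout, and the `topk` parameter are hypothetical, and the sketch omits learned projections, multiple heads, and the feature boosting and fusion modules.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bi_level_routing_attention(q, k, v, topk):
    """Illustrative single-head sketch (hypothetical helper).

    q, k, v: arrays of shape (num_regions, tokens_per_region, dim),
    assumed to be already projected and partitioned into regions.
    """
    n_regions, n_tokens, dim = q.shape

    # Level 1: coarse region-to-region routing.
    q_region = q.mean(axis=1)            # (R, C) region-level queries
    k_region = k.mean(axis=1)            # (R, C) region-level keys
    affinity = q_region @ k_region.T     # (R, R) region affinity
    routed = np.argsort(-affinity, axis=1)[:, :topk]  # (R, topk) indices

    # Gather keys/values of the routed regions for every query region.
    k_gathered = k[routed].reshape(n_regions, topk * n_tokens, dim)
    v_gathered = v[routed].reshape(n_regions, topk * n_tokens, dim)

    # Level 2: fine token-to-token attention within the routed regions.
    scores = q @ k_gathered.transpose(0, 2, 1) / np.sqrt(dim)
    out = softmax(scores, axis=-1) @ v_gathered       # (R, T, C)
    return out, routed
```

With `topk` equal to the number of regions, the sketch reduces to ordinary full attention over all tokens; smaller values restrict each query to a sparse, content-dependent subset of regions.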