Fine-Grained Image Classification Based on Feature Fusion and Ensemble Learning
Fine-grained image classification aims to accurately recognize subcategories within a given superclass; however, it faces the challenges of large intra-class differences, small inter-class differences, and limited training samples. Most current methods build on the Vision Transformer to improve classification performance, but they suffer from three issues: ignoring the complementary information carried by classification tokens from different layers leads to incomplete global feature extraction; inconsistent performance across the heads of the multi-head self-attention mechanism leads to inaccurate part localization; and limited training samples make the model prone to overfitting. In this study, a fine-grained image classification network based on feature fusion and ensemble learning is proposed to address these issues. The network consists of three modules: the multi-level feature fusion module integrates complementary information to obtain more complete global features; the multi-expert part voting module votes for part tokens through ensemble learning to enhance the representation ability of part features; and the attention-guided mixup augmentation module alleviates overfitting and improves classification accuracy. The classification accuracy on the CUB-200-2011, Stanford Dogs, NABirds, and IP102 datasets is 91.92%, 93.10%, 90.98%, and 76.21%, respectively, which is 1.42, 1.50, 1.08, and 2.81 percentage points higher than the original Vision Transformer model and better than the other fine-grained image classification methods compared.
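
The idea of voting for part tokens across attention heads can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the voting rule, so the sketch assumes that each head acts as one "expert", that each expert votes for the patches its classification token attends to most, and that the patches with the most votes are kept as part tokens. The function name `vote_part_tokens`, the parameters `top_k` and `num_parts`, and the toy attention values are all hypothetical.

```python
from collections import Counter
from typing import List


def vote_part_tokens(
    head_attention: List[List[float]], top_k: int, num_parts: int
) -> List[int]:
    """Pick patch indices by majority vote across attention heads.

    head_attention[h][p] is assumed to be the attention weight that the
    classification token of head h assigns to patch p (hypothetical layout).
    """
    votes: Counter = Counter()
    for weights in head_attention:
        # Each head ("expert") votes for its top_k most-attended patches.
        ranked = sorted(range(len(weights)), key=lambda p: weights[p], reverse=True)
        votes.update(ranked[:top_k])
    # Keep the patches with the most votes; ties go to the lower index.
    ordered = sorted(votes, key=lambda p: (-votes[p], p))
    return ordered[:num_parts]


# Toy example: 3 heads, 4 patches. Head 2 disagrees with the other two.
attention = [
    [0.05, 0.40, 0.25, 0.30],  # head 0 prefers patches 1 and 3
    [0.10, 0.35, 0.15, 0.40],  # head 1 prefers patches 3 and 1
    [0.50, 0.30, 0.10, 0.10],  # head 2 prefers patches 0 and 1
]
selected = vote_part_tokens(attention, top_k=2, num_parts=2)
print(selected)  # prints [1, 3]
```

In this toy case the head that disagrees is outvoted, so patch 0 is not selected. This is the sense in which an ensemble of heads can be less sensitive to one poorly performing head than any single head is.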