Self-knowledge distillation for fine-grained image classification
Objective Fine-grained image classification aims to distinguish multiple sub-categories within a single super-category. This task is more challenging than general image classification because of subtle inter-class differences and large intra-class variations. The attention mechanism enables a model to focus on the key areas of an input image and on its discriminative regional features, which are particularly useful for fine-grained image classification; attention-based classification models also show high interpretability. To strengthen the model's focus on discriminative image regions, attention-based methods have therefore been applied to fine-grained image classification. Although current attention-based fine-grained classification models achieve high accuracy, they give insufficient consideration to parameter count and computational cost. As a result, they cannot be easily deployed on low-resource devices, which greatly limits their practical application. Knowledge distillation transfers knowledge from a high-accuracy but parameter-heavy and computationally expensive teacher model to a small student model with few parameters and low computational cost, enhancing the student's performance while reducing the cost of model learning. To further reduce this cost, researchers have proposed self-knowledge distillation, which, unlike traditional knowledge distillation, lets a model improve its performance using its own knowledge rather than relying on a teacher network. However, this method falls short on fine-grained image classification because it fails to extract discriminative regional features effectively, leading to unsatisfactory distillation results. To address this issue, we propose a self-knowledge distillation learning method for fine-grained image classification that fuses efficient channel attention (ECASKD). Method The proposed 
method embeds an efficient channel attention mechanism into the structure of the self-knowledge distillation framework to effectively extract discriminative regional features from images. The framework consists mainly of a self-knowledge distillation network with a lightweight backbone and a self-teacher subnetwork, together with a joint loss comprising a classification loss, a knowledge distillation loss, and a multi-layer feature-based knowledge distillation loss. First, we introduce the efficient channel attention (ECA) module, propose the ECA-Residual block, and construct the ECA-Residual Network18 (ECA-ResNet18) lightweight backbone to improve the extraction of multi-scale features in discriminative regions of the input image. Compared with the residual block of the original ResNet18, the ECA-Residual block inserts an ECA module after each batch normalization operation. Two ECA-Residual blocks form one stage of the ECA-ResNet18 backbone, which enhances the network's focus on discriminative image regions and facilitates multi-scale feature extraction. Unlike ResNet18, which is commonly used in self-knowledge distillation methods, the proposed backbone is built on the ECA-Residual block, which significantly enhances the model's ability to extract multi-scale features while remaining lightweight and computationally efficient. Second, considering that the feature maps of different scales output by the backbone differ in importance, we design the efficient channel attention bidirectional feature pyramid network (ECA-BiFPN) block, which assigns weights to channels during feature fusion to differentiate the contributions of different channels to the fine-grained classification task. Finally, we propose a multi-layer feature-based knowledge distillation loss that strengthens the backbone's learning from the self-teacher subnetwork and its focus on discriminative regions. Result Our 
proposed method achieves classification accuracies of 76.04%, 91.11%, and 87.64% on three publicly available datasets, namely, Caltech-UCSD Birds 200 (CUB), Stanford Cars (CAR), and FGVC-Aircraft (AIR). To ensure a comprehensive and objective evaluation, we compared ECASKD with 15 other methods, including data-augmentation-based, auxiliary-network-based, and attention-based methods. Compared with the state-of-the-art (SOTA) data-augmentation-based method, ECASKD improves accuracy by 3.89%, 1.94%, and 4.69% on CUB, CAR, and AIR, respectively. Compared with the SOTA auxiliary-network-based method, it improves accuracy by 6.17%, 4.93%, and 7.81% on CUB, CAR, and AIR, respectively. Compared with the SOTA method that combines an auxiliary network with data augmentation, it improves accuracy by 2.63%, 1.56%, and 3.66% on CUB, CAR, and AIR, respectively; that is, ECASKD achieves better fine-grained classification performance than these joint methods even without data augmentation. Compared with the SOTA attention-based self-knowledge distillation method, ECASKD improves accuracy by about 23.28%, 8.17%, and 14.02% on CUB, CAR, and AIR, respectively. In sum, ECASKD outperforms all three types of self-knowledge distillation methods and demonstrates better fine-grained image classification performance. We also compare the method with four mainstream models in terms of parameter count, floating-point operations (FLOPs), and Top-1 classification accuracy. Compared with ResNet18, the ECA-ResNet18 backbone used in the proposed method significantly improves classification accuracy with an increase of only 0.4 M parameters and 0.2 G FLOPs. Compared with the larger ResNet50, the proposed method requires less than half the parameters and computation, yet its classification accuracy on the CAR 
dataset differs from that of ResNet50 by only 0.6%. Compared with the larger ViT-Base and Swin-Transformer-B, the proposed method requires about one-eighth of their parameters and computation, and its classification accuracies on the CAR and AIR datasets are only 3.7% and 5.3% lower than those of the best-performing Swin-Transformer-B. These results demonstrate that the proposed method significantly improves classification accuracy with only a small increase in model complexity. Conclusion The proposed self-knowledge distillation method for fine-grained image classification achieves good performance with 11.9 M parameters and 2.0 G FLOPs, and its lightweight network model is suitable for edge computing applications on embedded devices.
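To make the ECA operation described in the Method section concrete, the following is a minimal pure-Python sketch of the standard ECA computation: global average pooling per channel, a shared 1-D convolution across channels, and a sigmoid gate that reweights each channel. The function names are illustrative, and the fixed averaging kernel stands in for the learned 1-D convolution of the actual module; this is not the paper's implementation.

```python
import math

def eca_weights(channel_means, k=3):
    """ECA sketch: from per-channel global-average-pooled values,
    compute channel attention weights via a 1-D convolution of
    kernel size k across channels, followed by a sigmoid.
    The averaging kernel here is a placeholder for the learned one."""
    c = len(channel_means)
    pad = k // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    kernel = [1.0 / k] * k  # illustrative fixed kernel
    conv = [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(c)]
    return [1.0 / (1.0 + math.exp(-v)) for v in conv]

def apply_eca(feature_map, k=3):
    """feature_map: list of channels, each a 2-D list (H x W).
    Returns the channel-reweighted feature map."""
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
             for ch in feature_map]
    w = eca_weights(means, k)
    return [[[w[c] * v for v in row] for row in ch]
            for c, ch in enumerate(feature_map)]
```

Because the attention is computed from pooled channel statistics with a shared 1-D kernel, the module adds only a handful of parameters, which is consistent with the small overhead reported for ECA-ResNet18 over ResNet18.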
Keywords: fine-grained image classification; channel attention; knowledge distillation (KD); self-knowledge distillation (SKD); feature fusion; convolutional neural network (CNN); lightweight model
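As an illustration of the knowledge distillation loss referred to above, here is a minimal pure-Python sketch of the generic soft-label formulation (temperature-softened teacher and student distributions compared with a KL divergence, scaled by T^2). This is the standard Hinton-style term, shown only as background; it is not the paper's exact joint loss, which also includes a classification loss and a multi-layer feature-based distillation loss.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a higher T yields softer targets."""
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Soft-label distillation term: KL(teacher || student) on
    temperature-softened distributions, scaled by T^2 so gradients
    keep a comparable magnitude across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T * T) * sum(pi * math.log(pi / qi)
                         for pi, qi in zip(p, q) if pi > 0)
```

In a self-knowledge distillation setup such as the one described here, the teacher logits would come from the model's own self-teacher subnetwork rather than from a separate pretrained teacher.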