Fine-Grained Visual Classification Method Based on Joint Discriminative Region Features
The core of fine-grained visual classification is locating the discriminative regions in an image. Existing studies strengthen the long-range dependencies of discriminative-region features by adopting and improving the vision Transformer, but most of them only enhance attention to the most salient discriminative region and ignore the feature information that can be jointly extracted from sub-salient discriminative regions. As a result, categories with similar local features are difficult to distinguish and classification accuracy remains low. This paper therefore proposes a joint discriminative region extraction method. First, candidate discriminative regions of the feature map are partitioned at the front end of the self-attention module, guiding the model to extract discriminative-region features at different levels of saliency. Second, a bilinear-fusion self-attention module extracts joint features from multiple discriminative regions of different saliency, yielding more comprehensive discriminative-region feature information. Experimental results show that a vision Transformer equipped with the joint discriminative region method reaches 92.7% accuracy on the CUB-200-2011 dataset, 2.4 percentage points higher than the standard vision Transformer, and surpasses current state-of-the-art fine-grained visual classification methods on the other benchmark datasets.
fine-grained visual classification; discriminative region; vision Transformer; self-attention mechanism
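To make the two steps described in the abstract concrete, the following PyTorch sketch shows one possible head that (1) splits patch tokens into a salient and a sub-salient candidate region and (2) fuses the two region descriptors bilinearly. It is a minimal illustration under stated assumptions, not the paper's implementation: the regions are selected by each patch token's attention to the class token, the bilinear-fusion self-attention module is approximated by outer-product bilinear pooling, and names such as JointDiscriminativeRegionHead, top_k, and proj_dim are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDiscriminativeRegionHead(nn.Module):
    # Hypothetical sketch: select a salient and a sub-salient group of patch
    # tokens, pool each into a region descriptor, then fuse the two descriptors
    # with outer-product bilinear pooling before classification.
    def __init__(self, dim: int = 768, num_classes: int = 200,
                 proj_dim: int = 256, top_k: int = 12):
        super().__init__()
        self.top_k = top_k                                   # assumes N >= 2 * top_k
        self.proj = nn.Linear(dim, proj_dim)                 # shrink D before the outer product
        self.classifier = nn.Linear(proj_dim * proj_dim, num_classes)

    def forward(self, tokens: torch.Tensor, attn_to_cls: torch.Tensor) -> torch.Tensor:
        # tokens:      (B, N, D) patch embeddings from the last Transformer block
        # attn_to_cls: (B, N)    attention weight of each patch token to the CLS token
        B, N, D = tokens.shape
        order = attn_to_cls.argsort(dim=1, descending=True)
        salient_idx = order[:, : self.top_k]                 # most salient candidate region
        sub_idx = order[:, self.top_k : 2 * self.top_k]      # sub-salient candidate region

        def pool(idx):
            # gather the selected tokens and average them into one region descriptor
            sel = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
            return self.proj(sel.mean(dim=1))                # (B, proj_dim)

        f_sal, f_sub = pool(salient_idx), pool(sub_idx)

        # bilinear fusion: the outer product captures pairwise interactions between
        # the two region descriptors; signed sqrt and L2 normalization follow the
        # usual bilinear-pooling recipe
        fused = torch.einsum("bi,bj->bij", f_sal, f_sub).flatten(1)
        fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-8)
        fused = F.normalize(fused, dim=1)
        return self.classifier(fused)

# Usage sketch: logits = JointDiscriminativeRegionHead()(tokens, attn_to_cls),
# where tokens and attn_to_cls are taken from a ViT backbone such as ViT-B/16.

In this sketch the region split happens on the token order induced by CLS attention; the paper instead partitions candidate regions at the front end of the self-attention module and fuses them inside a bilinear-fusion self-attention block, so the code should be read only as a structural analogue of those two steps.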