The core of fine-grained visual classification is locating discriminative regions in an image. Existing studies have strengthened the long-range dependencies of discriminative-region features by applying and improving the vision Transformer, but most methods are limited to enhancing attention on the most salient discriminative region, ignoring feature information that can be jointly extracted from sub-salient discriminative regions. This makes it difficult to distinguish categories with similar local features and lowers classification accuracy. Therefore, this paper proposes a joint discriminative region extraction method. Firstly, candidate discriminative regions of the feature map are partitioned at the front end of the self-attention module, guiding the model to extract discriminative-region features at different levels of saliency. Secondly, a bilinear fusion self-attention module extracts joint features from multiple discriminative regions of different saliency, yielding more comprehensive discriminative-region feature information. Experimental results show that the vision Transformer equipped with the joint discriminative region method achieves 92.7% accuracy on the CUB-200-2011 dataset, 2.4 percentage points higher than the standard vision Transformer, and surpasses current state-of-the-art fine-grained visual classification methods on the other benchmark datasets.
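The two steps summarized above can be illustrated with a minimal sketch. Note that the function names, the top-k region partition, and the outer-product fusion with signed square-root normalization are illustrative assumptions, not the paper's exact implementation: patch tokens are split into a salient and a sub-salient candidate region by their attention scores, each region is pooled, and the two pooled vectors are bilinearly fused into one joint descriptor.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_region_features(tokens, attn_scores, k_salient=4, k_sub=4):
    """Illustrative sketch: partition patch tokens into salient and
    sub-salient candidate regions by attention score, pool each region,
    and bilinearly fuse the two pooled vectors (hypothetical helper,
    not the paper's exact module)."""
    order = np.argsort(attn_scores)[::-1]                 # most-attended first
    salient = tokens[order[:k_salient]].mean(axis=0)      # salient region pooling
    sub = tokens[order[k_salient:k_salient + k_sub]].mean(axis=0)  # sub-salient pooling
    fused = np.outer(salient, sub).reshape(-1)            # bilinear (outer-product) fusion
    fused = np.sign(fused) * np.sqrt(np.abs(fused))       # signed square-root normalization
    return fused / (np.linalg.norm(fused) + 1e-12)        # L2-normalize the joint descriptor

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))    # 16 patch tokens of dimension 8
scores = softmax(rng.standard_normal(16))
feat = joint_region_features(tokens, scores)
print(feat.shape)  # (64,)
```

The bilinear (outer-product) fusion captures pairwise interactions between the salient and sub-salient descriptors, which is one common way to combine complementary region features in fine-grained classification.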
Key words
fine-grained visual classification/discriminative region/vision Transformer/self-attention mechanism