Semantics-based local attention Vision Transformer method for small datasets
When trained from scratch on small datasets, Vision Transformers (ViT) cannot match convolutional neural networks of the same scale. Image-based local attention methods can significantly improve the data efficiency of ViT, but they lose information between distant yet related patches. To address these problems, this paper proposes a bidirectional parallel local attention Vision Transformer method. The method first groups patches at the feature level and performs local attention within each group, exploiting relationships between patches in feature space to compensate for this information loss. Second, to fuse information between patches effectively, it combines semantics-based local attention and image-based local attention in parallel, enhancing the performance of the ViT model on small datasets through bidirectional adaptive learning. Experimental results show that the method achieves 97.93% and 85.80% accuracy on the CIFAR-10 and CIFAR-100 datasets respectively, at a computational cost of 15.2 GFLOPs with 57.2M parameters. Compared with other methods, the bidirectional parallel local attention Vision Transformer retains the properties required for effective local attention while enhancing local guidance capability.
deep learning; image classification; Transformer; local attention; semantics-based local attention
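To make the two mechanisms concrete, the following minimal PyTorch sketch illustrates semantics-based local attention (attention restricted to patches grouped by feature similarity) and its parallel fusion with an image-based local branch. The prototype-based grouping, the 1-D window stand-in for 2-D spatial windows, the scalar sigmoid gate, and all class and parameter names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGroupAttention(nn.Module):
    """Semantics-based local attention: each patch token is assigned to the
    nearest of a set of learnable feature-space prototypes (an assumed
    grouping rule), and multi-head attention is computed only among tokens
    that share a group."""
    def __init__(self, dim, num_heads=8, num_groups=4):
        super().__init__()
        self.num_groups = num_groups
        self.prototypes = nn.Parameter(torch.randn(num_groups, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                # x: (B, N, C)
        B, N, C = x.shape
        sim = F.normalize(x, dim=-1) @ F.normalize(self.prototypes, dim=-1).t()
        group_id = sim.argmax(dim=-1)                    # (B, N) group index per token
        out = torch.zeros_like(x)
        for b in range(B):                               # naive loops for clarity, not speed
            for g in range(self.num_groups):
                idx = (group_id[b] == g).nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                tokens = x[b, idx].unsqueeze(0)          # (1, n_g, C)
                attended, _ = self.attn(tokens, tokens, tokens)
                out[b, idx] = attended.squeeze(0)
        return out

class BidirectionalParallelLocalAttention(nn.Module):
    """Parallel fusion of an image-based local branch and the semantic branch.
    The image branch attends within fixed contiguous 1-D windows as a
    stand-in for 2-D spatial windows, and a learned sigmoid gate blends the
    two branches adaptively (both simplifications of the paper's design)."""
    def __init__(self, dim, num_heads=8, window=7, num_groups=4):
        super().__init__()
        self.window = window
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.semantic_attn = SemanticGroupAttention(dim, num_heads, num_groups)
        self.gate = nn.Parameter(torch.zeros(1))         # sigmoid(0) = 0.5: equal blend at init

    def forward(self, x):                                # x: (B, N, C), N divisible by window
        B, N, C = x.shape
        xw = x.reshape(B * N // self.window, self.window, C)
        img_out, _ = self.image_attn(xw, xw, xw)
        img_out = img_out.reshape(B, N, C)
        sem_out = self.semantic_attn(x)
        g = torch.sigmoid(self.gate)
        return g * img_out + (1.0 - g) * sem_out         # adaptive parallel fusion

# usage: 196 patch tokens (14x14 grid) of dimension 64
x = torch.randn(2, 196, 64)
layer = BidirectionalParallelLocalAttention(dim=64, num_heads=8, window=14, num_groups=4)
print(layer(x).shape)                                    # torch.Size([2, 196, 64])
```

The learned gate is one simple way to realize the adaptive weighting between the two branches; the paper's bidirectional adaptive learning may instead use per-token or cross-branch interactions.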