Fine-grained visual classification based on a compact Vision Transformer
The Vision Transformer (ViT) has been widely used in fine-grained visual classification (FGVC). Building on it, a compact structure is proposed to overcome the problems of excessive data requirements and high computational complexity. First, it employs multi-layer convolutional blocks to generate the model input, which retains more low-level information, introduces inductive bias, and reduces reliance on data. Second, it uses sequence pooling to eliminate the need for a class token and to decrease computational complexity. Finally, it uses a part selection module and a mixed loss to further improve performance on FGVC. With less data and limited computational resources, the structure produces superior results (88.9%, 87.4%, 89.0%, 93.4%, and 88.0%, respectively) on the common datasets CUB-200-2011, Butterfly200, Stanford Dogs, Stanford Cars, and NABirds. Training time decreases on average by 73.8% compared with ViT-B_16 and by 93.9% compared with TransFG, while the parameter count is only roughly one-fourth of both. Experiments show that the proposed model is superior to other popular methods in terms of data requirements and computational complexity. It can be effectively applied to industrial process control and to the detection and diagnosis of minor equipment faults.
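The sequence pooling mentioned above replaces the class token with a learned, attention-like weighted average over the transformer's output tokens. The following NumPy sketch is a minimal illustration of that idea; the function name, the single projection vector `w`, and the shapes are hypothetical, assuming the mechanism follows the compact-transformer literature rather than the paper's exact implementation.

```python
import numpy as np

def sequence_pool(tokens, w):
    """Minimal sequence-pooling sketch (hypothetical names).

    tokens: (n, d) output token sequence from the transformer encoder.
    w:      (d,)   learned projection that scores each token's importance.
    Returns a single (d,) pooled representation, removing the need
    for a separate class token.
    """
    logits = tokens @ w                    # (n,) one importance score per token
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()               # softmax over the sequence dimension
    return weights @ tokens                # (d,) weighted average of the tokens

# Toy usage: 5 tokens of dimension 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
w = rng.normal(size=8)
pooled = sequence_pool(tokens, w)
print(pooled.shape)  # (8,)
```

Because the pooled vector is a convex combination of the tokens, it stays in the same representation space and can be fed directly to the classification head, which is why the class token (and its extra computation) can be dropped.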
Keywords: compact Vision Transformer; fine-grained visual classification; convolutional blocks; inductive bias; sequence pooling; mixed loss