Fine-grained visual classification based on compact Vision transformer

Vision transformer (ViT) has been widely used in fine-grained visual classification (FGVC), but it requires large amounts of training data and has high computational complexity. To address these problems, a compact ViT model is proposed. First, multi-layer convolutional blocks are used to generate the model input, retaining more low-level information and inductive bias and reducing the dependence on data volume. Second, sequence pooling is used to remove the class token and lower the computational complexity. Finally, a part selection module and a mixed loss function are used to further improve performance on fine-grained visual classification. The proposed method is validated on the public datasets CUB-200-2011, Butterfly200, Stanford Dogs, Stanford Cars and NABirds, achieving accuracies of 88.9%, 87.4%, 89.0%, 93.4% and 88.0% respectively while using only a small amount of data and limited computational resources. Its training time is on average 73.8% lower than that of the commonly used ViT-B_16 model and 93.9% lower than that of the TransFG model, and the number of parameters during training is only about one quarter of either of these models. The experimental results show that the proposed model is clearly superior to other mainstream methods in terms of data requirements and computational complexity, and it can be widely applied to industrial process control and to the detection and diagnosis of minor equipment faults.
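As an illustration of the two compactness mechanisms named in the abstract (a multi-layer convolutional block that produces the token sequence instead of ViT's patch projection, and sequence pooling that replaces the class token), below is a minimal PyTorch sketch. The module names, layer counts, embedding dimension and the 200-class head are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a compact ViT front end and sequence pooling (assumed sizes).
import torch
import torch.nn as nn


class ConvTokenizer(nn.Module):
    """Multi-layer convolutional stem: image -> sequence of token embeddings."""
    def __init__(self, in_channels=3, embed_dim=384, num_layers=3):
        super().__init__()
        layers, channels = [], in_channels
        for i in range(num_layers):
            out_channels = embed_dim if i == num_layers - 1 else embed_dim // 2
            layers += [
                nn.Conv2d(channels, out_channels, kernel_size=3, stride=1, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # downsample to a coarser grid
            ]
            channels = out_channels
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                        # x: (B, 3, H, W)
        feat = self.stem(x)                      # (B, D, H', W')
        return feat.flatten(2).transpose(1, 2)   # (B, N, D) token sequence


class SequencePooling(nn.Module):
    """Attention-weighted pooling over tokens, removing the need for a class token."""
    def __init__(self, embed_dim=384):
        super().__init__()
        self.attn = nn.Linear(embed_dim, 1)

    def forward(self, tokens):                                    # tokens: (B, N, D)
        weights = torch.softmax(self.attn(tokens), dim=1)         # (B, N, 1) weights over tokens
        return (weights.transpose(1, 2) @ tokens).squeeze(1)      # (B, D) pooled feature


# Usage: tokenize, run a standard transformer encoder, pool, classify.
tokenizer = ConvTokenizer()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True), num_layers=7)
pool = SequencePooling()
head = nn.Linear(384, 200)                       # e.g. 200 classes for CUB-200-2011

x = torch.randn(2, 3, 224, 224)
logits = head(pool(encoder(tokenizer(x))))       # (2, 200)
```

The convolutional stem keeps local inductive bias and low-level detail before the transformer encoder, and the pooled representation feeds the classifier directly; the part selection module and mixed loss mentioned in the abstract are not sketched here.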

compact Vision transformer; fine-grained visual classification; convolutional blocks; inductive bias; sequence pooling; mixed loss

徐昊、郭黎、李润泽


School of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China

School of Intelligent Science and Engineering, Hubei Minzu University, Enshi 445000, China

College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China


National Natural Science Foundation of China (62263010, 62020106003)

2024

Control and Decision

Northeastern University

Indexed by: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.227
ISSN: 1001-0920
Year, volume (issue): 2024, 39(3)