Image retrieval based on self-ensemble Vision Transformer

Vision Transformer has shown excellent performance on image classification tasks. Compared with convolutional neural networks, the self-attention mechanism in its architecture can effectively suppress the influence of noise in images and extract representations of their key features. Since image retrieval requires high-quality feature description vectors to be obtained from images in order to improve retrieval accuracy, a feature extraction framework with a Vision Transformer model as the backbone network is proposed. Targeting the multi-layer self-attention structure of the Vision Transformer, the feature outputs of multiple self-attention layers are fused by self-ensembling into the final image feature, so as to improve the retrieval performance of the model. Experiments on the popular public datasets In-shop Clothes Retrieval and Stanford Online Products show that the proposed method effectively improves the retrieval performance of features extracted by the Vision Transformer and outperforms six other state-of-the-art image retrieval methods.
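
To make the self-ensemble idea concrete, the sketch below collects the class-token output of several ViT encoder blocks with forward hooks and averages them into a single L2-normalized retrieval descriptor. The timm backbone name, the choice of the last three blocks, and mean fusion are illustrative assumptions; the abstract does not specify the paper's exact fusion scheme.

```python
# A minimal sketch of a self-ensemble ViT descriptor, assuming the timm
# backbone "vit_base_patch16_224"; layer indices and mean fusion are
# illustrative choices, not the paper's confirmed configuration.
import torch
import torch.nn.functional as F
import timm

vit = timm.create_model("vit_base_patch16_224", pretrained=True,
                        num_classes=0).eval()

ensemble_layers = (9, 10, 11)   # assumed: ensemble the last three blocks
cls_feats = []                  # CLS token collected after each chosen block

def grab_cls(module, inputs, output):
    # Each encoder block outputs [batch, tokens, dim]; token 0 is CLS.
    cls_feats.append(output[:, 0])

hooks = [vit.blocks[i].register_forward_hook(grab_cls)
         for i in ensemble_layers]

with torch.no_grad():
    images = torch.randn(2, 3, 224, 224)   # stand-in for two query images
    vit(images)                            # hooks fill cls_feats as a side effect

# Self-ensemble: average the per-layer CLS embeddings into one descriptor,
# then L2-normalize so cosine similarity can rank gallery images.
descriptor = F.normalize(torch.stack(cls_feats).mean(dim=0), dim=-1)
print(descriptor.shape)  # torch.Size([2, 768])

for h in hooks:
    h.remove()
```

At training time such a descriptor would typically be optimized with a ranking loss (e.g., a triplet or contrastive objective) over matching and non-matching image pairs, in line with the "ranking loss" keyword.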

Keywords: Vision Transformer; image retrieval; self-ensemble; self-attention; ranking loss

HUANG Xi, WANG Xianbing, LIN Hai, HE Tao

Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, Hubei, China

Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

Engineering Journal of Wuhan University (武汉大学学报(工学版))
Wuhan University

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.621
ISSN: 1671-8844
Year, Volume (Issue): 2024, 57(12)