Image retrieval based on self-ensemble Vision Transformer
Vision Transformer has shown excellent performance in image classification tasks. Compared with convolutional neural networks, the self-attention mechanism in its structure can suppress the influence of image noise and extract the key feature information of an image. The task of image retrieval depends on extracting high-quality feature description vectors from images in order to improve retrieval accuracy. In view of this, this paper proposes a feature extraction framework with the Vision Transformer model as the backbone network. To exploit the multi-layer self-attention structure of the Vision Transformer, the feature outputs of multiple self-attention layers are integrated into the final image feature in a self-ensemble manner to improve the retrieval performance of the model. Experiments are conducted on two popular public datasets, In-shop Clothes Retrieval and Stanford Online Products, and the results show that the proposed method effectively improves the retrieval performance of features extracted by the Vision Transformer and outperforms six advanced image retrieval methods.
Vision Transformer; image retrieval; self-ensemble; self-attention; ranking loss
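
The self-ensemble idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state how the per-layer features are fused, so the rule used here (L2-normalize the feature of each of the last `last_k` layers, average them, and normalize again) is an assumption made only for illustration, as are the function names `self_ensemble` and `retrieve` and the use of cosine similarity for ranking. Each per-layer feature is taken to be a plain list of floats, for example the class-token output of one self-attention layer.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def self_ensemble(layer_features: Sequence[Sequence[float]], last_k: int) -> Vector:
    """Fuse the features of the last `last_k` layers into one descriptor.

    Assumed fusion rule (not taken from the paper): normalize each layer
    feature, average the normalized features, then normalize the average.
    """
    if not 1 <= last_k <= len(layer_features):
        raise ValueError("last_k must be between 1 and the number of layers")
    selected = [l2_normalize(f) for f in layer_features[-last_k:]]
    dim = len(selected[0])
    mean = [sum(f[i] for f in selected) / last_k for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the unit-length versions of two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def retrieve(query: Sequence[float], gallery: Sequence[Sequence[float]], top_k: int) -> List[int]:
    """Return the indices of the `top_k` gallery descriptors most similar to the query."""
    scored = [(cosine_similarity(query, g), idx) for idx, g in enumerate(gallery)]
    # Sort by descending similarity; ties are broken by the lower gallery index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]
```

In this sketch, `self_ensemble` would be applied to the per-layer features of every query and gallery image, and `retrieve` would then rank the gallery by cosine similarity to the query descriptor. Setting `last_k=1` reduces to using only the final layer, which gives a simple baseline for comparison.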