Attention set representation for multiscale measurement of few-shot image classification
Objective Few-shot image classification refers to training a machine learning model that can effectively classify target images when only a limited number of labeled training samples is available. The main challenge lies in this lack of sufficient data: only a small amount of labeled data can be used for model training. Numerous advanced models have been proposed to tackle this challenge. A common and efficient strategy is to use deep networks as feature extractors, that is, models that automatically extract valuable features from input images. Through multiple layers of convolution and pooling operations, these networks produce feature vectors that can be used to determine the category of an image and thereby accomplish classification. During training, the feature extractor gradually learns to extract information relevant to the category of the image, and this information forms the feature vector. Even when trained on limited labeled data, such models can achieve high accuracy by leveraging the power of deep learning. However, compressing an image into a feature vector carries the risk of losing valuable information, including information strongly associated with the specific category; crucial cues that could substantially enhance classification accuracy may be discarded. The extracted feature vectors should therefore retain as much category-specific information as possible. This paper introduces a novel rich representation feature extractor (RireFeat), built on the base classes, to achieve an extensive and comprehensive image representation.

Method This paper proposes the RireFeat feature extractor to achieve highly comprehensive, class-specific feature extraction. RireFeat mainly aims to enhance the exchange and flow of information within the feature extractor, thereby facilitating the extraction of class-related features. In addition, the method attends to the multilayer feature vectors before and after each stage of the extractor so that information useful for classification is retained throughout the feature extraction process. RireFeat adopts a pyramid-like design that divides the feature extractor into multiple levels. Each level receives image encoding information from the level above it, and after several convolution and pooling operations at that level, the resulting information flows to the next level. This hierarchical structure facilitates the transfer and fusion of information between levels, maximizing the use of the extracted image information within the feature extractor. The category correlation of the feature vectors is thereby deepened, improving image classification accuracy. RireFeat also demonstrates strong generalization and can readily adapt to novel image classification tasks. Specifically, this paper starts from the feature extraction process: as image information traverses the multilevel hierarchy, category-related local features are extracted while category-irrelevant information is discarded. However, this process may also remove some category-specific information. To address this issue, RireFeat integrates small shaping modules that connect levels across the hierarchy at a distance, so that image information can still flow and merge after crossing levels. This design lets the network pay additional attention to changes in features before and after each level, effectively extracting local features while disregarding information unrelated to the specific category, which notably enhances classification accuracy. This paper also introduces the idea of contrastive learning into few-shot image classification and combines it with deep Brownian distance covariance to measure image features at multiple scales within the contrastive loss, pulling embeddings from the same distribution closer together while pushing those from different distributions farther apart. In the experiments, the SetFeat method is used to extract a feature set for each image. For training, as in other few-shot learning methods, the entire network is first pretrained and then fine-tuned in the meta-training stage, where classification is performed by computing the distance between the query (test) and support (training) feature sets.

Result 1-shot and 5-shot classification experiments are conducted on the standard few-shot datasets MiniImageNet, TieredImageNet, and CUB to verify the effectiveness of the proposed feature extraction structure. Experimental results show that on MiniImageNet, RireFeat achieves 0.64% and 1.10% higher accuracy than SetFeat in the 1-shot and 5-shot settings with a convolution-based backbone, and 1.51% and 1.46% higher with a ResNet-12-based backbone, respectively. On CUB, the gains over SetFeat are 0.03% and 0.61% at 1-shot and 5-shot with the convolution-based backbone, and 0.66% and 0.75% with the ResNet-12-based backbone, respectively. On TieredImageNet, the convolution-based backbone achieves improvements of 0.21% and 0.38% over SetFeat under the 1-shot and 5-shot conditions, respectively.

Conclusion This paper proposes a rich representation feature extractor (RireFeat) to obtain rich, comprehensive, and accurate feature representations for few-shot image classification. Unlike traditional feature extractors and extraction schemes, RireFeat increases the flow of information between parts of the feature extraction network by attending to the changes in features before and after network transmission, effectively reintegrating category information lost during feature extraction back into the feature representation. In addition, the concept of contrastive learning combined with deep Brownian distance covariance is introduced into few-shot image classification to learn additional categorical representations for each image, enabling the extractor to capture highly nuanced differences between images from different categories and improving classification performance. Moreover, a feature-vector set is extracted from each image to provide strong support for the subsequent classification task. The proposed method achieves high classification accuracy on the MiniImageNet, TieredImageNet, and CUB datasets, and its universality is verified with currently popular deep learning backbones, both convolutional and residual, highlighting its applicability to current state-of-the-art models.
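The set-based, multiscale measurement described above can be illustrated with the classical empirical (squared) Brownian distance covariance between two per-image feature sets. The NumPy sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the two sets are assumed to have the same number of vectors (as SetFeat produces), and the nearest-class rule uses the normalized form (distance correlation) rather than the paper's actual deep BDC module and contrastive loss.

```python
import numpy as np

def centered_distance_matrix(x):
    # x: (n, d) set of n feature vectors extracted from one image.
    # Pairwise Euclidean distances, double-centered: subtract the row
    # and column means, then add back the grand mean.
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def bdc_similarity(set_a, set_b):
    # Empirical squared Brownian distance covariance between two
    # equally sized feature sets: the mean elementwise product of
    # their double-centered distance matrices.
    a = centered_distance_matrix(set_a)
    b = centered_distance_matrix(set_b)
    return float((a * b).mean())

def dcor_similarity(set_a, set_b):
    # Normalized variant (empirical distance correlation): equals 1.0
    # when one set is a uniformly scaled copy of the other, making it
    # a safer score for a nearest-class rule than the raw statistic,
    # whose magnitude grows with the scale of the sets.
    den = np.sqrt(bdc_similarity(set_a, set_a) * bdc_similarity(set_b, set_b))
    return bdc_similarity(set_a, set_b) / den if den > 0 else 0.0

def classify(query_set, support_sets):
    # Assign the query image's feature set to the class whose support
    # feature set scores highest under the normalized measure.
    return int(np.argmax([dcor_similarity(query_set, s) for s in support_sets]))
```

Because the double-centered distance matrices depend only on pairwise distances, the statistic is invariant to translating a feature set, and the normalized score is additionally invariant to scaling it; both properties are convenient when comparing embeddings across episodes.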