VisFEM: A Dual-View Visual Feature Extraction Model Based on Cross Attention
When attention-based models are applied to computer vision tasks, the global feature extraction ability of the attention mechanism is weak. To address this, a cross-attention-based dual-view visual feature extraction model, VisFEM, is proposed. The model adopts an encoder-decoder architecture, extracts coarse-grained and fine-grained features from two views through a cross attention mechanism, and fuses the output features of the different encoders to improve the model's global feature extraction ability. On the ImageNet high-definition classification dataset, the model reaches 84.3% accuracy, and on the retrieval task its recall reaches 0.39.
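The core idea described above can be sketched with plain NumPy: queries from one view attend over keys/values from the other view, and the two cross-attended outputs are fused. This is a minimal illustration under assumed shapes and a simple mean-pooling fusion, not the paper's actual implementation; all function names and the fusion step are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Scaled dot-product cross attention: queries come from one view,
    keys and values from the other view."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (Nq, Nk) similarity
    weights = softmax(scores, axis=-1)           # rows sum to 1
    return weights @ kv_feats                    # (Nq, d) attended output

# Toy dual-view token features: coarse-grained (e.g. large patches)
# and fine-grained (e.g. small patches); shapes are illustrative.
rng = np.random.default_rng(0)
coarse = rng.standard_normal((4, 8))    # 4 coarse tokens, dim 8
fine = rng.standard_normal((16, 8))     # 16 fine tokens, dim 8

# Each view attends to the other; the outputs are then fused by
# averaging pooled representations (a simple stand-in for the
# encoder-output fusion the abstract describes).
coarse_out = cross_attention(coarse, fine)   # (4, 8)
fine_out = cross_attention(fine, coarse)     # (16, 8)
fused = 0.5 * (coarse_out.mean(axis=0) + fine_out.mean(axis=0))  # (8,)
```

Pooling and averaging is only one possible fusion; concatenation followed by a learned projection is an equally common choice.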
Keywords: deep learning; computer vision; encoder-decoder; cross attention mechanism