A Visual Grounding Method Based on a Large-Scale Contrastive Learning Model
One-stage visual grounding methods, which predict target boxes directly from fused image and text features, have received widespread attention for their speed. However, existing methods do not align image and text features before feature fusion, which limits grounding accuracy. To address this problem, this paper proposes a visual grounding method based on a large-scale contrastive learning model. The method extracts image and text features with CLIP (Contrastive Language-Image Pre-training), a large-scale pre-trained model based on contrastive learning, fuses the image-text features with Transformer encoders, and predicts target boxes from the fused features with a multi-layer perceptron. The method overcomes the above shortcoming for two reasons: the CLIP encoders extract image and text features that are highly aligned in semantics, and global attention interactively fuses the contextual features of images and text. The proposed method was experimentally validated on five datasets, and the results show that it improves overall accuracy compared with existing visual grounding methods.
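To make the described pipeline concrete, the following is a minimal PyTorch sketch of the fusion and prediction stages. It assumes pre-extracted CLIP feature sequences as input rather than re-implementing the CLIP encoders; the layer sizes, the learnable regression token, and the MLP head structure are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GroundingFusionHead(nn.Module):
    """Fuses aligned image/text features and regresses one target box.

    Hypothetical sketch: d_model, head/layer counts, and the [REG]
    token design are assumptions made for illustration.
    """

    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        # Learnable regression token prepended to the fused sequence.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        # MLP head: fused feature -> normalized box (cx, cy, w, h).
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4),
        )

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, d); txt_feats: (B, N_txt, d),
        # both assumed to come from the CLIP image/text encoders.
        b = img_feats.size(0)
        seq = torch.cat(
            [self.reg_token.expand(b, -1, -1), img_feats, txt_feats], dim=1
        )
        fused = self.fusion(seq)  # global attention across both modalities
        # Predict the box from the regression token's fused feature.
        return self.mlp(fused[:, 0]).sigmoid()

# Usage with random stand-ins for CLIP feature sequences:
img = torch.randn(2, 50, 512)   # e.g. ViT patch tokens
txt = torch.randn(2, 20, 512)   # text token features
box = GroundingFusionHead()(img, txt)  # (2, 4) normalized boxes
```

Because the Transformer encoder attends over the concatenated image and text tokens, every fused feature can draw on context from both modalities, which is the "interactive fusion via global attention" the abstract refers to.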