
A Visual Grounding Method with Contrastive Learning Large Model
The one-stage visual grounding method has received widespread attention due to its speed; it uses fused image and text features to predict target boxes. However, existing methods do not align image and text features before feature fusion, which limits the accuracy of visual grounding. To solve this problem, this paper proposes a visual grounding method based on a contrastive learning large model. The method extracts image and text features with CLIP (Contrastive Language-Image Pre-training), a large-scale pre-trained model based on contrastive learning; it fuses the image and text features with a Transformer encoder and predicts target boxes from the fused features with a multi-layer perceptron. The method overcomes the above shortcoming for two reasons: the CLIP encoders extract image and text features that are highly aligned in semantics, and global attention interactively fuses the contextual features of the image and text. The proposed method was experimentally validated on five datasets, and the results show that, compared with existing visual grounding methods, it achieves an improvement in overall accuracy.
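To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of a one-stage grounding model of this kind. It is an illustration under stated assumptions, not the authors' implementation: the class name CLIPGroundingSketch, the Hugging Face checkpoint openai/clip-vit-base-patch32, the fusion width d_model=256, the learnable regression token, and the box-head depth are all hypothetical choices; the abstract only specifies that CLIP encoders extract the features, a Transformer encoder fuses them, and a multi-layer perceptron regresses the target box.

import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPTextModel

class CLIPGroundingSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        # Contrastively pre-trained CLIP encoders: their output spaces are
        # already semantically aligned, which is the property the abstract
        # relies on before fusion.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
        # Project both token sequences to a shared fusion width.
        self.v_proj = nn.Linear(self.vision.config.hidden_size, d_model)
        self.t_proj = nn.Linear(self.text.config.hidden_size, d_model)
        # A learnable regression token collects the fused evidence (one
        # plausible design; the abstract does not say how pooling is done).
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # Transformer encoder: global self-attention over the concatenated
        # image and text tokens fuses contextual features of both modalities.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers)
        # Multi-layer perceptron regresses a normalized box (cx, cy, w, h).
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        v = self.v_proj(self.vision(pixel_values=pixel_values).last_hidden_state)
        t = self.t_proj(self.text(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state)
        reg = self.reg_token.expand(v.size(0), -1, -1)
        tokens = torch.cat([reg, v, t], dim=1)       # [B, 1 + Nv + Nt, d_model]
        fused = self.fusion(tokens)                  # global image-text attention
        return self.box_head(fused[:, 0]).sigmoid()  # normalized (cx, cy, w, h)

if __name__ == "__main__":
    # Smoke test with random tensors of CLIP's expected input shapes.
    model = CLIPGroundingSketch()
    boxes = model(torch.randn(2, 3, 224, 224),
                  torch.randint(0, 49408, (2, 16)),
                  torch.ones(2, 16, dtype=torch.long))
    print(boxes.shape)  # torch.Size([2, 4])

Attaching the box head to a dedicated regression token mirrors one-stage designs such as TransVG; positional encodings for the fusion encoder and the training loss (for example, L1 plus GIoU on the predicted box) are omitted here for brevity.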

visual grounding; contrastive learning; Transformer; attention; large model; alignment

LU Qingyang, YUAN Guanglin, ZHU Hong, QIN Xiaoyan, XUE Mogen

Graduate Brigade, PLA Army Academy of Artillery and Air Defense, Hefei 230031, Anhui, China

Department of Information Engineering, PLA Army Academy of Artillery and Air Defense, Hefei 230031, Anhui, China

Anhui Province Key Laboratory of Polarization Imaging Detection Technology, Hefei 230031, Anhui, China

2024

Acta Electronica Sinica
Chinese Institute of Electronics

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.237
ISSN: 0372-2112
Year, Volume (Issue): 2024, 52(10)