

Zero-shot multi-label image classification with image-text prompts and cross-modal adapter
Recent approaches to zero-shot multi-label image classification primarily rely on the vision-language pre-trained model CLIP (contrastive language-image pre-training). However, these works only improve the text prompts and ignore the interaction between the image and text modalities. To address this problem, we propose a zero-shot multi-label image classification method combining image-text prompts and a cross-modal adapter (ITPCA) to fully exploit the image-text matching ability of the vision-language pre-trained model. By incorporating prompt learning to design prompts for both the image and text branches, the model's generalization to different labels is improved. Additionally, a cross-modal adapter is designed to build connections between the image and text modalities. Experimental results show that our method outperforms other zero-shot multi-label image classification methods on the NUS-WIDE and MS-COCO multi-label datasets.
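As context for the abstract, the pipeline it describes — frozen CLIP-style image and text features, learnable prompts on both branches, and a cross-modal adapter that links the two modalities before cosine matching — can be sketched as a minimal NumPy toy. All dimensions, the bottleneck adapter design, and the random initializations below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize vectors to unit length for cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8           # embedding dim (CLIP actually uses 512/768)
num_labels = 5  # candidate labels for multi-label scoring

# Stand-ins for frozen CLIP encoder outputs.
img_feat = rng.normal(size=(d,))
txt_feats = rng.normal(size=(num_labels, d))  # one embedding per label prompt

# Learnable prompt offsets for the image and text branches
# (hypothetical: randomly initialized, would be trained in practice).
img_prompt = rng.normal(scale=0.01, size=(d,))
txt_prompt = rng.normal(scale=0.01, size=(num_labels, d))

# Cross-modal adapter: a small residual bottleneck MLP that mixes the
# image feature with the pooled text features (one plausible design).
W_down = rng.normal(scale=0.1, size=(2 * d, d // 2))
W_up = rng.normal(scale=0.1, size=(d // 2, d))

def adapter(img, txts):
    joint = np.concatenate([img, txts.mean(axis=0)])  # fuse both modalities
    h = np.maximum(joint @ W_down, 0.0)               # ReLU bottleneck
    return img + h @ W_up                             # residual connection

prompted_txt = txt_feats + txt_prompt
fused = adapter(img_feat + img_prompt, prompted_txt)

# Per-label cosine scores; sigmoid (not softmax) since labels are not
# mutually exclusive in the multi-label setting. 0.07 mirrors CLIP's
# temperature.
scores = l2norm(fused) @ l2norm(prompted_txt).T
probs = 1.0 / (1.0 + np.exp(-scores / 0.07))
```

The sigmoid over independent per-label similarities (rather than a softmax over all labels) is what makes the matching multi-label: each label is accepted or rejected on its own score.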

vision and language pre-training model; prompt learning; zero-shot learning; multi-label image classification

SONG Tiecheng, HUANG Yu


School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China


2024

Journal of Chongqing University of Technology
Chongqing University of Technology


Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.567
ISSN: 1674-8425
Year, volume (issue): 2024, 38(23)