Recent zero-shot multi-label image classification methods are mainly based on the vision-language pre-trained model CLIP (contrastive language-image pre-training). However, these works improve only the text prompts and ignore the interaction between the image and text modalities. To address this problem, we propose a zero-shot multi-label image classification method with image-text prompts and a cross-modal adapter (ITPCA), which fully exploits the image-text matching ability of vision-language pre-trained models. Prompt learning is applied to design prompts for both the image and text branches, improving the model's generalization to different labels. In addition, a cross-modal adapter is designed to establish connections between the image and text modalities. Experimental results on the NUS-WIDE and MS-COCO multi-label datasets show that the proposed method outperforms other zero-shot multi-label image classification methods.
Zero-shot multi-label image classification with image-text prompts and cross-modal adapter
Recent approaches to zero-shot multi-label image classification rely primarily on the vision-language pre-trained model CLIP. However, they improve only the text prompts and ignore the interaction between the image and text modalities. To address these problems, we propose a zero-shot multi-label image classification method combining image-text prompts and a cross-modal adapter (ITPCA) to fully exploit the image-text matching ability of vision-language pre-trained models. By combining prompt learning to design prompts for the image and text branches, the generalization ability of the model to different labels is improved. Additionally, a cross-modal adapter is designed to build connections between the image and text modalities. Experimental results on the NUS-WIDE and MS-COCO multi-label datasets show that our method outperforms other zero-shot multi-label image classification methods.
vision and language pre-training model; prompt learning; zero-shot learning; multi-label image classification
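The abstract describes a cross-modal adapter that establishes connections between the image and text modalities. The paper itself does not give the adapter's implementation, but the idea can be sketched as a simple cross-attention layer in which per-label text embeddings attend to patch-level image features; all function names, dimensions, and the residual mixing ratio below are illustrative assumptions, not the authors' design.

```python
# Minimal sketch of a cross-modal adapter (assumption: single-head
# cross-attention with a residual connection; not the paper's exact design).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_adapter(text_feats, image_feats, ratio=0.5):
    """Let label (text) embeddings attend to patch-level image features.

    text_feats:  (num_labels, dim)  -- one CLIP text embedding per label
    image_feats: (num_patches, dim) -- patch-level CLIP image features
    ratio:       residual mixing weight (hypothetical hyperparameter)
    """
    d = text_feats.shape[-1]
    # Attention weights: each label distributes attention over image patches.
    attn = softmax(text_feats @ image_feats.T / np.sqrt(d))  # (labels, patches)
    attended = attn @ image_feats                            # (labels, dim)
    # Residual connection keeps the original text semantics dominant.
    return text_feats + ratio * attended

def label_scores(text_feats, image_feats):
    """Score each label by cosine similarity between its adapted text
    embedding and the mean-pooled image feature."""
    adapted = cross_modal_adapter(text_feats, image_feats)
    pooled = image_feats.mean(axis=0)
    sims = adapted @ pooled
    norms = np.linalg.norm(adapted, axis=-1) * np.linalg.norm(pooled)
    return sims / norms  # (num_labels,) cosine similarities
```

In a multi-label setting, each label's score would then be thresholded (or passed through a sigmoid) independently, rather than taking a single argmax as in single-label classification.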