Recent zero-shot multi-label image classification methods are mainly based on the vision-language pre-trained model CLIP (contrastive language-image pre-training). However, these works improve only the text prompts and ignore the interaction between the image and text modalities. To address this problem, we propose a zero-shot multi-label image classification method with image-text prompts and a cross-modal adapter (ITPCA), which fully exploits the image-text matching ability of vision-language pre-trained models. Prompt learning is applied to design prompts for both the image and text branches, improving the model's generalization to different labels. In addition, a cross-modal adapter is designed to establish connections between the image and text modalities. Experimental results on the NUS-WIDE and MS-COCO multi-label datasets show that the proposed method outperforms other zero-shot multi-label image classification methods.
Zero-shot multi-label image classification with image-text prompts and cross-modal adapter
Recent approaches to zero-shot multi-label image classification rely primarily on the vision-language pre-trained model CLIP. However, they improve only the text prompts and ignore the interaction between the image and text modalities. To address these problems, we propose a zero-shot multi-label image classification method combining image-text prompts and a cross-modal adapter (ITPCA) to fully exploit the image-text matching ability of vision-language pre-trained models. By combining prompt learning to design prompts for the image and text branches, the generalization ability of the model to different labels is improved. Additionally, a cross-modal adapter is designed to build connections between the image and text modalities. Experimental results on the NUS-WIDE and MS-COCO multi-label datasets show that our method outperforms other zero-shot multi-label image classification methods.
vision and language pre-training model; prompt learning; zero-shot learning; multi-label image classification
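The abstract describes a cross-modal adapter that establishes connections between the image and text modalities. The paper itself does not give the adapter's implementation, but the idea can be sketched as a simple cross-attention layer in which per-label text embeddings attend to patch-level image features; all function names, dimensions, and the residual mixing ratio below are illustrative assumptions, not the authors' design.

```python
# Minimal sketch of a cross-modal adapter (assumption: single-head
# cross-attention with a residual connection; not the paper's exact design).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_adapter(text_feats, image_feats, ratio=0.5):
    """Let label (text) embeddings attend to patch-level image features.

    text_feats:  (num_labels, dim)  -- one CLIP text embedding per label
    image_feats: (num_patches, dim) -- patch-level CLIP image features
    ratio:       residual mixing weight (hypothetical hyperparameter)
    """
    d = text_feats.shape[-1]
    # Attention weights: each label distributes attention over image patches.
    attn = softmax(text_feats @ image_feats.T / np.sqrt(d))  # (labels, patches)
    attended = attn @ image_feats                            # (labels, dim)
    # Residual connection keeps the original text semantics dominant.
    return text_feats + ratio * attended

def label_scores(text_feats, image_feats):
    """Score each label by cosine similarity between its adapted text
    embedding and the mean-pooled image feature."""
    adapted = cross_modal_adapter(text_feats, image_feats)
    pooled = image_feats.mean(axis=0)
    sims = adapted @ pooled
    norms = np.linalg.norm(adapted, axis=-1) * np.linalg.norm(pooled)
    return sims / norms  # (num_labels,) cosine similarities
```

In a multi-label setting, each label's score would then be thresholded (or passed through a sigmoid) independently, rather than taking a single argmax as in single-label classification.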