Journal of Chongqing University of Technology, 2024, Vol. 38, Issue 23: 182-188. DOI: 10.3969/j.issn.1674-8425(z).2024.12.022

Zero-shot multi-label image classification with image-text prompts and cross-modal adapter

宋铁成 (SONG Tiecheng), 黄宇 (HUANG Yu)

Author information

  • 1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

Abstract

Recent approaches to zero-shot multi-label image classification rely primarily on the vision-language pre-trained model CLIP (contrastive language-image pre-training). However, these works improve only the text prompts and ignore the interaction between the image and text modalities. To address this problem, we propose a zero-shot multi-label image classification method that combines image-text prompts and a cross-modal adapter (ITPCA) to fully exploit the image-text matching ability of the vision-language pre-trained model. By applying prompt learning to design prompts for both the image and text branches, the model's ability to generalize to different labels is improved. In addition, a cross-modal adapter is designed to build connections between the image and text modalities. Experimental results show that the proposed method outperforms other zero-shot multi-label image classification methods on the NUS-WIDE and MS-COCO multi-label datasets.
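As a rough illustration of the pipeline the abstract describes, the sketch below mimics CLIP-style zero-shot multi-label scoring: an image embedding is compared against one text embedding per label, a small residual adapter mixes image context into the text features, and each label receives an independent sigmoid score (rather than a softmax, which would force labels to compete). Everything here is a stand-in: the random projections replace CLIP's frozen encoders, and the adapter is a hypothetical simplification, not the paper's actual ITPCA design.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                                # shared embedding dimension (illustrative)
labels = ["person", "dog", "beach"]   # unseen label set (illustrative)

# Stand-ins for CLIP's frozen image/text encoders: fixed random projections.
W_img = rng.standard_normal((3 * 32 * 32, D))
W_txt = rng.standard_normal((16, D))

def l2norm(v):
    """Unit-normalize along the last axis, as CLIP does before matching."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def encode_image(img):   # img: flattened pixels
    return l2norm(img @ W_img)

def encode_text(toks):   # toks: one toy 16-d token embedding per label
    return l2norm(toks @ W_txt)

# Hypothetical cross-modal adapter: a residual layer that mixes the image
# feature into each label's text feature, so the two modalities interact
# before matching (the role the paper assigns to its adapter).
W_a = 0.01 * rng.standard_normal((2 * D, D))

def adapter(txt_feats, img_feat):
    ctx = np.concatenate(
        [txt_feats, np.tile(img_feat, (len(txt_feats), 1))], axis=1)
    return l2norm(txt_feats + np.maximum(ctx @ W_a, 0.0))  # residual + ReLU

img = rng.standard_normal(3 * 32 * 32)
toks = rng.standard_normal((len(labels), 16))

f_img = encode_image(img)
f_txt = adapter(encode_text(toks), f_img)

# Per-label sigmoid over scaled cosine similarity: each label is scored
# independently, which is what makes the setup multi-label.
logits = 100.0 * (f_txt @ f_img)      # 100.0 ~ CLIP's logit scale
probs = 1.0 / (1.0 + np.exp(-logits))
predicted = [lab for lab, p in zip(labels, probs) if p > 0.5]
```

Because the label set enters only through its text embeddings, scoring a previously unseen label needs no retraining, which is the zero-shot property the method builds on.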

Key words

vision and language pre-training model / prompt learning / zero-shot learning / multi-label image classification


Publication year: 2024
Journal of Chongqing University of Technology (Chongqing University of Technology)
Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.567
ISSN: 1674-8425