Researchers at Tsinghua University Target Artificial Intelligence (CPT: Colorful Prompt Tuning for pre-trained vision-language models)
New study results on artificial intelligence have been published. According to news reporting from Beijing, People's Republic of China, by NewsRx journalists, research stated, "Vision-Language Pre-training (VLP) models have shown promising capabilities in grounding natural language in image data, facilitating a broad range of cross-modal tasks."

Financial supporters for this research include the National Natural Science Foundation of China.

The news journalists obtained a quote from the research from Tsinghua University: "However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VLP models for downstream tasks. To address the challenge, we present Color-based Prompt Tuning (CPT), a novel paradigm for tuning VLP models, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VLP models. Comprehensive experimental results show that CPT achieves state-of-the-art performance on zero/few-shot visual grounding (e.g., 75.1 zero-shot accuracy in RefCOCO evaluation), outperforming fine-tuned and other prompt-tuned models by a large margin. Moreover, CPT can also be easily extended to achieve promising zero/few-shot performance on other vision-language tasks, such as visual relation detection, visual commonsense reasoning and visual question answering."
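The quoted description suggests the core mechanism: overlay distinct colors on candidate image regions, then ask the pre-trained model to fill a color word into a masked text prompt, so that grounding reuses the same fill-in-the-blank objective as pre-training. The sketch below illustrates that idea in Python; it is not the authors' implementation, and the color palette, prompt template, and `vlp_mlm_score` function are hypothetical placeholders (the stub returns uniform scores so the example runs without real model weights).

```python
# Illustrative sketch of color-based prompt tuning (CPT) for visual grounding.
# Assumptions: `vlp_mlm_score` stands in for a real VLP model's
# masked-language-model head; the palette and prompt wording are made up.

from PIL import Image, ImageDraw

# A small palette of visually distinct marker colors (name, RGBA with alpha
# for a semi-transparent overlay). Values are illustrative only.
PALETTE = [
    ("red",   (255, 0, 0, 96)),
    ("green", (0, 255, 0, 96)),
    ("blue",  (0, 0, 255, 96)),
]

def color_region_proposals(image: Image.Image, boxes):
    """Overlay each region proposal with a distinct semi-transparent color,
    creating the co-referential visual markers CPT relies on.
    Proposals beyond the palette size are silently dropped in this sketch."""
    marked = image.convert("RGBA")
    overlay = Image.new("RGBA", marked.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    for (_, rgba), box in zip(PALETTE, boxes):
        draw.rectangle(box, fill=rgba)
    return Image.alpha_composite(marked, overlay)

def vlp_mlm_score(image: Image.Image, prompt: str, candidates):
    """Stand-in for a real VLP model: should return the masked-LM probability
    of each candidate color word at the [MASK] position. Here it returns
    uniform dummy scores so the sketch runs without model weights."""
    return {c: 1.0 / len(candidates) for c in candidates}

def ground(image, boxes, query: str):
    """Reformulate grounding as fill-in-the-blank: ask the model which
    color covers the queried object, then return that color's box."""
    marked = color_region_proposals(image, boxes)
    prompt = f"{query} is in [MASK] color."
    color_names = [name for name, _ in PALETTE[: len(boxes)]]
    scores = vlp_mlm_score(marked, prompt, color_names)
    best = max(scores, key=scores.get)
    return boxes[color_names.index(best)]

if __name__ == "__main__":
    img = Image.new("RGB", (640, 480), "white")  # placeholder image
    proposals = [(10, 10, 200, 200), (220, 10, 400, 200), (10, 220, 200, 400)]
    print(ground(img, proposals, "the horse carrying a man"))
```

The design point this illustrates is why the pre-train/fine-tune gap shrinks: because region choices are expressed as ordinary color words, the frozen masked-LM head can answer grounding queries directly, without a new task-specific output layer trained on large amounts of labeled data.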
Keywords: Tsinghua University, Beijing, People's Republic of China, Asia, Artificial Intelligence