How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges

Abstract: Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released on https://github.com/htqin/GoogleBard-VisUnderstand.

Keywords:

Google Bard, multi-modal understanding, visual comprehension, large language models, conversational AI, chatbot

Authors:

Haotong Qin, Ge-Peng Ji, Salman Khan, Deng-Ping Fan, Fahad Shahbaz Khan, Luc Van Gool

Affiliations:

Computer Vision Lab (CVL), ETH Zürich, Zürich 8001, Switzerland

College of Engineering, Computing & Cybernetics, Australian National University, Canberra 8105, Australia

Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi 999041, UAE

Publication year:

2023

DOI:

10.1007/s11633-023-1469-x

Journal: Machine Intelligence Research

Publisher: Institute of Automation, Chinese Academy of Sciences

Indexed in: CSTPCD, CSCD, Peking University Core Journals, EI

Impact factor: 0.49

ISSN: 2731-538X

Year, volume (issue): 2023, 20(5)

Reference count: 1