How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges
Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration has the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting a significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released at https://github.com/htqin/GoogleBard-VisUnderstand.
Keywords: Google Bard; multi-modal understanding; visual comprehension; large language models; conversational AI; chatbot
Haotong Qin, Ge-Peng Ji, Salman Khan, Deng-Ping Fan, Fahad Shahbaz Khan, Luc Van Gool