MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity

Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of vision large language models (VLLMs), existing visual instruction tuning datasets have the following limitations. (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may limit the model's ability to generate diverse outputs that are closer to real-world scenarios. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset, MMInstruct, which consists of 973k instructions from 24 domains. There are four instruction types: judgment, multiple-choice, long visual question answering, and short visual question answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experimental validation and ablation studies, we demonstrate that MMInstruct can significantly improve the performance of VLLMs, e.g., a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data will be available at https://github.com/yuecao0119/MMInstruct.

instruction tuning; multi-modal; multi-domain; dataset; vision large language model

Yangzhou LIU, Yue CAO, Zhangwei GAO, Weiyun WANG, Zhe CHEN, Wenhai WANG, Hao TIAN, Lewei LU, Xizhou ZHU, Tong LU, Yu QIAO, Jifeng DAI

School of Computer Science, Nanjing University, Nanjing 210023, China

School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Shanghai AI Laboratory, Shanghai 200232, China

School of Computer Science, Fudan University, Shanghai 200433, China

Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China

SenseTime Research, Shanghai 200233, China

Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

2024

Science China Information Sciences
Chinese Academy of Sciences

Indexed in: CSTPCD, EI
Impact factor: 0.715
ISSN: 1674-733X
Year, volume (issue): 2024, 67(12)