Abstract
Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of vision large language models (VLLMs), existing visual instruction tuning datasets suffer from the following limitations. (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by these advanced models may still contain inaccuracies, such as hallucinations. (2) Instruction and image diversity: a limited range of instruction types and a lack of diversity in image data may impair the model's ability to generate diverse outputs aligned with real-world scenarios. To address these challenges, we construct MMInstruct, a high-quality, diverse visual instruction tuning dataset consisting of 973K instructions from 24 domains. It covers four instruction types: judgment, multiple-choice, long visual question answering, and short visual question answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. This engine enables semi-automatic, low-cost, multi-domain instruction generation at 1/6 the cost of purely manual construction. Through extensive experiments and ablation studies, we demonstrate that MMInstruct significantly improves the performance of VLLMs: a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data are available at https://github.com/yuecao0119/MMInstruct.