Abstract
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements. (1) Strong vision encoder: we explored a continuous learning strategy for the large-scale vision foundation model—InternViT-6B—boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic high-resolution: we divide images into 448×448-pixel tiles, numbering from 1 to 40 according to the aspect ratio and resolution of the input image, which supports inputs up to 4K resolution. (3) High-quality bilingual dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in optical character recognition (OCR) and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary commercial models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results on 8 of 18 multimodal benchmarks. Code and models are available at https://github.com/OpenGVLab/InternVL.
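The dynamic high-resolution idea in point (2) can be sketched as a tile-grid search. The function below is a hypothetical simplification (the function name and tie-breaking rule are our assumptions, not the paper's exact algorithm): among all grids of 448×448 tiles whose total count stays within the budget of 40, it picks the grid whose aspect ratio best matches the input image.

```python
def pick_tiling(width, height, tile=448, max_tiles=40):
    """Hypothetical sketch of dynamic high-resolution tiling:
    choose a (cols, rows) grid of `tile`-sized patches, with
    1 <= cols * rows <= max_tiles, whose aspect ratio is closest
    to that of the input image."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(cols / rows - target)
            # Prefer a closer aspect-ratio match; break ties toward
            # more tiles (finer effective resolution) within budget.
            if diff < best_diff or (
                diff == best_diff and cols * rows > best[0] * best[1]
            ):
                best, best_diff = (cols, rows), diff
    return best  # the image is then resized to (cols*tile, rows*tile) and split
```

For a 896×448 image this yields an 8×4 grid (32 tiles), preserving the 2:1 aspect ratio. A production implementation would likely also weigh the original image area so that small images are not over-tiled; that refinement is omitted here for brevity.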