Abstract: In this paper, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements. (1) Strong vision encoder: we explore a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred to and reused in different LLMs. (2) Dynamic high-resolution: we divide images into tiles of 448×448 pixels, with the number of tiles ranging from 1 to 40 according to the aspect ratio and resolution of the input image, supporting inputs up to 4K resolution. (3) High-quality bilingual dataset: we carefully collect a high-quality bilingual dataset covering common scenes and document images, and annotate it with English and Chinese question-answer pairs, significantly enhancing performance on optical character recognition (OCR) and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared with both open-source and proprietary commercial models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results on 8 of 18 multimodal benchmarks. Code and models are available at https://github.com/OpenGVLab/InternVL.
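A minimal sketch of the dynamic high-resolution tiling idea described above: pick a tile grid whose aspect ratio best matches the input image, resize, and split into 448×448 tiles. The exact grid-selection rule used by InternVL 1.5 may differ; this is an illustrative approximation, not the official preprocessing code.

```python
# Illustrative dynamic tiling: 1 to 40 tiles of 448x448, chosen by aspect ratio.
from PIL import Image

TILE = 448
MAX_TILES = 40

def pick_grid(width, height, max_tiles=MAX_TILES):
    """Pick a (cols, rows) grid whose aspect ratio best matches the image."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(cols / rows - target)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

def tile_image(img: Image.Image):
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    return tiles  # between 1 and 40 tiles of 448x448 pixels
```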
Abstract: Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of large multimodal models, such as GPT4V and Gemini, on various text-related visual tasks, including text recognition, scene text-centric visual question answering (VQA), document-oriented VQA, key information extraction (KIE), and handwritten mathematical expression recognition (HMER). To facilitate the assessment of optical character recognition (OCR) capabilities in large multimodal models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.
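A hypothetical sketch of a zero-shot evaluation loop in the spirit of the pipeline described above; the model interface, sample file layout, and containment-style scoring below are assumptions, while the real OCRBench data format and scoring rules live in the linked repository.

```python
# Toy OCR-task evaluation loop with an assumed model interface and data layout.
import json

def evaluate(model, samples):
    """`model(image_path, question) -> str` is an assumed interface."""
    correct = 0
    for s in samples:  # each sample assumed as {"image", "question", "answers"}
        pred = model(s["image"], s["question"]).lower()
        if any(ans.lower() in pred for ans in s["answers"]):
            correct += 1
    return correct / max(len(samples), 1)

if __name__ == "__main__":
    with open("ocr_samples.json") as f:            # hypothetical file name
        samples = json.load(f)
    dummy_model = lambda image, question: "example output"
    print(f"accuracy: {evaluate(dummy_model, samples):.3f}")
```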
Abstract: Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of vision large language models (VLLMs), existing visual instruction tuning datasets have the following limitations. (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by these advanced VLLMs may still suffer from inaccuracies such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may limit the model's ability to generate outputs that are diverse and close to real-world scenarios. To address these challenges, we construct MMInstruct, a high-quality, diverse visual instruction tuning dataset consisting of 973k instructions from 24 domains. There are four instruction types: judgment, multiple-choice, long visual question answering, and short visual question answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experimental validation and ablation studies, we demonstrate that MMInstruct can significantly improve the performance of VLLMs; for example, a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data will be available at https://github.com/yuecao0119/MMInstruct.
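A simplified, hypothetical sketch of the semi-automatic instruction generation engine described above. The functions `caption_with_gpt4v` and `generate_qa_with_gpt35` stand in for calls to GPT-4V and GPT-3.5; their prompts and interfaces are assumptions for illustration, not the paper's code.

```python
# Semi-automatic instruction generation: caption -> Q&A expansion -> manual review queue.
from dataclasses import dataclass

@dataclass
class Instruction:
    image: str
    question: str
    answer: str
    inst_type: str       # judgment / multiple-choice / long VQA / short VQA
    needs_review: bool

def build_instructions(image_path, caption_with_gpt4v, generate_qa_with_gpt35):
    caption = caption_with_gpt4v(image_path)        # step 1: detailed image caption
    qa_pairs = generate_qa_with_gpt35(caption)      # step 2: expand caption into typed Q&A pairs
    instructions = []
    for question, answer, inst_type in qa_pairs:    # step 3: queue everything for manual correction
        instructions.append(
            Instruction(image_path, question, answer, inst_type, needs_review=True)
        )
    return instructions
```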
Abstract: Hallucination is a big shadow hanging over the rapidly evolving multimodal large language models (MLLMs), referring to generated text that is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Just as woodpeckers heal trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable, since the intermediate outputs of the five stages are accessible. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.
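A minimal sketch of the five-stage post-remedy pipeline listed above. Each stage function is a placeholder for the corresponding Woodpecker component (for example, an LLM prompt or a visual expert); wiring them this way is an assumption for illustration only.

```python
# Five-stage training-free correction pipeline, with intermediate outputs kept
# so that each correction remains interpretable.
def woodpecker_correct(image, mllm_answer, stages):
    concepts = stages["extract_key_concepts"](mllm_answer)               # stage 1
    questions = stages["formulate_questions"](concepts)                  # stage 2
    knowledge = stages["validate_visual_knowledge"](image, questions)    # stage 3
    claims = stages["generate_visual_claims"](knowledge)                 # stage 4
    corrected = stages["correct_hallucinations"](mllm_answer, claims)    # stage 5
    trace = {"concepts": concepts, "questions": questions,
             "knowledge": knowledge, "claims": claims}
    return corrected, trace
```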
Abstract: In this work, we present DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images up to 2560×2560 resolution. Unlike existing studies that either struggle with high-resolution documents or abandon the large language model, thereby constraining their vision or language ability, DocPedia processes visual input directly in the frequency domain rather than in pixel space. This unique characteristic enables DocPedia to capture a greater amount of visual and textual information using a limited number of visual tokens. To consistently enhance both the perception and comprehension abilities of DocPedia, we develop a dual-stage training strategy and enrich the instructions/annotations of all training tasks, covering multiple document types. Extensive quantitative and qualitative experiments are conducted on various publicly available benchmarks, and the results confirm the mutual benefits of jointly learning perception and comprehension tasks. The results provide further evidence of the effectiveness and superior performance of DocPedia over other methods.
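A hedged illustration of frequency-domain input: block-wise 2D DCT coefficients of a document image. DocPedia's actual frequency transform and tokenization may differ; this only shows how frequency features can compress a high-resolution page into comparatively few descriptors.

```python
# Block-wise DCT features: keep a handful of low-order coefficients per block.
import numpy as np
from scipy.fft import dctn

def blockwise_dct(gray_image: np.ndarray, block=8, keep=16):
    """Split an (H, W) image into block x block tiles and keep the first `keep`
    DCT coefficients of each tile (row-major truncation here for brevity;
    a zig-zag ordering would isolate low frequencies more precisely)."""
    h, w = gray_image.shape
    h, w = h - h % block, w - w % block
    feats = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            coeffs = dctn(gray_image[r:r + block, c:c + block], norm="ortho")
            feats.append(coeffs.flatten()[:keep])   # low-order terms carry most energy
    return np.stack(feats)                          # (num_blocks, keep) compact features

# Example: a 2560x2560 page yields (2560/8)^2 = 102400 blocks of 16 coefficients.
```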
Abstract: Driven by the expansion of foundation models and the increasing variety of downstream tasks, parameter-efficient fine-tuning (PEFT) methods have exhibited remarkable efficacy in the unimodal domain, effectively mitigating the consumption of computational resources. Although recent research has shifted attention to the multimodal domain and achieved efficient parametric adaptation of large multimodal models (LMMs) for downstream tasks, existing methods still encounter two limitations: (1) low performance and (2) poor compatibility. This work proposes a modality-experts coordinated adaptation (ModeX) method for the multimodal domain, offering an effective, plug-and-play, and lightweight adaptation architecture for diverse LMMs. Specifically, ModeX adaptively coordinates different modality experts according to the type of network structure and input data. Besides, an effective coordinator equipped with a routing algorithm is developed to generate the corresponding weights, centering on leveraging the synergy among multimodal data. Extensive experiments on 15 multimodal downstream benchmarks and five LMMs demonstrate that ModeX seamlessly adapts to diverse LMMs, outperforms state-of-the-art PEFT methods, and even surpasses full fine-tuning. Notably, on the NLVR2 task, ModeX achieves 84.06% accuracy with only 12.0M trainable parameters, outperforming full fine-tuning by 1.63%. Moreover, ModeX demonstrates superior stability and higher training efficiency, in terms of both trainable parameters and training duration. Our source code has been released at https://github.com/zhangy0822/ModeX.
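A hypothetical sketch of a coordinator that routes between lightweight modality experts via learned softmax weights. The actual ModeX routing algorithm and expert design are defined in the paper and repository; this only illustrates the general plug-and-play gating idea.

```python
# Softmax-gated mixture of lightweight adapters inserted as a residual branch.
import torch
import torch.nn as nn

class ModalityExpertCoordinator(nn.Module):
    def __init__(self, dim, experts):
        super().__init__()
        self.experts = nn.ModuleList(experts)        # one lightweight adapter per modality
        self.gate = nn.Linear(dim, len(experts))     # routing network producing expert weights

    def forward(self, x):                            # x: (batch, tokens, dim)
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)    # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, num_experts, tokens, dim)
        mixed = (weights[:, :, None, None] * outputs).sum(dim=1)
        return x + mixed                             # residual, plug-and-play insertion

# Example: two tiny bottleneck adapters standing in for vision/text experts.
adapter = lambda d: nn.Sequential(nn.Linear(d, 16), nn.GELU(), nn.Linear(16, d))
layer = ModalityExpertCoordinator(dim=768, experts=[adapter(768), adapter(768)])
out = layer(torch.randn(2, 10, 768))
```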
Abstract: The explosive growth of immersive and metaverse services has driven the demand for extended reality (XR) transmissions across wireless networks. XR 360° video, captured by omnidirectional cameras and supported by interactive sensors, provides users with unique immersive experiences and real-time interactions. However, the ultra-high data rate and ultra-low latency requirements of XR 360° video transmission present new signal processing challenges for XR communication systems. This paper provides a comprehensive survey of promising physical-layer signal processing technologies for XR communications and systems, including multiple-antenna technologies for XR, mmWave/terahertz waves for XR communication, machine-learning-based XR transmission, and resource allocation for XR communications. Additionally, we propose a novel signal processing and transmission framework that fully exploits the space-time-frequency dimensions of virtual reality communications. Finally, we summarize the current technical challenges in signal processing for XR communications and related systems and discuss future trends in XR communications.
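A back-of-the-envelope estimate, under assumed parameters, of why 360° XR video demands ultra-high data rates. All figures below (resolution, frame rate, bit depth, compression ratio) are illustrative assumptions, not values taken from the survey.

```python
# Rough data-rate arithmetic for a 360° XR video stream under assumed parameters.
width, height = 7680, 3840      # assumed 8K equirectangular frame
fps = 90                        # assumed refresh rate for smooth XR interaction
bits_per_pixel = 24             # assumed raw RGB depth
compression = 100               # assumed overall codec compression ratio

raw_bps = width * height * bits_per_pixel * fps
compressed_bps = raw_bps / compression
print(f"raw: {raw_bps / 1e9:.1f} Gbit/s, compressed: {compressed_bps / 1e6:.0f} Mbit/s")
# With these assumptions: roughly 63.7 Gbit/s raw, about 640 Mbit/s after compression.
```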
Abstract: Large-scale GPU clusters are widely used to speed up both latency-critical (online) and best-effort (offline) deep learning (DL) workloads. However, following common practice, the DL clusters at ByteDance dedicate each GPU to a single workload or share workloads only in the time dimension, leading to very low GPU resource utilization. Existing techniques such as NVIDIA MPS provide an opportunity to share multiple workloads in space on widely deployed NVIDIA GPUs, but they cannot guarantee the performance of online workloads. We present MuxFlow, the first production system that scales over massive GPUs to support highly efficient space-sharing for DL workloads. MuxFlow introduces a two-level protection mechanism for both memory and computation to guarantee the performance of online workloads. MuxFlow leverages dynamic streaming multiprocessor (SM) allocation to improve the efficiency of offline workloads. Based on our practical error analysis, we design a mixed error-handling mechanism to improve system reliability. MuxFlow has been deployed at ByteDance on more than 18,000 GPUs. The deployment results indicate that MuxFlow substantially improves GPU utilization from 26% to 76%, SM activity from 16% to 33%, and GPU memory usage from 42% to 48%.
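A hedged sketch of capping an offline workload's SM share via NVIDIA MPS, in the spirit of the dynamic SM allocation described above. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is a real MPS knob, but the launch wrapper and toy allocation policy below are purely illustrative and are not MuxFlow's implementation.

```python
# Launch an offline job under MPS with a limited share of SMs, leaving headroom
# for the colocated online workload.
import os
import subprocess

def launch_offline_job(cmd, sm_percentage):
    """Run an offline DL job restricted to roughly sm_percentage% of SMs."""
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(cmd, env=env)

def adjust_sm_share(online_gpu_util):
    """Toy policy: give offline work what the online workload leaves idle,
    minus a 20% safety margin, never dropping below 10%."""
    return max(10, 100 - int(online_gpu_util) - 20)

proc = launch_offline_job(["python", "offline_train.py"],
                          adjust_sm_share(online_gpu_util=50))
```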