Journal Information
Science China Information Sciences (中国科学:信息科学(英文版))

Editor-in-Chief: Zhou Guangzhao (周光召)
Frequency: Monthly
ISSN: 1674-733X
Email: informatics@scichina.org
Tel: 010-64015683
Postal code: 100717
Address: 16 Donghuangchenggen North Street, Beijing (北京东黄城根北街16号)

Science China Information Sciences / 中国科学:信息科学(英文版) (indexed in CSCD, CSTPCD, EI, SCI)
Science China (《中国科学》) is a specialized natural-science academic journal sponsored by the Chinese Academy of Sciences and published by Science China Press. Its mission is to report the latest research achievements in all branches of the natural sciences in China and to promote academic exchange at home and abroad. Science China publishes, in the form of research articles, creative, high-level, and significant results of basic and applied research in China. It is highly regarded in the international academic community as a journal representing the highest level of Chinese scholarship, and its papers have been indexed for many years by SCI, the most authoritative international indexing service. In 1999 Science China won first place in the National Journal Awards.

    How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

    Zhe CHEN, Weiyun WANG, Hao TIAN, Shenglong YE, ...
    pp. 1-18
    Abstract: In this paper, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements. (1) Strong vision encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and making it transferable and reusable across different LLMs. (2) Dynamic high-resolution: we divide images into 1 to 40 tiles of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-quality bilingual dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in optical character recognition (OCR) and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary commercial models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 multimodal benchmarks. Code and models are available at https://github.com/OpenGVLab/InternVL.
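    The dynamic high-resolution step lends itself to a small tiling sketch. The snippet below is an illustrative reading of the abstract only, not the authors' released code: the 448-pixel tile size and the 1-to-40 tile budget come from the text, while the grid search that matches the input aspect ratio is an assumption.

```python
# Hypothetical sketch of the dynamic high-resolution tiling described in the
# InternVL 1.5 abstract: split an image into 1-40 tiles of 448x448 pixels,
# choosing the tile grid whose aspect ratio best matches the input image.
from PIL import Image

TILE = 448
MAX_TILES = 40

def choose_grid(width: int, height: int) -> tuple[int, int]:
    """Pick (cols, rows) with cols * rows <= MAX_TILES closest to the image aspect ratio."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, MAX_TILES + 1):
        for rows in range(1, MAX_TILES // cols + 1):
            diff = abs(cols / rows - target)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

def tile_image(img: Image.Image) -> list[Image.Image]:
    """Resize to the chosen grid and cut out 448x448 tiles."""
    cols, rows = choose_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
```

    Under this sketch, a 3840×2160 input maps to a 7×4 grid of 28 tiles, which stays within the 40-tile budget.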

    OCRBench: on the hidden mystery of OCR in large multimodal models

    Yuliang LIU, Zhang LI, Mingxin HUANG, Biao YANG, ...
    pp. 19-31
    Abstract: Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of large multimodal models, such as GPT4V and Gemini, in various text-related visual tasks including text recognition, scene text-centric visual question answering (VQA), document-oriented VQA, key information extraction (KIE), and handwritten mathematical expression recognition (HMER). To facilitate the assessment of optical character recognition (OCR) capabilities in large multimodal models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.
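    Benchmarks of this kind are usually scored by checking model answers against ground-truth strings. The loop below is a generic illustration of such scoring; the JSON record layout and the containment-based matching rule are assumptions made here for readability, not a description of the released OCRBench format.

```python
# Generic scoring loop for an OCR-oriented VQA benchmark. The record layout
# {"answers": [...], "prediction": ...} and the containment-based matching rule
# are illustrative assumptions, not the OCRBench implementation.
import json

def normalize(text: str) -> str:
    """Lowercase and drop whitespace so matching is insensitive to formatting."""
    return "".join(text.lower().split())

def accuracy(predictions_path: str) -> float:
    with open(predictions_path, encoding="utf-8") as f:
        records = json.load(f)
    hits = sum(
        any(normalize(ans) in normalize(rec["prediction"]) for ans in rec["answers"])
        for rec in records
    )
    return hits / max(len(records), 1)
```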

    MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity

    Yangzhou LIU, Yue CAO, Zhangwei GAO, Weiyun WANG, ...
    pp. 32-47
    Abstract: Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of vision large language models (VLLMs), existing visual instruction tuning datasets have the following limitations. (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may impair the model's ability to generate diverse outputs that are close to real-world scenarios. To address these challenges, we construct MMInstruct, a high-quality, diverse visual instruction tuning dataset that consists of 973k instructions from 24 domains. There are four instruction types: judgment, multiple-choice, long visual question answering, and short visual question answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction; it enables semi-automatic, low-cost, multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive validation and ablation experiments, we demonstrate that MMInstruct significantly improves the performance of VLLMs, e.g., the model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data will be available at https://github.com/yuecao0119/MMInstruct.
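    The data engine described above pairs an instruction-generating vision-language model with human correction. The sketch below only illustrates that workflow; call_vlm, the prompt wording, and the '|' output convention are hypothetical stand-ins, and the real engine's use of GPT-4V/GPT-3.5 and its review process are not reproduced here.

```python
# Sketch of a semi-automatic instruction-generation loop in the spirit of the
# MMInstruct data engine. `call_vlm` is a hypothetical stand-in for a
# vision-language model API; prompts and output parsing are illustrative only.
from dataclasses import dataclass

INSTRUCTION_TYPES = ["judgment", "multiple-choice", "long VQA", "short VQA"]

@dataclass
class Instruction:
    image_path: str
    question: str
    answer: str
    kind: str

def call_vlm(image_path: str, prompt: str) -> str:
    """Placeholder: plug in a real VLM client here."""
    raise NotImplementedError

def generate_for_image(image_path: str, domain: str) -> list[Instruction]:
    """Draft one instruction of each type for an image; drafts then go to human review."""
    drafts = []
    for kind in INSTRUCTION_TYPES:
        prompt = (
            f"You are annotating an image from the {domain} domain. "
            f"Write one {kind} question about the image and its answer, "
            f"separated by a single '|' character."
        )
        question, _, answer = call_vlm(image_path, prompt).partition("|")
        drafts.append(Instruction(image_path, question.strip(), answer.strip(), kind))
    return drafts
```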

    Woodpecker: hallucination correction for multimodal large language models

    Shukang YIN, Chaoyou FU, Sirui ZHAO, Tong XU, ...
    pp. 48-60
    Abstract: Hallucination is a big shadow hanging over the rapidly evolving multimodal large language models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable through the intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.
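    The five stages named in the abstract compose naturally into a pipeline. The outline below only mirrors that structure; each stage is a stub standing in for an LLM, detector, or VQA call, and none of it is taken from the released implementation.

```python
# Outline of a Woodpecker-style, training-free post-remedy pipeline. Only the
# five-stage structure follows the abstract; the stage bodies are stubs.

def extract_key_concepts(answer: str) -> list[str]:
    raise NotImplementedError  # stage 1: key concept extraction

def formulate_questions(concepts: list[str]) -> list[str]:
    raise NotImplementedError  # stage 2: question formulation

def validate_visual_knowledge(image_path: str, questions: list[str]) -> dict[str, str]:
    raise NotImplementedError  # stage 3: visual knowledge validation

def generate_visual_claims(evidence: dict[str, str]) -> list[str]:
    raise NotImplementedError  # stage 4: visual claim generation

def correct_hallucinations(answer: str, claims: list[str]) -> str:
    raise NotImplementedError  # stage 5: hallucination correction

def correct(image_path: str, answer: str) -> str:
    """Run the five stages in order; intermediate outputs can be logged for interpretability."""
    concepts = extract_key_concepts(answer)
    questions = formulate_questions(concepts)
    evidence = validate_visual_knowledge(image_path, questions)
    claims = generate_visual_claims(evidence)
    return correct_hallucinations(answer, claims)
```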

    DocPedia: unleashing the power of large multimodal model in the frequency domain for versatile document understanding

    Hao FENG, Qi LIU, Hao LIU, Jingqun TANG, ...
    pp. 61-74
    Abstract: In this work, we present DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images up to 2560 × 2560 resolution. Unlike existing studies that either struggle with high-resolution documents or give up the large language model and thereby constrain their vision or language ability, our DocPedia directly processes visual input in the frequency domain rather than the pixel space. This unique characteristic enables DocPedia to capture a greater amount of visual and textual information using a limited number of visual tokens. To consistently enhance both the perception and comprehension abilities of DocPedia, we develop a dual-stage training strategy and enrich the instructions/annotations of all training tasks, covering multiple document types. Extensive quantitative and qualitative experiments conducted on various publicly available benchmarks confirm the mutual benefits of jointly learning perception and comprehension tasks. The results provide further evidence of the effectiveness and superior performance of DocPedia over other methods.
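    One common way to move a document image into the frequency domain is a blockwise 2-D DCT, as in JPEG. The snippet below is only meant to make the idea of compact frequency-domain tokens concrete; the block size, the number of coefficients kept, and the choice of transform are assumptions for illustration, not DocPedia's actual encoder.

```python
# Illustrative frequency-domain featurization of a grayscale document image via
# blockwise 2-D DCT (JPEG-style). This sketches the general idea of compact
# frequency tokens only; it is not DocPedia's encoder.
import numpy as np
from scipy.fft import dctn

def blockwise_dct_tokens(gray: np.ndarray, block: int = 8, keep: int = 16) -> np.ndarray:
    """Split the image into block x block patches, DCT each patch, and keep the
    `keep` lowest-frequency coefficients as a compact per-patch token."""
    h, w = gray.shape
    h, w = h - h % block, w - w % block          # crop to a multiple of the block size
    tokens = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            patch = gray[y:y + block, x:x + block].astype(np.float32)
            coeffs = dctn(patch, norm="ortho")
            # proper zig-zag ordering omitted; take the low-frequency corner instead
            tokens.append(coeffs[:4, :4].reshape(-1)[:keep])
    return np.stack(tokens)                      # shape: (num_patches, keep)
```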

    Modality-experts coordinated adaptation for large multimodal models

    Yan ZHANG, Zhong JI, Yanwei PANG, Jungong HAN, ...
    pp. 75-92
    Abstract: Driven by the expansion of foundation models and the increasing variety of downstream tasks, parameter-efficient fine-tuning (PEFT) methods have exhibited remarkable efficacy in the unimodal domain, effectively mitigating the consumption of computational resources. Although recent research has shifted attention to the multimodal domain and achieved efficient parametric adaptation of large multimodal models (LMMs) for downstream tasks, it still encounters two limitations: (1) low performance; (2) poor compatibility. This work proposes a modality-experts coordinated adaptation (ModeX) method for the multimodal domain, offering an effective, plug-and-play, and lightweight adaptation architecture for diverse LMMs. Specifically, ModeX adaptively coordinates different modality experts in terms of the type of network structure and input data. Besides, an effective coordinator equipped with a routing algorithm is developed to generate the corresponding weights, centering on leveraging the synergy among multimodal data. Extensive experiments on 15 multimodal downstream benchmarks and five LMMs demonstrate that ModeX seamlessly adapts to diverse LMMs, outperforms state-of-the-art PEFT methods, and even exhibits superior performance compared with full fine-tuning. Notably, on the NLVR2 task, ModeX achieves 84.06% accuracy with only 12.0M trainable parameters, outperforming full fine-tuning by 1.63%. Moreover, ModeX demonstrates superior stability and higher training efficiency, both in terms of trainable parameters and training duration. Our source code has been released at https://github.com/zhangy0822/ModeX.
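    The description of a coordinator that routes among modality experts and produces mixing weights can be pictured with a toy module. The sketch below is an assumption-laden illustration: the adapter-style experts, the softmax gating, and all dimensions are placeholders, not the architecture from the paper.

```python
# Toy sketch of a coordinator that routes among modality experts and produces
# per-token mixing weights, in the spirit of the ModeX description above.
import torch
import torch.nn as nn

class ModalityExpertRouter(nn.Module):
    def __init__(self, dim: int, num_experts: int, bottleneck: int = 64):
        super().__init__()
        # lightweight adapter-style experts: down-project, nonlinearity, up-project
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
            for _ in range(num_experts)
        ])
        self.coordinator = nn.Linear(dim, num_experts)  # produces routing logits

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, dim)
        weights = torch.softmax(self.coordinator(hidden), dim=-1)            # (B, T, E)
        expert_out = torch.stack([e(hidden) for e in self.experts], dim=-1)  # (B, T, D, E)
        mixed = (expert_out * weights.unsqueeze(-2)).sum(dim=-1)             # (B, T, D)
        return hidden + mixed  # residual, so the frozen LMM path is preserved
```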

    COMET: "cone of experience" enhanced large multimodal model for mathematical problem generation

    Sannyuya LIU, Jintian FENG, Zongkai YANG, Yawei LUO, ...
    pp. 93-94

    ChemDFM-X: towards large multimodal model for chemistry

    Zihan ZHAO, Bo CHEN, Jingpiao LI, Lu CHEN, ...
    pp. 95-96

    Physical layer signal processing for XR communications and systems

    Yongpeng WU, Mai XU, Guangtao ZHAI, Wenjun ZHANG, ...
    pp. 97-118
    Abstract: The explosive growth of immersive and metaverse services has driven the demand for extended reality (XR) transmissions across wireless networks. XR 360° video, captured by omnidirectional cameras and supported by interactive sensors, provides users with unique immersive experiences and real-time interactions. However, the ultra-high data rate and ultra-low latency requirements of XR 360° video transmission present new signal processing challenges for XR communication systems. This paper provides a comprehensive survey of promising physical layer signal processing technologies for XR communications and systems, including multiple-antenna technologies for XR, mmWave/terahertz waves for XR communication, machine-learning-based XR transmission, and resource allocation for XR communications. Additionally, we propose a novel signal processing and transmission framework that fully exploits the space-time-frequency dimensions of virtual reality communications. Finally, we summarize the current technical challenges in signal processing for XR communications and related systems and discuss future trends in XR communications.

    MuxFlow: efficient GPU sharing in production-level clusters with more than 10000 GPUs

    Xuanzhe LIU, Yihao ZHAO, Shufan LIU, Xiang LI, ...
    pp. 119-135
    Abstract: Large-scale GPU clusters are widely used to speed up both latency-critical (online) and best-effort (offline) deep learning (DL) workloads. However, following common practice, the DL clusters at ByteDance dedicate each GPU to one workload or share workloads only in the time dimension, leading to very low GPU resource utilization. Existing techniques like NVIDIA MPS provide an opportunity to share multiple workloads in space on widely deployed NVIDIA GPUs, but they cannot guarantee the performance of online workloads. We present MuxFlow, the first production system that can scale over massive GPUs to support highly efficient space-sharing for DL workloads. MuxFlow introduces a two-level protection mechanism for both memory and computation to guarantee the performance of online workloads, and leverages dynamic streaming multiprocessor (SM) allocation to improve the efficiency of offline workloads. Based on our practical error analysis, we design a mixed error-handling mechanism to improve system reliability. MuxFlow has been deployed at ByteDance on more than 18000 GPUs. The deployment results indicate that MuxFlow substantially improves GPU utilization from 26% to 76%, SM activity from 16% to 33%, and GPU memory usage from 42% to 48%.
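    As a rough picture of how space-sharing can be steered on current NVIDIA GPUs, the sketch below caps the SM share of a best-effort job through the MPS environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE, choosing the cap from observed GPU utilization. The feedback rule, thresholds, and script name are assumptions for illustration; MuxFlow's actual two-level protection and dynamic SM allocation are considerably more involved.

```python
# Hand-wavy controller sketch for GPU space-sharing: grant a best-effort job a
# smaller SM share when the GPU is already busy with latency-critical work.
# Requires the NVIDIA MPS control daemon to be running for the cap to apply.
import os
import subprocess

def gpu_utilization(index: int = 0) -> int:
    """Read current GPU utilization (percent) via nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi", f"--id={index}",
        "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

def launch_offline_job(cmd: list[str], online_util: int) -> subprocess.Popen:
    """Start a best-effort job with an SM cap that leaves headroom for online work."""
    sm_share = max(10, 90 - online_util)
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(sm_share))
    return subprocess.Popen(cmd, env=env)

if __name__ == "__main__":
    util = gpu_utilization(0)
    launch_offline_job(["python", "offline_train.py"], util)  # hypothetical offline script
```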