Science China Information Sciences, 2024, Vol. 67, Issue 12: 75-92. DOI: 10.1007/s11432-024-4234-4

Modality-experts coordinated adaptation for large multimodal models

Yan ZHANG 1, Zhong JI 2, Yanwei PANG 2, Jungong HAN 3, Xuelong LI 4

Author information

  • 1. School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, Tianjin University, Tianjin 300072, China
  • 2. School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, Tianjin University, Tianjin 300072, China; Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
  • 3. Department of Automation, Tsinghua University, Beijing 100084, China
  • 4. Institute of Artificial Intelligence (TeleAI), China Telecom Corporation Limited, Beijing 100033, China

Abstract

Driven by the expansion of foundation models and the increasing variety of downstream tasks, parameter-efficient fine-tuning (PEFT) methods have exhibited remarkable efficacy in the unimodal domain, effectively mitigating the consumption of computational resources. Although recent research has shifted attention to the multimodal domain and achieved efficient parametric adaptation of large multimodal models (LMMs) for downstream tasks, existing methods still encounter two limitations: (1) low performance and (2) poor compatibility. This work proposes a modality-experts coordinated adaptation (ModeX) method for the multimodal domain, offering an effective, plug-and-play, and lightweight adaptation architecture for diverse LMMs. Specifically, ModeX adaptively coordinates different modality experts according to the type of network structure and input data. In addition, an effective coordinator equipped with a routing algorithm is developed to generate the corresponding expert weights, with a focus on leveraging the synergy among multimodal data. Extensive experiments on 15 multimodal downstream benchmarks and five LMMs demonstrate that ModeX seamlessly adapts to diverse LMMs, outperforms state-of-the-art PEFT methods, and even exhibits superior performance compared with full fine-tuning. Notably, on the NLVR2 task, ModeX achieves 84.06% accuracy with only 12.0M trainable parameters, outperforming full fine-tuning by 1.63%. Moreover, ModeX demonstrates superior stability and higher training efficiency, in terms of both trainable parameters and training duration. Our source code has been released at https://github.com/zhangy0822/ModeX.
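
The abstract outlines the core mechanism: a coordinator with a routing algorithm generates weights that combine several modality experts inside a frozen LMM. As a minimal illustrative sketch (not the authors' released implementation; see the repository above for that), the following PyTorch snippet shows one common way such a mixture-of-modality-experts adapter can be structured. The bottleneck experts, the softmax router, and all dimensions here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckExpert(nn.Module):
    """A lightweight adapter expert: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.gelu(self.down(x)))

class ModalityExpertsAdapter(nn.Module):
    """Hypothetical sketch of a modality-experts adapter.

    A small router produces per-token weights over a set of experts
    (e.g., a vision expert, a language expert, and a shared expert);
    the weighted sum of expert outputs is added back to the frozen
    backbone features through a residual connection.
    """
    def __init__(self, dim: int, num_experts: int = 3, bottleneck: int = 64):
        super().__init__()
        self.experts = nn.ModuleList(
            BottleneckExpert(dim, bottleneck) for _ in range(num_experts)
        )
        # Router: maps each token's features to per-expert weights (assumed form).
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) hidden states from a frozen LMM layer
        weights = F.softmax(self.router(x), dim=-1)                     # (B, L, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, L, D, E)
        mixed = (expert_out * weights.unsqueeze(2)).sum(dim=-1)         # (B, L, D)
        return x + mixed  # residual keeps the backbone features intact

# Usage: insert after a transformer block; train only the adapter parameters.
adapter = ModalityExpertsAdapter(dim=768)
hidden = torch.randn(2, 16, 768)
print(adapter(hidden).shape)  # torch.Size([2, 16, 768])
```

In a PEFT setting of this kind, the backbone stays frozen and only the adapter parameters (roughly 2 × dim × bottleneck per expert plus the router) are trained, which is how such methods keep trainable-parameter counts in the low tens of millions.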

Key words

large multimodal model / multimodal learning / vision-language pretraining / parameter-efficient fine-tuning / adapter / modality expert
