
Modality-experts coordinated adaptation for large multimodal models

Driven by the expansion of foundation models and the increasing variety of downstream tasks, parameter-efficient fine-tuning (PEFT) methods have exhibited remarkable efficacy in the unimodal domain, effectively mitigating the consumption of computational resources. Although recent research has shifted attention to the multimodal domain and achieved efficient parametric adaptation of large multimodal models (LMMs) for downstream tasks, existing methods still encounter two limitations: (1) low performance and (2) poor compatibility. This work proposes a modality-experts coordinated adaptation (ModeX) method for the multimodal domain, offering an effective, plug-and-play, and lightweight adaptation architecture for diverse LMMs. Specifically, ModeX adaptively coordinates different modality experts in terms of the types of network structure and input data. Besides, an effective coordinator equipped with a routing algorithm is developed for generating the corresponding weights, centering on leveraging the synergy among multimodal data. Extensive experiments on 15 multimodal downstream benchmarks and five LMMs demonstrate that ModeX seamlessly adapts to diverse LMMs, outperforms state-of-the-art PEFT methods, and even surpasses full fine-tuning. Notably, on the NLVR2 task, ModeX achieves 84.06% accuracy with only 12.0M trainable parameters, outperforming full fine-tuning by 1.63%. Moreover, ModeX demonstrates superior stability and higher training efficiency, in terms of both trainable parameters and training duration. Our source code has been released at https://github.com/zhangy0822/ModeX.
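The coordinator-with-routing idea described above can be illustrated with a minimal, self-contained sketch: a gating function scores each modality expert from the input features, normalizes the scores with a softmax, and mixes the expert outputs by those weights. All names here (`route`, `modex_adapter`, the toy experts and gate vectors) are illustrative assumptions, not the authors' released implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(features, gate_vectors):
    # Hypothetical coordinator: one score per expert, taken as the dot
    # product between the input features and that expert's gating vector,
    # then normalized so the expert weights sum to 1.
    scores = [sum(f * g for f, g in zip(features, gate)) for gate in gate_vectors]
    return softmax(scores)

def modex_adapter(features, experts, gate_vectors):
    # Weighted combination of expert outputs, one weight per expert.
    weights = route(features, gate_vectors)
    outputs = [expert(features) for expert in experts]
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]

# Toy example: two stand-in "modality experts" acting on a 2-d feature vector.
vision_expert = lambda x: [2 * v for v in x]   # placeholder expert
text_expert = lambda x: [v + 1 for v in x]     # placeholder expert
y = modex_adapter([1.0, 0.5], [vision_expert, text_expert],
                  [[1.0, 0.0], [0.0, 1.0]])
```

In a real adapter, the experts and gating vectors would be small trainable modules inserted into a frozen LMM; the sketch only shows the routing-and-mixing mechanism the abstract refers to.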

large multimodal model, multimodal learning, vision-language pretraining, parameter-efficient fine-tuning, adapter, modality expert

Yan ZHANG, Zhong JI, Yanwei PANG, Jungong HAN, Xuelong LI


School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, Tianjin University, Tianjin 300072, China

Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China

Department of Automation, Tsinghua University, Beijing 100084, China

Institute of Artificial Intelligence (TeleAI), China Telecom Corporation Limited, Beijing 100033, China


2024

SCIENCE CHINA Information Sciences
Chinese Academy of Sciences

Indexed in: CSTPCD, EI
Impact factor: 0.715
ISSN: 1674-733X
Year, Volume (Issue): 2024, 67(12)