A low-rank cross-modal Transformer for multimodal sentiment analysis
Multimodal sentiment analysis, which extends text-based affective computing to multimodal contexts with visual and speech modalities, is an emerging research area. In the pretrain-finetune paradigm, fine-tuning large pretrained language models is necessary for good performance on multimodal sentiment analysis. However, fine-tuning large-scale pretrained language models remains prohibitively expensive, and insufficient cross-modal interaction also hinders performance. Therefore, a low-rank cross-modal Transformer (LRCMT) is proposed to address these limitations. Inspired by the low-rank parameter updates exhibited by large pretrained models when adapting to natural language tasks, LRCMT injects trainable low-rank matrices into frozen layers, significantly reducing the number of trainable parameters while still allowing dynamic word representations. Moreover, a cross-modal module is designed in which the visual and speech modalities interact before fusing with the text. Extensive experiments on benchmarks demonstrate LRCMT's efficiency and effectiveness: it achieves comparable or better performance than full fine-tuning while tuning only ~0.76% of the parameters. Furthermore, it obtains state-of-the-art or competitive results on multiple metrics. Ablations validate that both low-rank fine-tuning and sufficient cross-modal interaction contribute to LRCMT's strong performance. This work reduces the fine-tuning cost and provides insights into efficient and effective cross-modal fusion.
multimodal sentiment analysis; pretrained language model; cross-modal transformer
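The parameter-efficient idea described in the abstract, freezing the pretrained weights and injecting trainable low-rank matrices, can be illustrated with a minimal LoRA-style sketch. The PyTorch snippet below is only an illustration under assumed settings; the rank, scaling factor, and module names are hypothetical and do not reflect the authors' released LRCMT implementation.

```python
# Minimal sketch of low-rank injection into a frozen linear layer
# (illustrative only; rank r, scaling alpha, and class names are assumptions,
# not the authors' LRCMT code).
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """Wraps a frozen pretrained linear layer and adds a trainable
    low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze pretrained weights
            p.requires_grad = False
        # only A and B are trainable; B starts at zero so training begins
        # from the original pretrained behavior
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus trainable low-rank path
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


if __name__ == "__main__":
    layer = LowRankLinear(nn.Linear(768, 768), r=8)
    out = layer(torch.randn(2, 16, 768))        # (batch, seq_len, hidden)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(out.shape, f"trainable fraction: {trainable / total:.2%}")
```

Because only the two small matrices A and B receive gradients, the trainable-parameter count stays a small fraction of the frozen layer's size, which is the mechanism behind the ~0.76% figure reported in the abstract (the exact fraction depends on the rank and on which layers are adapted).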