Multimodal sentiment analysis aims to extract and integrate semantic information from text, image, and audio data in order to identify the emotional states of speakers in online videos. Although multimodal fusion methods have achieved definite results in this research area, previous studies have not adequately addressed the distribution differences between modalities or the fusion of relational knowledge. To address these issues, this study proposes a multimodal sentiment analysis method built around a Multimodal Prompt Gate (MPG) module. The proposed module converts nonverbal information into prompts that fuse the surrounding context, filters the noise of nonverbal signals using text information, and obtains prompts containing rich semantic information to enhance information integration between the modalities. In addition, an instance-to-label contrastive learning framework is proposed to distinguish different labels in the latent space at the semantic level and further optimize the model output. Experiments are conducted on three large-scale sentiment analysis datasets. The results show that the binary classification accuracy of the proposed method improves by approximately 0.7% over the next-best model, and the ternary classification accuracy improves by more than 2.5%, reaching 0.671. This method can serve as a reference for applying multimodal sentiment analysis to user profiling, video understanding, and AI interviewing.
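To make the two ideas in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of a text-conditioned gate that filters nonverbal features into fused prompts, together with an instance-to-label contrastive loss. The class name MultimodalPromptGate, the feature dimensions, the learnable label embeddings, and the loss function are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalPromptGate(nn.Module):
    """Gate audio/visual features with text features to build fused prompt vectors."""

    def __init__(self, text_dim: int, nonverbal_dim: int, prompt_dim: int):
        super().__init__()
        self.gate = nn.Linear(text_dim + nonverbal_dim, nonverbal_dim)
        self.proj = nn.Linear(text_dim + nonverbal_dim, prompt_dim)

    def forward(self, text_feat: torch.Tensor, nonverbal_feat: torch.Tensor) -> torch.Tensor:
        # Text-conditioned gate suppresses noisy nonverbal dimensions.
        gate = torch.sigmoid(self.gate(torch.cat([text_feat, nonverbal_feat], dim=-1)))
        filtered = gate * nonverbal_feat
        # Fuse the filtered nonverbal signal with text into a prompt vector.
        return self.proj(torch.cat([text_feat, filtered], dim=-1))


def instance_to_label_contrastive_loss(
    features: torch.Tensor,
    label_embeddings: torch.Tensor,
    labels: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """Pull each instance toward its own label embedding and away from the others."""
    feats = F.normalize(features, dim=-1)
    protos = F.normalize(label_embeddings, dim=-1)
    logits = feats @ protos.t() / temperature  # (batch, num_labels) similarity scores
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy usage: 8 samples, BERT-sized text features, 74-dim audio features, 3 sentiment labels.
    mpg = MultimodalPromptGate(text_dim=768, nonverbal_dim=74, prompt_dim=768)
    text = torch.randn(8, 768)
    audio = torch.randn(8, 74)
    prompts = mpg(text, audio)  # (8, 768)
    label_emb = torch.randn(3, 768, requires_grad=True)
    loss = instance_to_label_contrastive_loss(prompts, label_emb, torch.randint(0, 3, (8,)))
    print(prompts.shape, loss.item())

The gate mirrors the abstract's description at a high level: text acts as a filter on nonverbal noise before fusion, and the contrastive term separates samples by label in the shared latent space; the paper's actual architecture and training objective may differ in detail.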