At present, multimodal speech emotion recognition (SER) datasets are small in scale yet carry a large amount of information, so models underfit the individual modalities and fail to mine the information behind the data. To address this problem, a multimodal speech emotion classification network based on contrastive learning is proposed. On the one hand, skip connections (SC) are used in the network to effectively alleviate network degradation. On the other hand, a new loss formulation based on contrastive learning (CL) theory is proposed to speed up model convergence. The model is evaluated on the IEMOCAP dataset, achieving an unweighted accuracy (UA) of 82.68% and a weighted accuracy (WA) of 82.35%. The experimental results demonstrate the superiority of the proposed model.
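The two components named above can be illustrated with a minimal sketch. The abstract does not specify the architecture or the exact loss, so the following is an illustrative assumption: a residual (skip-connection) block of the form output = f(x) + x, and an NT-Xent-style contrastive loss over paired embeddings, a common instantiation of contrastive learning. The function names `skip_block` and `nt_xent_loss` are hypothetical, not from the paper.

```python
import numpy as np

def skip_block(x, W):
    """Residual block sketch: a ReLU-activated linear map plus an
    identity shortcut, so gradients can bypass the transformation."""
    return np.maximum(x @ W, 0.0) + x

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent-style contrastive loss (illustrative assumption).
    z1, z2: (N, d) embeddings of two views; row i of z1 and row i
    of z2 form a positive pair, all other rows act as negatives."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -1e9)                       # mask self-pairs
    n = z1.shape[0]
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Numerically stable log-softmax, then cross-entropy at the
    # positive index for each anchor.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Pulling positive pairs together in embedding space while pushing negatives apart gives the model a training signal beyond the class labels alone, which is one plausible reading of how such a loss could speed up fitting on a small dataset.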