Confusion among similar emotion categories degrades recognition performance and remains a challenge for multimodal emotion recognition. To address this problem, a relational graph convolutional network modeling approach based on clustering group normalization is proposed. First, features of the three modalities are extracted with three different feature extractors and concatenated together with a speaker encoding, which enriches the feature representation while preserving the original information. Second, contextual information is extracted with a Transformer. Finally, after the feature nodes are fed into the relational graph convolutional network, the nodes are grouped by clustering and each group is normalized independently, making similar nodes more alike and alleviating the difficulty of delimiting similar emotions. Experimental validation shows that the model reaches an F1-score of 86.34% on the four-class IEMOCAP task, which verifies the effectiveness of the proposed method; at present, this is the best performance reported on this dataset.
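To make the clustering group normalization step concrete, the following is a minimal sketch (not the authors' released code): node features output by a graph layer are partitioned by k-means, and each cluster is normalized independently so that members of the same cluster are pulled closer together. The function name, the cluster count `n_clusters`, and `eps` are illustrative assumptions.

```python
# Minimal sketch of cluster-based group normalization over graph node features.
# Assumed, not the paper's implementation: function name, n_clusters, eps.
import torch
from sklearn.cluster import KMeans


def cluster_group_normalize(node_feats: torch.Tensor,
                            n_clusters: int = 4,
                            eps: float = 1e-5) -> torch.Tensor:
    """Normalize each k-means cluster of node features independently.

    node_feats: (num_nodes, feat_dim) tensor of node embeddings,
    e.g. the output of a relational graph convolution layer.
    """
    # Cluster nodes in feature space (k-means here is an illustrative choice).
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        node_feats.detach().cpu().numpy())
    labels = torch.as_tensor(labels, device=node_feats.device)

    out = torch.empty_like(node_feats)
    for c in range(n_clusters):
        mask = labels == c
        group = node_feats[mask]
        mean = group.mean(dim=0, keepdim=True)
        std = group.std(dim=0, keepdim=True)
        # Per-cluster normalization: each group gets its own statistics.
        out[mask] = (group - mean) / (std + eps)
    return out


if __name__ == "__main__":
    feats = torch.randn(100, 256)          # stand-in for RGCN node outputs
    normed = cluster_group_normalize(feats)
    print(normed.shape)                    # torch.Size([100, 256])
```

Normalizing within clusters rather than across the whole batch is what tightens each emotion-like group of nodes, which is the effect the abstract attributes to the method.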