Semi-supervised Emotional Music Generation Method Based on Improved Gaussian Mixture Variational Autoencoders
Music conveys audio content and emotion through serialized audio features, and emotion is an important component of its semantic expression. Music generation technology should therefore model not only the structural information of music but also its emotional content. Most existing emotional music generation methods are fully supervised and depend on emotion labels. However, the music field lacks large, standardized emotion-labeled datasets, and emotion labels alone are insufficient to express the emotional features of music. To address these problems, this paper proposes Semg-GMVAE, a semi-supervised emotional music generation method based on improved Gaussian mixture variational autoencoders (GMVAE). The method links the rhythm and mode features of music to emotion, incorporates a feature disentanglement mechanism into GMVAE to learn latent-variable representations of these two features, and performs semi-supervised clustering inference on them. By manipulating the feature representations of music, the model can then generate music and switch emotion among four categories: happy, tense, sad, and calm. This paper also investigates experimentally why GMVAE has difficulty distinguishing data from different emotional categories. The key reason is that the variance regularization term and the mutual information suppression term in the evidence lower bound of GMVAE make the Gaussian components of the categories less dispersed, which degrades both the learned representations and the quality of generation. Semg-GMVAE therefore penalizes the former and augments the latter, and uses Transformer-XL as the encoder and decoder to strengthen its modeling of long music sequences. Experimental results on real data show that, compared with existing methods, Semg-GMVAE separates music with different emotions more clearly in the latent space, strengthens the correlation between music and emotion, effectively disentangles different music features, and ultimately achieves better emotional music generation and emotion switching by changing the feature representations.
Emotional music generation; Semi-supervised generative models; Disentangled representation learning; Gaussian mixture variational autoencoders; Transformer-XL
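The clustering inference and emotion switching described in the abstract can be illustrated with a minimal sketch. It is not the Semg-GMVAE implementation: it assumes a two-dimensional latent space, one diagonal Gaussian component per emotion with hand-picked means, a uniform prior over emotions, and a simple mean-shift rule for switching. The function names (`infer_emotion_posterior`, `switch_emotion`) and all numeric values are hypothetical, and the Transformer-XL encoder and decoder are omitted entirely.

```python
import numpy as np

# The four emotion categories named in the abstract.
EMOTIONS = ["happy", "tense", "sad", "calm"]


def component_log_density(z, means, log_vars):
    """Log density of z under each diagonal Gaussian component.

    z: shape (d,); means, log_vars: shape (K, d). Returns shape (K,).
    """
    diff = z - means
    return -0.5 * np.sum(
        log_vars + np.log(2.0 * np.pi) + diff ** 2 / np.exp(log_vars), axis=1
    )


def infer_emotion_posterior(z, means, log_vars, prior=None):
    """Posterior over components, p(y | z), via Bayes' rule (assumed uniform prior)."""
    k = means.shape[0]
    if prior is None:
        prior = np.full(k, 1.0 / k)
    log_joint = np.log(prior) + component_log_density(z, means, log_vars)
    log_joint -= log_joint.max()  # subtract the max for numerical stability
    probs = np.exp(log_joint)
    return probs / probs.sum()


def switch_emotion(z, means, source, target):
    """Hypothetical switching rule: keep the offset from the source mean,
    but re-centre it on the target component's mean."""
    return z - means[source] + means[target]


# Hand-picked, well-separated component means (illustrative values only).
means = np.array([[2.0, 2.0], [-2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])
log_vars = np.zeros((4, 2))  # unit variance for every component

z = np.array([1.8, 2.1])  # a latent code lying near the "happy" component
posterior = infer_emotion_posterior(z, means, log_vars)
source = int(np.argmax(posterior))

z_sad = switch_emotion(z, means, source, EMOTIONS.index("sad"))
switched = int(np.argmax(infer_emotion_posterior(z_sad, means, log_vars)))
print(EMOTIONS[source], "->", EMOTIONS[switched])
```

In this toy setting the posterior is sharp only because the component means are far apart relative to the variances. This mirrors the abstract's point that poorly dispersed Gaussian components make emotional categories hard to distinguish.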