
Semi-supervised Emotional Music Generation Method Based on Improved Gaussian Mixture Variational Autoencoders

Music conveys sonic content and emotion through sequential audio information. Emotion is an important component of the semantics that music expresses, so music generation should consider not only the structural information of music but also its emotional content. Most existing emotional music generation methods are fully supervised and rely on emotion annotations. However, the music field lacks large, standardized emotion-annotated datasets, and emotion labels alone cannot fully express the emotional features of music. To address these problems, this paper proposes Semg-GMVAE, a semi-supervised emotional music generation method based on improved Gaussian mixture variational autoencoders (GMVAE). Semg-GMVAE links the rhythm features and mode features of music to emotion, introduces a feature disentanglement mechanism into the GMVAE to learn separate latent variable representations of these two features, and performs semi-supervised clustering inference on them. By manipulating these feature representations, the model generates music with happy, tense, sad, and calm emotions and switches music between these emotions. In addition, this paper investigates why a GMVAE struggles to distinguish data from different emotion categories. Experiments indicate that the key cause lies in the variance regularization term and the mutual information suppression term of the GMVAE evidence lower bound (ELBO): these two terms leave the Gaussian components of the different categories insufficiently dispersed, which degrades both the learned representations and the emotional quality of the generated samples. Semg-GMVAE therefore penalizes and enhances these two terms, respectively, and uses Transformer-XL as the encoder and decoder to improve the modeling of long music sequences. Experimental results on real-world data show that, compared with existing methods, Semg-GMVAE better separates music of different emotions in the latent space, strengthens the association between music and emotion, effectively disentangles the different music features, and achieves better emotional music generation and emotion switching by changing the feature representations.
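The abstract does not give the exact form of the modified ELBO, so what follows is only a sketch under stated assumptions, not the authors' implementation. It illustrates one standard way to isolate a mutual-information term in a GMVAE-style objective: averaged over a batch, the categorical term KL(q(y|x) || p(y)) splits exactly into the mutual information I(x; y) between inputs and their cluster (emotion) assignments plus KL(q(y) || p(y)), where q(y) is the batch-averaged posterior. Because the standard ELBO subtracts this whole term, it penalizes I(x; y), pushing each sample's cluster posterior toward the batch average and working against confident, class-specific assignments. The function names, the uniform prior over the four emotion classes, and the weights `mi_weight` and `marginal_weight` are hypothetical; the variance regularization term is not modeled here.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical sketch, not the Semg-GMVAE implementation: it only shows how the
# categorical KL term of a GMVAE-style ELBO separates into a mutual-information
# part and a marginal part that could then be weighted independently.


def kl_categorical(q: Sequence[float], p: Sequence[float]) -> float:
    """KL(q || p) for categorical distributions; entries with q_k == 0 add nothing."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0.0)


def decompose_categorical_kl(
    posteriors: List[Sequence[float]], prior: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (avg_kl, mutual_info, marginal_kl) with avg_kl == mutual_info + marginal_kl.

    posteriors: one row q(y|x_i) per sample in the batch, each summing to 1.
    prior: p(y), assumed strictly positive (e.g. uniform over emotion classes).
    """
    n, k = len(posteriors), len(prior)
    # Aggregate posterior q(y) = (1/N) * sum_i q(y|x_i) over the batch.
    aggregate = [sum(row[j] for row in posteriors) / n for j in range(k)]
    # Standard ELBO term: batch average of KL(q(y|x_i) || p(y)).
    avg_kl = sum(kl_categorical(row, prior) for row in posteriors) / n
    # Mutual information I(x; y) = batch average of KL(q(y|x_i) || q(y)).
    mutual_info = sum(kl_categorical(row, aggregate) for row in posteriors) / n
    # Distance of the aggregate posterior from the prior.
    marginal_kl = kl_categorical(aggregate, prior)
    return avg_kl, mutual_info, marginal_kl


def reweighted_categorical_penalty(
    posteriors: List[Sequence[float]],
    prior: Sequence[float],
    mi_weight: float = 0.5,
    marginal_weight: float = 1.0,
) -> float:
    """Hypothetical reweighting of the two parts; weights (1.0, 1.0) give back avg_kl."""
    _, mutual_info, marginal_kl = decompose_categorical_kl(posteriors, prior)
    return mi_weight * mutual_info + marginal_weight * marginal_kl


if __name__ == "__main__":
    prior = [0.25] * 4  # assumed uniform prior over happy, tense, sad, calm
    posteriors = [
        [0.90, 0.05, 0.03, 0.02],
        [0.10, 0.80, 0.05, 0.05],
        [0.05, 0.05, 0.85, 0.05],
        [0.02, 0.03, 0.05, 0.90],
    ]
    avg_kl, mi, marg = decompose_categorical_kl(posteriors, prior)
    print(f"avg KL = {avg_kl:.4f}, I(x;y) = {mi:.4f}, marginal KL = {marg:.4f}")
    print(f"reweighted penalty = {reweighted_categorical_penalty(posteriors, prior):.4f}")
```

Setting both weights to 1 recovers the unmodified term; lowering `mi_weight` relaxes the pressure toward uninformative cluster assignments while still pulling the aggregate posterior toward the prior. The weights, or other modifications, actually used by Semg-GMVAE are not stated in this abstract.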

Keywords: emotional music generation; semi-supervised generative models; disentangled representation learning; Gaussian mixture variational autoencoders; Transformer-XL

Xu Bei (胥备), Liu Tong (刘桐)


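School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

Jiangsu Key Laboratory of Big Data Security and Intelligent Processing, Nanjing 210023, China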


Funding: General Program of the Natural Science Foundation of Jiangsu Higher Education Institutions (21KJB520017)

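Journal: Computer Science (计算机科学)

Publisher: Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(8)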