Any-to-Any Voice Conversion Using Double Exchange Representation Separation

In any-to-any voice conversion, training typically uses an encoder to disentangle the speech of a single speaker and a decoder to reconstruct that same speech, whereas at conversion time the decoder must combine the content information of the source speech with the speaker characteristics of the target speech. The decoder therefore faces a mismatch between the training phase and the conversion phase, which degrades conversion performance. To address this, this paper proposes DERS-VC (Double Exchange Representation Separation Voice Conversion), a voice conversion method based on double exchange representation separation. During self-reconstruction in the training phase, the method uses speech from the same speaker to simulate speech from different speakers for self-supervised training. A conversion-invariance loss and a cycle-consistency loss are introduced, and the cyclic process of double exchange representation separation brings the self-reconstructed speech closer to the original speech. Experimental results show that, compared with the existing AGAIN-VC (Activation Guidance and Adaptive Instance Normalization Voice Conversion) method, DERS-VC reduces Mel-cepstral distortion (MCD) by 4.03% on average and raises the mean opinion score (MOS) by 3.62%, improving both the quality and the speaker similarity of the converted speech. This indicates that double exchange representation separation trains the decoder more effectively, reduces the decoder mismatch, and improves the performance of any-to-any voice conversion.
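The MCD figure quoted in the abstract is a distance between Mel-cepstral sequences. As a minimal sketch of how such a number is commonly computed (the standard textbook definition, not necessarily the exact evaluation pipeline used by the authors), the code below averages the per-frame distortion over two sequences that are assumed to be already time-aligned and to exclude the energy coefficient. The function names `mel_cepstral_distortion` and `relative_reduction` are hypothetical helpers introduced only for illustration, and the input values are toy data.

```python
import math


def mel_cepstral_distortion(ref, conv):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    ref, conv: lists of frames, each frame a list of mel-cepstral
    coefficients (assumption: same length, 0th coefficient removed).
    """
    if not ref or len(ref) != len(conv):
        raise ValueError("sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, c in zip(ref, conv):
        if len(r) != len(c):
            raise ValueError("frames must have the same dimension")
        squared = sum((a - b) ** 2 for a, b in zip(r, c))
        # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum of squared differences)
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref)


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Toy data only: identical sequences have zero distortion.
frames = [[1.0, 0.5], [0.8, 0.1]]
print(mel_cepstral_distortion(frames, frames))  # 0.0
```

A percentage such as the 4.03% average reduction reported above would correspond to `relative_reduction` applied to the baseline and proposed MCD values, which are not given in this record.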

voice conversion; any-to-any; double exchange; representation separation

ZHANG Zixu (章子旭), JIAN Zhihua (简志华)

School of Communication Engineering, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China

National Natural Science Foundation of China (grant nos. 61201301, 61772166)

Acta Electronica Sinica (电子学报)
Chinese Institute of Electronics

CSTPCD; Peking University Core Journals
Impact factor: 1.237
ISSN: 0372-2112
Year, volume (issue): 2024, 52(6)