Any-to-any voice conversion using representation separation auto-encoder

In any-to-any voice conversion with non-parallel corpora, linguistic content and speaker characteristics are difficult to disentangle, which degrades conversion performance. To address this, a voice conversion method called RSAE-VC (representation separation auto-encoder voice conversion) was proposed. The method treats the speaker's characteristics in an utterance as time-invariant and the content information as time-varying, and uses instance normalization and an activation guidance layer in the encoder to separate the two. The decoder then combines the content information of the source speech with the characteristics of the target speaker to synthesize the converted speech. Experimental results show that, compared with the existing AGAIN-VC (activation guidance and adaptive instance normalization voice conversion) method, RSAE-VC reduces the Mel-cepstral distance by 3.11% and the root-mean-square error of pitch frequency by 2.41% on average, and improves the MOS and ABX scores by 5.22% and 8.45%, respectively. RSAE-VC applies a self-content loss to make the converted speech better preserve content information, and a self-speaker loss to better separate the speaker characteristics from the speech, ensuring that as little speaker information as possible remains in the content representation and thereby improving conversion performance.
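The separation mechanism described above can be illustrated with plain instance normalization and AdaIN: per-channel statistics over time are treated as the (time-invariant) speaker part and normalized out, then the target speaker's statistics are re-injected. The following numpy sketch is illustrative only, under that assumption; the function names and toy features are hypothetical and do not reproduce the paper's encoder, activation guidance layer, or losses.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Strip per-channel statistics computed over time (axis=1).

    The channel-wise mean/std over frames are treated as the
    time-invariant speaker part; what remains is the time-varying
    content representation. x has shape (channels, frames).
    Returns (normalized features, removed statistics).
    """
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True) + eps
    return (x - mu) / sigma, (mu, sigma)

def adaptive_instance_norm(content, target_stats):
    """AdaIN: scale and shift the normalized content with the
    target utterance's channel statistics."""
    mu_t, sigma_t = target_stats
    return content * sigma_t + mu_t

# Toy demo with hypothetical 4-channel, 6-frame feature maps.
rng = np.random.default_rng(0)
src = rng.normal(loc=3.0, scale=2.0, size=(4, 6))   # "source" utterance
tgt = rng.normal(loc=-1.0, scale=0.5, size=(4, 6))  # "target" utterance

content, _ = instance_norm(src)    # speaker statistics stripped
_, tgt_stats = instance_norm(tgt)  # target speaker statistics
converted = adaptive_instance_norm(content, tgt_stats)

# The converted features now carry the target's per-channel means.
print(np.allclose(converted.mean(axis=1, keepdims=True), tgt_stats[0]))  # True
```

Because instance normalization makes each channel exactly zero-mean over time, the AdaIN output inherits the target's channel means exactly, which is the sense in which speaker statistics are "swapped" while the temporal content pattern is kept.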

voice conversion; representation separation; adaptive instance normalization; self-content loss; self-speaker loss

JIAN Zhihua, ZHANG Zixu


School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, Zhejiang, China

语音转换 表示分离 自适应实例归一化 自内容损失 自说话人损失

Supported by the National Natural Science Foundation of China (61201301, 61772166)

2024

Journal on Communications
China Institute of Communications

Indexed in: CSTPCD; PKU Core Journals
Impact factor: 1.265
ISSN:1000-436X
Year, volume (issue): 2024, 45(2)