Any-to-Any Voice Conversion Using Double Exchange Representation Separation
In any-to-any voice conversion, the encoder is usually used to disentangle the speech of a single speaker, and the decoder is trained by self-reconstruction. In the conversion phase, however, the decoder must couple the content information of the source speech with the speaker characteristics of the target speech. This creates a mismatch between how the decoder is used in the training phase and in the conversion phase, which degrades voice conversion performance. This paper proposes DERS-VC (Double Exchange Representation Separation Voice Conversion), a voice conversion method based on double exchange representation separation. During self-reconstruction in the training phase, the proposed method uses speech from the same speaker to simulate the voices of different target speakers for self-supervised training. In addition, a conversion invariance loss and a cycle consistency loss are introduced, and the separation is applied cyclically through double exchange representation separation so that the self-reconstructed speech is closer to the original speech. Experimental results show that, compared with AGAIN-VC (Activation Guidance and Adaptive Instance Normalization Voice Conversion), DERS-VC reduces MCD (Mel-Cepstral Distortion) by 4.03% on average and increases MOS (Mean Opinion Score) by 3.62%, improving both the quality and the similarity of the converted speech. These results indicate that double exchange representation separation reduces the decoder mismatch and improves the performance of any-to-any voice conversion.
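To make the reported MCD figure concrete, the following is a minimal sketch of how Mel-Cepstral Distortion and a relative reduction between two systems might be computed. It assumes the commonly used MCD definition over time-aligned mel-cepstral frames and assumes that the 4.03% figure is a relative reduction against the baseline; the abstract confirms neither. The function names `mel_cepstral_distortion` and `relative_reduction` and the input values are hypothetical and are not taken from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB over frames that are already time-aligned.

    Each frame is a sequence of mel-cepstral coefficients. Whether the
    0th (energy) coefficient is excluded is left to the caller; this is
    an assumption, since the abstract does not state the convention.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("need the same non-zero number of frames")
    # Constant of the standard definition: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same dimension")
        squared_error = sum((a - b) ** 2 for a, b in zip(ref, conv))
        total += scale * math.sqrt(squared_error)
    return total / len(ref_frames)


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical toy frames, for illustration only (not data from the paper).
reference = [[1.0, 0.0], [0.5, 0.5]]
converted = [[0.0, 0.0], [0.5, 0.5]]
print(mel_cepstral_distortion(reference, converted))
```

Lower MCD indicates that the converted speech is spectrally closer to the reference, so a positive value from `relative_reduction` corresponds to an improvement over the baseline system.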