Voice Conversion Combining Vector Quantization and CTC with Pre-Trained Representations
Pre-trained models have enabled significant breakthroughs in non-parallel-corpus Voice Conversion (VC) through Self-Supervised Pre-trained Representations (SSPR). Features extracted by pre-trained models contain rich content information, which has driven the widespread adoption of SSPR. This study proposes a VC model that combines SSPR with Vector Quantization (VQ) and Connectionist Temporal Classification (CTC), using the SSPR extracted from a pre-trained model as input to improve the quality of single VC. Effectively decoupling content and speaker representations has become a key issue in VC. Taking SSPR as the initial content information, VQ is applied to decouple content and speaker representations from speech. However, VQ alone merely discretizes the content information, making it difficult to separate a pure content representation from speech. To further remove residual speaker information from the content representation, a CTC-loss-guided content encoder is proposed. CTC not only serves as an auxiliary network that accelerates model convergence; its additional text supervision can also be optimized jointly with VQ to achieve complementary performance and learn pure content representations. The speaker representation is obtained through style-embedding learning, and the two representations are used as inputs to the VC system. The proposed method is evaluated on the open-source CMU dataset and the VCTK corpus. Experimental results show that it achieves an objective Mel-Cepstral Distortion (MCD) of 8.896 dB and subjective Mean Opinion Scores (MOS) of 3.29 for speech naturalness and 3.22 for speaker similarity, all better than those of the baseline model. The proposed method thus achieves the best performance in terms of conversion quality and speaker similarity.
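As a rough illustration of the training objective described above, the following PyTorch-style sketch shows how a VQ codebook/commitment loss and a CTC loss on the quantized content features could be optimized jointly. The module names, feature dimensions, vocabulary size, and loss weights are illustrative assumptions and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Straight-through vector quantizer with codebook and commitment losses."""

    def __init__(self, num_codes=256, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):
        # z: (B, T, dim) continuous content features from the encoder
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        indices = dist.argmin(-1)                        # nearest codebook entry per frame
        z_q = self.codebook(indices)                     # quantized (discrete) content
        codebook_loss = F.mse_loss(z_q, z.detach())      # pull codes toward encoder outputs
        commit_loss = F.mse_loss(z, z_q.detach())        # keep encoder close to the codes
        vq_loss = codebook_loss + self.beta * commit_loss
        z_q = z + (z_q - z).detach()                     # straight-through gradient estimator
        return z_q, vq_loss


class ContentEncoder(nn.Module):
    """Maps SSPR features to content codes; a CTC head adds text supervision."""

    def __init__(self, sspr_dim=768, dim=256, vocab_size=40):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(sspr_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.vq = VectorQuantizer(dim=dim)
        self.ctc_head = nn.Linear(dim, vocab_size)       # phoneme / character logits
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, sspr, text, feat_lens, text_lens):
        z_q, vq_loss = self.vq(self.proj(sspr))          # (B, T, dim)
        log_probs = self.ctc_head(z_q).log_softmax(-1)   # (B, T, vocab_size)
        ctc_loss = self.ctc(log_probs.transpose(0, 1),   # CTC expects (T, B, vocab_size)
                            text, feat_lens, text_lens)
        return z_q, vq_loss, ctc_loss


def total_loss(recon_loss, vq_loss, ctc_loss, lambda_vq=1.0, lambda_ctc=0.5):
    # Joint objective: decoder reconstruction + VQ + CTC (weights are assumptions).
    return recon_loss + lambda_vq * vq_loss + lambda_ctc * ctc_loss
```

In a complete system of the kind the abstract describes, the quantized content codes would be concatenated with a speaker style embedding and fed to a decoder, whose reconstruction loss supplies `recon_loss` in the joint objective.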