
Multi-level Disentangled Personalized Speech Synthesis for Out-of-Domain Speakers Adaptation Scenarios

Personalized speech synthesis aims to generate speech in the voice of a specific speaker. When traditional methods synthesize speech for out-of-domain speakers, the output differs noticeably in timbre from real speech, and disentangling speaker features remains difficult. This paper proposes a multi-level disentangled personalized speech synthesis method for adapting to out-of-domain speakers, that is, speakers not seen during training. By fusing features at different granularities, the method improves synthesis for out-of-domain speakers under zero-resource conditions. At the sentence level, fast Fourier convolution extracts global speaker features, which improves the model's generalization to out-of-domain speakers. At the phoneme level, a speech recognition model is used to disentangle speaker features, and an attention mechanism captures phoneme-level timbre features. Experiments on the public AISHELL3 dataset show that, for out-of-domain speakers, the method reaches a cosine similarity of 0.697 between speaker feature vectors (an objective evaluation metric), a 6.25% improvement over the baseline model, indicating a stronger ability to model the timbre of out-of-domain speakers.
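The objective metric named in the abstract, cosine similarity between speaker feature vectors, can be illustrated with a minimal sketch. The abstract does not say which speaker encoder produces the vectors or how scores are aggregated, so the averaging over (synthesized, reference) pairs, the function names, and the toy vectors below are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_speaker_similarity(pairs):
    """Average cosine similarity over (synthesized, reference) embedding pairs.

    Assumption: a simple arithmetic mean; the paper may aggregate differently.
    """
    scores = [cosine_similarity(s, r) for s, r in pairs]
    return sum(scores) / len(scores)


# Toy 3-dimensional vectors standing in for real speaker embeddings
# (hypothetical values, not data from the paper).
toy_pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # same direction
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal directions
]
score = mean_speaker_similarity(toy_pairs)
```

A score closer to 1 means the synthesized speech's speaker vector points in nearly the same direction as the reference speaker's vector; the metric ignores vector magnitude.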

Keywords: speech synthesis; zero-shot; speaker representation; out-of-domain speaker; feature disentanglement

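Authors: Gao Shengxiang (高盛祥), Yang Yuanzhang (杨元樟), Wang Linqin (王琳钦), Mo Shangbin (莫尚斌), Yu Zhengtao (余正涛), Dong Ling (董凌)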


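Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500

Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500

Yunnan Key Laboratory of Media Convergence (Yunnan Daily Press Group), Kunming, Yunnan 650228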


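Funding: National Natural Science Foundation of China (62376111, U23A20388, 61972186, U21B2027); Yunnan High-Tech Industry Development Project (201606); Yunnan Fundamental Research Program (202001AS070014); Yunnan Science and Technology Talent and Platform Program (202105AC160018); Open Fund of the Yunnan Key Laboratory of Media Convergence (220225702); Yunnan Key Research and Development Program (202303AP140008, 202103AA080015)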

2024

Journal of Guangxi Normal University (Natural Science Edition)
Guangxi Normal University


Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.448
ISSN: 1001-6600
Year, volume (issue): 2024, 42(4)