Computer Engineering and Design, 2024, Vol. 45, Issue 4: 1166-1172. DOI: 10.16208/j.issn1000-7024.2024.04.028

Non-autoregressive speech synthesis method combined with lightweight convolution

Zhong Qiaoxia (钟巧霞)¹, Zeng Bi (曾碧)¹, Lin Zhentao (林镇涛)¹, Lin Wei (林伟)¹

Author information

  • 1. School of Computers, Guangdong University of Technology, Guangzhou 510006, Guangdong, China

Abstract

This paper studies how to effectively capture the relationships between phonemes and how to synthesize prosodically rich audio, and proposes LCTTS, a non-autoregressive speech synthesis model combined with lightweight convolution. Lightweight convolution is introduced to establish connections between phonemes, which addresses pronunciation errors. Pitch and energy predictors are added to predict the prosody of the generated speech, which addresses the lack of prosody in the audio. The model is trained to produce Mel spectrograms, which a pre-trained vocoder then converts into audio. Experimental results show that the proposed LCTTS model outperforms the previously proposed SpeedySpeech model: on the Emotional Speech Database dataset, the mean opinion score improves by 2.8% and the Mel cepstral distortion decreases by 0.15.

Key words

speech synthesis/lightweight convolution/prosodic synthesis/Mel spectrum generation/non-autoregressive methods/deep learning/natural language processing

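Funding

National Natural Science Foundation of China (62172111)

Natural Science Foundation of Guangdong Province (2019A1515011056)

Shunde District Core Technology Research Fund (2130218003002)

Publication year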

2024
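Computer Engineering and Design
Institute 706, Second Academy of China Aerospace Science and Industry Corporation

Indexed in: CSTPCD, Peking University Core Journals
Impact factor: 0.617
ISSN: 1000-7024
Number of references: 15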