Speech synthesis requires converting input text into a speech signal composed of phonemes, words, and utterances. Existing speech synthesis methods treat the utterance as a whole, and it is difficult for them to accurately synthesize speech signals of different lengths. In this paper, we analyze the hierarchical relationships embedded in speech signals, design a Conformer-based hierarchical text encoder and a Conformer-based hierarchical speech encoder, and propose a speech synthesis model built on this hierarchical text-speech Conformer. First, the model constructs hierarchical text encoders according to the length of the input text signal, comprising three levels: phoneme-level, word-level, and utterance-level text encoders. Each level describes text information at a different length and uses the Conformer's attention mechanism to learn the relationships between the temporal features of the signal at that length. With the hierarchical text encoder, the model can identify the information that needs to be emphasized at each length within an utterance and effectively extract text features at different lengths, alleviating the uncertainty in the duration of the synthesized speech signal. Second, the hierarchical speech encoder likewise comprises three levels: phoneme-level, word-level, and utterance-level speech encoders. At each level, the text features are used as the Conformer's query vectors, and the speech features are used as its key and value vectors, so as to extract the matching relationship between text features and speech features. The hierarchical speech encoder and these text-speech matching relations alleviate the inaccurate synthesis of speech signals of different lengths. The hierarchical text-speech encoder modeled in this paper can be flexibly embedded into a variety of existing decoders, providing more reliable speech synthesis results through the complementarity between text and speech. Experimental validation on two datasets, LJSpeech and LibriTTS, shows that the Mel cepstral distortion of the proposed method is smaller than that of existing speech synthesis methods.
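To make the cross-attention design concrete, the following is a minimal PyTorch sketch of one level of the hierarchical speech encoder, where text features act as the queries and speech features as the keys and values. All module names, dimensions, and hyperparameters here are illustrative assumptions rather than the authors' implementation, and the sketch omits the per-level granularity handling (phoneme/word/utterance pooling and alignment) that the full model would require.

```python
import torch
import torch.nn as nn

class CrossAttentionLevel(nn.Module):
    """Hypothetical sketch of one level (phoneme, word, or utterance) of the
    hierarchical speech encoder: text features attend over speech features,
    followed by a Conformer-style convolution module. Dimensions and layer
    choices are assumptions, not the paper's implementation."""

    def __init__(self, d_model=256, n_heads=4, conv_kernel=31):
        super().__init__()
        # Cross-attention: query = text, key/value = speech.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        # Conformer-style block: pointwise conv + GLU, depthwise conv,
        # batch norm, activation, pointwise conv.
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size=conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1),
        )
        self.conv_norm = nn.LayerNorm(d_model)

    def forward(self, text_feats, speech_feats):
        # text_feats:   (batch, T_text, d_model)   -- queries
        # speech_feats: (batch, T_speech, d_model) -- keys and values
        fused, _ = self.cross_attn(query=text_feats,
                                   key=speech_feats,
                                   value=speech_feats)
        x = self.attn_norm(text_feats + fused)
        # The convolution module expects (batch, d_model, T).
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.conv_norm(x + y)

# Illustrative usage: three stacked levels (phoneme -> word -> utterance).
levels = nn.ModuleList(CrossAttentionLevel() for _ in range(3))
text = torch.randn(2, 50, 256)     # e.g. 50 text frames
speech = torch.randn(2, 200, 256)  # e.g. 200 speech frames
for level in levels:
    text = level(text, speech)
print(text.shape)  # torch.Size([2, 50, 256])
```

One plausible reading of this design choice: using the text sequence as the query side keeps the fused output aligned with the text length, which fits the abstract's goal of matching each text unit to its corresponding span of speech features at every level of the hierarchy.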