Temporal Alignment Style Control in Text-to-Speech Synthesis Algorithm

Style-controlled speech synthesis aims to convert natural-language text into expressive audio. Transformer-based style-control speech synthesis algorithms improve synthesis speed while maintaining quality, but two shortcomings remain. First, when the style reference audio and the text differ greatly in length, part of the style is missing from the synthesized audio. Second, decoding based on vanilla attention is prone to repeated, omitted, and skipped words. To address these problems, a temporal-alignment style-control speech synthesis algorithm, Temporal Alignment Text-to-Speech (TATTS), is proposed; it exploits temporal information in both the encoding and decoding processes. In encoding, TATTS uses a temporal-alignment cross-attention module to jointly train style-audio and text representations, which resolves the alignment of audio and text of unequal length. In decoding, TATTS takes the temporal monotonicity of audio into account and introduces a stepwise monotonic multi-head attention mechanism into the Transformer decoder, which resolves misreading in the synthesized audio. Compared with the baseline model, TATTS improves the naturalness of the synthesized audio by 3.8% on the LJSpeech dataset and 4.8% on the VCTK dataset, and improves style similarity on VCTK by 10%. These results support the effectiveness of the algorithm and demonstrate its capability for style control and transfer.
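The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no equations, so the code assumes plain scaled dot-product cross-attention with text tokens as queries and style-audio frames as keys and values, and it borrows the single-head recurrence commonly used in the stepwise monotonic attention literature. All function and variable names (`cross_attention`, `stepwise_monotonic_step`, `text_repr`, `style_frames`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention; returns one output row per query.

    Assumption: text tokens act as queries and style-audio frames as keys
    and values, so the output length follows the text, not the audio.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def stepwise_monotonic_step(prev_alpha, stay_prob):
    """One decoder step of a stepwise monotonic alignment (assumed form).

    alpha_i[j] = alpha_{i-1}[j] * p[j] + alpha_{i-1}[j-1] * (1 - p[j-1]),
    where p[j] is the probability of staying on input position j.
    Weight can only stay or advance by one position per step.
    """
    new_alpha = []
    for j in range(len(prev_alpha)):
        stay = prev_alpha[j] * stay_prob[j]
        move = prev_alpha[j - 1] * (1.0 - stay_prob[j - 1]) if j > 0 else 0.0
        new_alpha.append(stay + move)
    return new_alpha


# Toy data: 3 text tokens and 5 style-audio frames, each with 2 features.
text_repr = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
style_frames = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5], [0.8, 0.7], [1.0, 0.9]]
aligned_style = cross_attention(text_repr, style_frames, style_frames)

# Start with all alignment weight on the first input position.
alpha = [1.0, 0.0, 0.0]
for _ in range(2):
    alpha = stepwise_monotonic_step(alpha, [0.5, 0.5, 0.5])

print(len(aligned_style), alpha)
```

In this toy run the style summary has three rows, one per text token, even though the reference audio has five frames, which is the sense in which cross-attention sidesteps the length mismatch. After two decoder steps with a stay probability of 0.5, the alignment weights are [0.25, 0.5, 0.25]: weight has moved forward by at most one position per step and never backward, which is the property that discourages repeats and skips.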

Keywords: speech synthesis; temporal alignment; style control; Transformer; style transfer

Guo Ao (郭傲), Xu Boyan (许柏炎), Cai Ruichu (蔡瑞初), Hao Zhifeng (郝志峰)

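School of Computers, Guangdong University of Technology, Guangzhou 510006, Guangdong, China

College of Science, Shantou University, Shantou 515063, Guangdong, China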

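Funding: National Natural Science Foundation of China (61876043, 61976052, 62206064); National Science Fund for Excellent Young Scholars (62122022); Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0111501)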

2024

Journal of Guangdong University of Technology
Guangdong University of Technology

Impact factor: 0.628
ISSN: 1007-7162
Year, volume (issue): 2024, 41(2)