Temporal-Alignment Style Control for Text-to-Speech Synthesis
The goal of style-controlled speech synthesis is to convert natural language text into expressive audio output. Transformer-based style-control algorithms for speech synthesis can improve synthesis speed while maintaining quality, but two shortcomings remain. First, when the lengths of the style reference audio and the input text differ greatly, style information is lost in the synthesized audio. Second, decoding with vanilla attention is prone to repeated, omitted, and skipped words. To address these problems, a temporal-alignment style-control speech synthesis algorithm, TATTS, is proposed, which effectively exploits temporal information in both the encoding and decoding processes. In encoding, TATTS introduces a temporal alignment cross-attention module that jointly trains the style audio and text representations, solving the alignment problem between unequal-length audio and text. In decoding, TATTS accounts for the monotonicity of audio timing and introduces a stepwise monotonic multi-head attention mechanism in the Transformer decoder to eliminate misreading in the synthesized audio. Experimental results show that, compared with the baseline model, TATTS improves the naturalness of the synthesized audio on the LJSpeech and VCTK datasets by 3.8% and 4.8%, respectively, and improves style similarity on the VCTK dataset by 10%, demonstrating the effectiveness of the algorithm and its ability to control and transfer style.
Keywords: speech synthesis; temporal alignment; style control; Transformer; style transfer
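The stepwise monotonic attention described in the abstract constrains the decoder's read head to either stay at its current encoder position or advance by exactly one position per decoder step, which rules out the repetition, omission, and skipping failures of vanilla attention. The sketch below illustrates only this hard, inference-time selection rule under simplifying assumptions; the function name, the greedy stay-or-advance criterion, and the use of raw energies (rather than learned sigmoid gates applied per head) are illustrative choices, not the paper's actual implementation.

```python
def stepwise_monotonic_attend(energies):
    """Hard stepwise monotonic attention at inference time (a sketch).

    energies: a list of rows, one per decoder step, each row holding the
    attention energies over all encoder positions for that step.

    At every decoder step the read head either stays at its current
    encoder position or advances by exactly one, so the resulting
    alignment is monotonic and never skips or revisits encoder frames
    out of order. Returns the chosen encoder index per decoder step.
    """
    pos = 0          # current encoder position of the read head
    path = []        # chosen encoder index at each decoder step
    for step_energies in energies:
        # Greedy stay-or-advance rule: move forward one position only if
        # the next encoder frame scores higher than the current one.
        if pos + 1 < len(step_energies) and step_energies[pos + 1] > step_energies[pos]:
            pos += 1
        path.append(pos)
    return path
```

Because each step changes the position by 0 or 1, the returned alignment path is non-decreasing by construction; in a full model this rule would be applied independently within each attention head of the Transformer decoder.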