Multi-source Separation Method Based on Improved Transformer Model
Current mainstream speech separation models are built on complex recurrent networks or Transformer networks. The high complexity of the Transformer network makes it difficult to train, and the high sampling rate of audio forces very long sample-level inputs from which only incomplete features are obtained; feature loss occurs because such long speech feature sequences cannot be modeled directly. To address this, we propose an improved Transformer-based network model. First, a new downsampling block is added to the encoder of the existing Transformer network to compute high-level features at different time scales and to reduce the complexity of the feature space. Second, feature fusion between the encoder's downsampling layers and the decoder's upsampling layers is added to the decoder of the Transformer network, which prevents feature loss and improves the model's separation capability. Finally, an improved sliding-window attention mechanism is introduced in the separation layer of the model. The sliding window uses a cyclic-shift technique so that each new feature window contains part of the previous window along with its edge information; this completes the information interaction between feature windows, produces both feature encodings and feature position encodings, and increases the correlation between feature information. Experiments show that the separation performance exceeds that of previous methods, reaching 13.5 dB on the SI-SNR metric and 14.1 dB on the SDR metric.
Keywords: upsampling and downsampling layers; Transformer; feature encoding; sliding window attention mechanism; deep learning
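To make the three mechanisms in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the module names (DownBlock, UpBlock, ShiftedWindowAttention), the use of strided Conv1d/ConvTranspose1d with GELU, concatenation-based skip fusion, the channel sizes, the window length, the head count, and the learned per-slot position encoding are all illustrative assumptions, and the boundary attention mask that Swin-style shifted windows normally use is omitted for brevity.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Strided 1-D convolution: halves the time axis so higher-level
    features are computed at a coarser time scale."""
    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size=4, stride=2, padding=1)
        self.act = nn.GELU()

    def forward(self, x):  # x: (B, C, T) -> (B, C', T // 2)
        return self.act(self.conv(x))

class UpBlock(nn.Module):
    """Transposed 1-D convolution that doubles the time axis, then fuses
    the matching encoder feature via a skip connection (concatenation
    followed by a 1x1 convolution), so encoder detail is not lost."""
    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(ch_in, ch_out, kernel_size=4, stride=2, padding=1)
        self.fuse = nn.Conv1d(2 * ch_out, ch_out, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x, skip):  # x: (B, C, T), skip: (B, C', 2T)
        x = self.act(self.deconv(x))
        return self.act(self.fuse(torch.cat([x, skip], dim=1)))

class ShiftedWindowAttention(nn.Module):
    """Self-attention over non-overlapping windows of length `win`.
    When `shift` is set, the sequence is cyclically shifted (torch.roll)
    by win // 2 before windowing, so each new window overlaps the edges
    of the previous windows and information flows between windows."""
    def __init__(self, dim, win, heads=4, shift=False):
        super().__init__()
        self.win = win
        self.shift = win // 2 if shift else 0
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Learned position encoding for each slot inside a window.
        self.pos = nn.Parameter(torch.zeros(win, dim))

    def forward(self, x):  # x: (B, T, C) with T divisible by win
        B, T, C = x.shape
        if self.shift:
            x = torch.roll(x, shifts=-self.shift, dims=1)  # cyclic shift
        w = x.reshape(B * T // self.win, self.win, C) + self.pos
        out, _ = self.attn(w, w, w)
        out = out.reshape(B, T, C)
        if self.shift:
            out = torch.roll(out, shifts=self.shift, dims=1)  # undo shift
        return out

# Illustrative end-to-end shape check with made-up sizes.
enc1, enc2 = DownBlock(64, 128), DownBlock(128, 256)
dec2, dec1 = UpBlock(256, 128), UpBlock(128, 64)
attn = ShiftedWindowAttention(dim=256, win=8, shift=True)

x = torch.randn(2, 64, 64)                       # (batch, channels, time)
s1, s2 = enc1(x), enc1(x)
s2 = enc2(s1)                                    # time: 64 -> 32 -> 16
z = attn(s2.transpose(1, 2)).transpose(1, 2)     # window attention in (B, T, C)
y = dec1(dec2(z, s1), x)                         # upsample with skip fusion
print(y.shape)                                   # torch.Size([2, 64, 64])
```

In this sketch the skip connections play the role of the encoder-decoder feature fusion described in the abstract, and the cyclic shift lets adjacent windows exchange edge information without enlarging the attention cost beyond the window size.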