An attention-based time-domain speech separation method in noisy environments
Deep learning-based time-domain single-channel speech separation models have achieved significant success in noise-free scenarios. In noisy environments, however, they tend to mistakenly encode noise features as source speech features, which degrades the accuracy of mask estimation and results in suboptimal separation performance. To address this problem, we propose a time-domain speech separation model based on attention mechanisms that mitigates the negative impact of noise on separation performance. First, given the disparate importance of the channels in the temporal encoder's output features, we embed an efficient channel attention (ECA) module within the encoder to weight the channel-wise features. Second, we adopt a graph attention network (GAT) to compute attention coefficients between adjacent frames and aggregate the encoded features of neighboring frames, thereby reducing the influence of noise on mask estimation. Experimental results on the WHAM!, Libri2Mix-Noisy, and Libri3Mix-Noisy datasets demonstrate that the proposed GAT-ECA-based DPRNN (GACA-DPRNN) outperforms the DPRNN baseline in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
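As a rough illustration of the encoder-side channel weighting described above, the following PyTorch sketch implements a standard ECA block (global average pooling over time followed by a 1-D convolution across channels, as in ECA-Net). The kernel size and the exact placement within the encoder are assumptions for illustration, not details taken from this paper.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: pool over time, then a 1-D conv
    across channels to produce per-channel weights."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # Conv over the channel axis captures local cross-channel
        # interaction without dimensionality reduction.
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -- encoder output features
        y = x.mean(dim=-1)             # (batch, channels): global average pool
        y = self.conv(y.unsqueeze(1))  # (batch, 1, channels): conv across channels
        w = self.sigmoid(y).squeeze(1) # (batch, channels): weights in (0, 1)
        return x * w.unsqueeze(-1)     # reweight each channel of the features
```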
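Likewise, the frame-level aggregation could look like the single-head GAT sketch below, computed over a chain graph in which each frame attends to itself and its immediate left and right neighbors. The graph construction, the single attention head, and the `FrameGAT` name are illustrative assumptions; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameGAT(nn.Module):
    """Single-head graph attention over a chain of frames, with
    GAT-style coefficients (Velickovic et al., 2018)."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)    # shared linear transform
        self.a = nn.Linear(2 * dim, 1, bias=False)  # scorer over [h_i || h_j]

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, frames, dim) -- encoded features per frame
        h = self.W(h)
        # Neighbors via shifts; boundary frames fall back to themselves.
        left = torch.cat([h[:, :1], h[:, :-1]], dim=1)
        right = torch.cat([h[:, 1:], h[:, -1:]], dim=1)
        neigh = torch.stack([left, h, right], dim=2)  # (B, T, 3, D)
        hi = h.unsqueeze(2).expand_as(neigh)          # (B, T, 3, D)
        # Attention coefficients over each frame's 3-node neighborhood.
        e = F.leaky_relu(self.a(torch.cat([hi, neigh], dim=-1)), 0.2)
        alpha = torch.softmax(e, dim=2)               # (B, T, 3, 1)
        return F.elu((alpha * neigh).sum(dim=2))      # aggregated frame features
```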