Multi-channel Speech Enhancement Based on Fourier Convolution
The construction of neural beamformers is one of the main approaches to multi-channel speech enhancement: the beam weights are solved so that filtering the multi-channel signals yields the target speech. As in the estimation of the spatial covariance matrix in traditional beamforming, spectral-spatial information plays a crucial role in the beam-weight prediction of a neural beamformer. However, many existing methods fail to predict the beam weights optimally because they do not learn spectral-spatial information adequately. To address this challenge, a context feature extractor based on Fourier convolution is proposed, which provides a global receptive field along the frequency axis. Temporal context is also modeled by adding a time-frequency convolutional module that boosts the learning of context from the input spectrograms. In addition, a Convolutional Recurrent Network (CRN) structure is applied, in which the proposed context feature extractor is embedded in the encoders and decoders, and a Convolutional Block Attention Module (CBAM) is inserted in the skip connections. The proposed CRN structure sufficiently captures time-frequency context information and cross-channel spatial features from the input spectrograms. Experimental results show that the proposed approach requires only 1.14 M parameters, a substantial advantage over existing advanced baseline systems.
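The core idea of the Fourier-convolution extractor is that a pointwise multiplication in the frequency-transformed domain couples every frequency bin at once, giving a global receptive field over frequency that an ordinary small convolution kernel lacks. Below is a minimal NumPy sketch of this mechanism; the function name and the per-bin weight parameterization are illustrative assumptions, not the paper's exact layer.

```python
import numpy as np

def fourier_conv_freq(x, w_real, w_imag):
    """Hypothetical sketch of Fourier convolution along the frequency axis.

    x:        (channels, freq_bins, time) real spectrogram features.
    w_real,
    w_imag:   (channels, freq_bins // 2 + 1) per-bin weights, playing the role
              of a learned pointwise transform in the FFT domain.
    Because every FFT coefficient mixes all frequency bins, multiplying the
    coefficients gives each output position a global receptive field over
    the whole frequency axis, then the inverse FFT maps back.
    """
    X = np.fft.rfft(x, axis=1)              # real FFT over the frequency axis
    W = (w_real + 1j * w_imag)[:, :, None]  # broadcast the weights over time
    return np.fft.irfft(X * W, n=x.shape[1], axis=1)

# Sanity check: unit real weights act as the identity (up to FFT round-off).
x = np.random.randn(2, 16, 4)
y = fourier_conv_freq(x, np.ones((2, 9)), np.zeros((2, 9)))
```

In a trainable version the weights would be learned parameters (and the real layer would typically add a channel-mixing convolution and nonlinearity around the spectral transform); the sketch only isolates the global-frequency-coupling step.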