In speech enhancement, Auto-Encoder (AE) structures are typically used to extract features automatically. However, the features obtained in this manner are singular and redundant, and cannot adequately capture the contextual dependencies of speech signals. Therefore, a speech-enhancement method, MSF-CI, that incorporates multi-scale features and contextual information is proposed. First, a multi-scale convolutional block extracts multi-scale features of the speech signal, addressing the problem of single-scale features. Second, an attention mechanism focuses on the key spatial and channel information of the extracted features to reduce feature redundancy. Finally, a Gated Convolutional Recurrent Neural (GCRN) network learns the long-span context-dependent relations of the speech signal, while gated linear units improve the nonlinear learning ability and thus the generalization of the network. Experimental results show that the proposed MSF-CI method outperforms comparable single-channel speech-enhancement models such as GRN, DPT-FSNet, and U-Net in terms of the speech-perception quality and short-time objective intelligibility of the enhanced speech at low Signal-to-Noise Ratios (SNRs) and in different noise environments. At an SNR of 0 dB, the average speech-perception quality and average short-time objective intelligibility of the proposed method are 1.49 and 0.761, respectively. The generalizability of the proposed method is verified on the Amdo Tibetan corpus, where its average speech-perception quality and average short-time objective intelligibility improve by 20.7% and 11.3%, respectively, relative to the noisy speech. Therefore, the MSF-CI model not only enhances speech quality and intelligibility but also generalizes better.
Key words
speech enhancement / multi-scale feature / attention mechanism / Gated Convolutional Recurrent Neural (GCRN) network / Logarithmic Power Spectrum (LPS)
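Two of the building blocks named in the abstract, the multi-scale convolutional block and the gated linear unit, can be illustrated with a minimal numpy sketch. This is only a hedged illustration of the underlying operations: the actual MSF-CI kernel sizes, learned filters, attention weights, and GCRN layers are not specified in the abstract, so the placeholder filters and kernel sizes below are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_unit(x, axis=-1):
    """GLU(x) = a * sigmoid(b), where a and b are the two halves of x
    split along `axis`. The sigmoid gate controls how much of each
    feature passes through, improving nonlinear modeling capacity."""
    a, b = np.split(x, 2, axis=axis)
    return a * sigmoid(b)

def multi_scale_features(signal, kernel_sizes=(3, 5, 7)):
    """Multi-scale feature extraction sketch: parallel 1-D convolutions
    with different receptive fields, stacked along a channel axis.
    The averaging kernels here are placeholders for learned filters."""
    feats = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k  # placeholder filter (learned in practice)
        feats.append(np.convolve(signal, kernel, mode="same"))
    return np.stack(feats, axis=0)  # shape: (num_scales, len(signal))
```

For example, `gated_linear_unit(np.array([1.0, 2.0, 0.0, 0.0]))` gates the first half `[1.0, 2.0]` by `sigmoid([0.0, 0.0]) = [0.5, 0.5]`, yielding `[0.5, 1.0]`; stacking convolutions at several kernel sizes gives each frame features with different temporal receptive fields.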