Single-channel speech enhancement method based on time-frequency information gradient estimation
[Objective] Speech enhancement can improve the performance of speech translation systems in real-world noisy environments. This study addresses two problems of existing speech enhancement methods based on probabilistic diffusion models: the structure of the generated speech is damaged, and global features are difficult to model. [Methods] We propose a single-channel speech enhancement method based on time-frequency information gradient estimation. The complex spectrum of the speech is first fed into an encoder to extract deep representations. Residual fast Fourier convolution (Res-FFC) is then introduced to repair the generated speech and to model global speech features, and time-domain speech information is incorporated during encoding and decoding. [Results] Experiments on the public Voice Bank-DEMAND dataset show that, compared with SGMSE, a complex time-frequency-domain speech enhancement network based on score-based generative models, the proposed method improves the objective metrics SI-SDR and WB-PESQ by 0.5 and 0.19, respectively. [Conclusions] By incorporating Res-FFC and time-domain speech information, the proposed method improves the model's ability to capture global speech features, effectively suppresses noise, and improves speech quality.
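For readers unfamiliar with the SI-SDR metric reported above, the following is a minimal sketch of its commonly used definition, conventionally expressed in dB. It is not taken from the paper: the function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are assumptions made for illustration, and the authors' evaluation code may differ in such details.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio (dB) of two equal-length 1-D signals.

    Illustrative helper only; name, eps, and mean removal are assumptions.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Remove the mean of each signal so a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    est = [x - mean_e for x in estimate]
    ref = [x - mean_r for x in reference]

    # Project the estimate onto the reference: alpha = <est, ref> / <ref, ref>.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)

    # Split the estimate into a scaled-reference target and a residual.
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy signals: the distortion is orthogonal to the clean signal and has
# 1/100 of its energy, so the score is about 20 dB.
clean = [1.0, -1.0, 1.0, -1.0]
enhanced = [1.1, -0.9, 0.9, -1.1]
print(round(si_sdr(enhanced, clean), 3))
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the enhanced signal leaves the score unchanged, which is why the metric suits comparisons between systems whose output levels differ.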