Speech Enhancement Based on Time-frequency Self-attention Residual Temporal Convolutional Networks
The main purpose of speech enhancement (SE) is to remove irrelevant signals such as noise. It is the front-end processing stage of many speech processing tasks and plays an important role in fields such as video conferencing and live broadcasting. However, most studies on SE focus mainly on modeling the long-term contextual dependencies of speech frames, without considering the energy distribution characteristics of speech in the time-frequency domain. This paper proposes a self-attention module based on the time-frequency domain, which makes it possible to explicitly introduce prior knowledge about speech distribution characteristics into the modeling process. Combining this module with a residual temporal convolutional network, a residual temporal convolutional network model based on time-frequency domain self-attention is constructed. To verify the validity of the model, two training targets commonly used in the field of SE, the ideal ratio mask (IRM) and the phase-sensitive mask (PSM), are used in the experiments. The experimental results show that the model significantly improves performance on four objective evaluation metrics for SE and is consistently better than the baseline models.
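
For readers unfamiliar with the two training targets, the following is a minimal sketch of how IRM and PSM are commonly computed from complex short-time spectra. It uses the standard textbook formulations rather than code from this paper; the function names, the `eps` stabilizer, the clipping of PSM to [0, 1], and the random toy spectrograms are all assumptions made for illustration.

```python
import numpy as np

def ideal_ratio_mask(clean_spec, noise_spec, eps=1e-8):
    """Common IRM form: sqrt(|S|^2 / (|S|^2 + |N|^2)), values in [0, 1]."""
    s_pow = np.abs(clean_spec) ** 2
    n_pow = np.abs(noise_spec) ** 2
    return np.sqrt(s_pow / (s_pow + n_pow + eps))

def phase_sensitive_mask(clean_spec, noisy_spec, eps=1e-8, clip=True):
    """Common PSM form: (|S| / |X|) * cos(theta_S - theta_X).

    Clipping to [0, 1] is a frequent practical choice and an assumption here.
    """
    ratio = np.abs(clean_spec) / (np.abs(noisy_spec) + eps)
    psm = ratio * np.cos(np.angle(clean_spec) - np.angle(noisy_spec))
    return np.clip(psm, 0.0, 1.0) if clip else psm

# Toy complex spectrograms (frequency bins x frames); real use would take an STFT.
rng = np.random.default_rng(0)
shape = (4, 5)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise  # additive noise model

irm = ideal_ratio_mask(clean, noise)
psm = phase_sensitive_mask(clean, noisy)
```

In a mask-based SE system of this kind, the network is trained to predict one of these masks from features of the noisy signal, and the estimated mask is then multiplied with the noisy magnitude spectrum to obtain the enhanced speech.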