Speech Enhancement Based on Time-frequency Self-attention Residual Temporal Convolutional Networks

Hou Congying (候聪颖)¹, Yang Wenqing (杨文清)¹, Wang Zhao (王召)¹, Cheng Cong (程聪)¹

Author information

1. NARI Technology Co., Ltd., Nanjing, Jiangsu 211000, China

Abstract

Speech enhancement (SE) aims to remove noise and other irrelevant components from a speech signal. It serves as the front-end stage of many speech processing tasks and plays an important role in applications such as video conferencing and live video streaming. However, most current SE research focuses on modeling the long-term contextual dependencies between speech frames and does not consider how speech energy is distributed in the time-frequency domain. This paper proposes a time-frequency self-attention module that explicitly introduces prior knowledge of the speech distribution characteristics into the modeling process. The module is combined with a residual temporal convolutional network to form a residual temporal convolutional network model based on time-frequency self-attention. To verify the effectiveness of the model, experiments are conducted with two training targets commonly used in SE, the ideal ratio mask (IRM) and the phase-sensitive mask (PSM). The experimental results show that the model significantly improves four objective evaluation metrics commonly used in SE and consistently outperforms the other baseline models.
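The two training targets named in the abstract can be illustrated with a short sketch. It uses the definitions of IRM and PSM that are common in the SE literature, which is an assumption here because the abstract does not state the exact formulation used in the paper. The function names, the `eps` stabilizer, and the toy values are hypothetical and chosen only for illustration.

```python
import numpy as np


def ideal_ratio_mask(clean_spec, noise_spec, eps=1e-8):
    """IRM per time-frequency bin: sqrt(|S|^2 / (|S|^2 + |N|^2)).

    clean_spec and noise_spec are complex STFT arrays of the same shape.
    eps is an assumed stabilizer that avoids division by zero.
    """
    clean_power = np.abs(clean_spec) ** 2
    noise_power = np.abs(noise_spec) ** 2
    return np.sqrt(clean_power / (clean_power + noise_power + eps))


def phase_sensitive_mask(clean_spec, noisy_spec, eps=1e-8):
    """PSM per time-frequency bin: (|S| / |X|) * cos(theta_S - theta_X).

    noisy_spec is the complex STFT of the mixture (clean + noise).
    """
    magnitude_ratio = np.abs(clean_spec) / (np.abs(noisy_spec) + eps)
    phase_difference = np.angle(clean_spec) - np.angle(noisy_spec)
    return magnitude_ratio * np.cos(phase_difference)


# Toy example with a single time-frequency bin (values are illustrative only).
clean = np.array([[3.0 + 0.0j]])
noise = np.array([[0.0 + 4.0j]])
noisy = clean + noise

irm = ideal_ratio_mask(clean, noise)
psm = phase_sensitive_mask(clean, noisy)
print("IRM:", irm)
print("PSM:", psm)
```

For this toy bin the clean magnitude is 3 and the mixture magnitude is 5, so the IRM is about 0.6 and the PSM is about 0.36. The PSM is smaller because it also penalizes the phase mismatch between the clean and noisy spectra. In training, a network would typically predict such a mask from noisy features, and the enhanced spectrum would be obtained by multiplying the mask with the noisy magnitude.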

Key words

speech enhancement/time-frequency/self-attention mechanism/temporal convolutional network

Funding

Project of NARI-TECH Nanjing Control Systems Co., Ltd. (524609230006)

Publication year

2024

Computer and Modernization (计算机与现代化)

Jiangxi Computer Society; Jiangxi Institute of Computing Technology

CSTPCD

Impact factor: 0.472

ISSN: 1006-2475

Number of references: 3
