Phoneme recognition method based on a fusion network at low signal-to-noise ratio

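Abstract (translated from Chinese): To address the low accuracy of phoneme recognition at low signal-to-noise ratios (SNR), a new recognition method is proposed. Fbank features are extracted from speech and fed into the A-R-B-CTC model, which is built from a multi-head attention mechanism, ResNet, BLSTM, and CTC, to perform phoneme recognition. Wave-U-Net is used to apply image denoising to the speech features Fbank, MFCC, GFCC, and the log spectrum, and denoised Fbank features are found to yield a lower phoneme error rate. Experiments are carried out on the THCHS30 dataset in a 0 dB white-noise environment. The results show that, before Fbank denoising, the proposed A-R-B-CTC model reduces the average phoneme error rate by 4.38%, 2.5%, and 1.96% compared with the BLSTM-CTC, ResNet-BLSTM-CTC, and Transformer models, respectively. After Fbank denoising, the phoneme error rates of all four models drop markedly, and the proposed A-R-B-CTC model still performs well compared with the other three. Good results are also obtained at other SNRs.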
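
The 0 dB white-noise condition mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the noisy speech was generated, so this is only one common approach, assuming additive white Gaussian noise scaled so that the ratio of signal power to noise power matches the target SNR. The function names `add_white_noise` and `measured_snr_db`, the fixed seed, and the sine tone standing in for a speech waveform are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the requested SNR in dB.

    Hypothetical helper: the paper's actual noise-mixing procedure is not given.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # From SNR_dB = 10 * log10(P_signal / P_noise), solve for the noise power.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Measure the SNR in dB of a noisy signal given its clean reference."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# A 440 Hz tone sampled at 16 kHz stands in for one second of speech.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_white_noise(clean, snr_db=0.0)
print(abs(measured_snr_db(clean, noisy)) < 1e-6)
```

At 0 dB the noise carries as much power as the signal, which is why this condition is a demanding test for a recognizer.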

English title: Phoneme recognition method based on fusion network with low signal-to-noise ratio

English abstract: To address the low accuracy of phoneme recognition at low signal-to-noise ratios, a new recognition method is proposed. First, the Fbank features of speech are extracted and input into the A-R-B-CTC model, constructed from a multi-head attention mechanism, ResNet, BLSTM, and CTC, for phoneme recognition. Then, Wave-U-Net is used to perform image denoising on the speech features Fbank, MFCC, GFCC, and the logarithmic spectrum, and denoising the Fbank features is found to give a lower phoneme error rate. The THCHS30 dataset is used for experimental validation in a 0 dB white noise environment. The results show that before Fbank denoising, the proposed A-R-B-CTC model reduces the average phoneme error rate by 4.38%, 2.5%, and 1.96% compared to the BLSTM-CTC, ResNet-BLSTM-CTC, and Transformer models, respectively; after Fbank denoising, the phoneme error rates of the four models are significantly reduced, and the proposed A-R-B-CTC model still performs well compared to the other three models. In addition, good results are also achieved at other signal-to-noise ratios.
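
The phoneme error rate used to compare the models can be sketched as follows. The abstract does not define the metric, so this sketch assumes the usual edit-distance definition: substitutions, deletions, and insertions summed over all utterances, divided by the total number of reference phonemes. The function names and the toy phoneme labels are hypothetical and are not taken from the paper or from THCHS30.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by total reference length (assumed definition)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Toy reference and hypothesis sequences (made-up labels for illustration).
refs = [["n", "i3", "h", "ao3"], ["zh", "ong1", "g", "uo2"]]
hyps = [["n", "i2", "h", "ao3"], ["zh", "ong1", "uo2"]]
per = phoneme_error_rate(refs, hyps)
print(f"PER = {per:.2%}")
```

In this toy case there is one substitution and one deletion against eight reference phonemes.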

English keywords:

phoneme recognition; Wave-U-Net; end-to-end; multi-head self-attention; Transformer model

Authors:

Huang Huibo (黄辉波), Shao Yubin (邵玉斌), Long Hua (龙华), Du Qingzhi (杜庆治)

Affiliation:

Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500

Keywords (translated from Chinese):

phoneme recognition; Wave-U-Net; end-to-end; multi-head self-attention mechanism; Transformer model

Funding:

Yunnan Provincial Key Laboratory of Media Convergence project

Project number:

220235205

Publication year:

2024

DOI:

10.3979/j.issn.1673-825X.202306100189

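Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition)

Chongqing University of Posts and Telecommunications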

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.66

ISSN: 1673-825X

Year, volume (issue): 2024, 36(4)