Phoneme recognition method based on a fusion network under low signal-to-noise ratio
To address the low accuracy of phoneme recognition under low signal-to-noise ratio (SNR) conditions, a new recognition method is proposed. First, the Fbank features of speech are extracted and fed into the A-R-B-CTC model, which is built from a multi-head attention mechanism, ResNet, BLSTM, and CTC, for phoneme recognition. Then, Wave-U-Net is used to perform image denoising on the Fbank, MFCC, GFCC, and logarithmic spectrum speech features, and denoising the Fbank features is found to yield a lower phoneme error rate. The THCHS30 dataset is used for experimental validation in a 0 dB white noise environment. The results show that, before Fbank denoising, the proposed A-R-B-CTC model reduces the average phoneme error rate by 4.38%, 2.5%, and 1.96% compared with the BLSTM-CTC, ResNet-BLSTM-CTC, and Transformer models, respectively; after Fbank denoising, the phoneme error rates of all four models are significantly reduced, and the proposed A-R-B-CTC model still performs well compared with the other three models. In addition, good results are also achieved at other signal-to-noise ratios.
phoneme recognition; Wave-U-Net; end-to-end; multi-headed self-attention; Transformer model
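The evaluation setup in the abstract rests on two quantities: a white-noise mixture at a fixed SNR (0 dB) and the phoneme error rate. The sketch below illustrates both under stated assumptions. The abstract does not describe the authors' noise-mixing or scoring procedure, so the function names `add_white_noise` and `phoneme_error_rate`, the Gaussian noise source, and the edit-distance definition of the error rate (edit distance divided by reference length) are illustrative choices, not the paper's implementation. The phoneme labels in the usage lines are hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus Gaussian white noise scaled to the requested SNR in dB.

    Assumption: SNR is defined as 10*log10(signal_power / noise_power),
    with both powers measured over the whole utterance.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Rescale the noise so the measured power ratio matches snr_db.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def phoneme_error_rate(reference, hypothesis):
    """Levenshtein distance between phoneme sequences over reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ph in enumerate(hypothesis, start=1):
            cost = 0 if ref_ph == hyp_ph else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Usage with a synthetic tone and hypothetical phoneme labels.
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(16000)]
noisy = add_white_noise(clean, snr_db=0.0)
per = phoneme_error_rate(["n", "i3", "h", "ao3"], ["n", "i2", "h", "ao3"])
```

At 0 dB the scaled noise carries the same average power as the speech, which is why this condition is a demanding test for a recognizer. In the usage lines, one substituted phoneme out of four gives an error rate of 0.25.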