Speech Enhancement Network Based on Parallel Multi-Attention
To address frequency-domain enhancement of speech corrupted by interference, a speech enhancement network based on a parallel multi-attention mechanism and an encoder-decoder structure, called PMAN, is proposed. The network takes as input frequency-domain speech features obtained through a Short-Time Fourier Transform (STFT), namely the amplitude and complex spectra. The encoder integrates the input data using dense convolutional modules. The parallel multi-attention module in the intermediate layer learns both local and global information in the frequency domain and incorporates a Local Patch Attention (LPA) mechanism to capture the Two-Dimensional (2D) structure of the speech spectrum, separating clean speech from interference in the 2D space. The decoder integrates the learned information and generates an amplitude mask and a complex spectrum separately. The final complex spectrum of the speech is obtained via weighted summation of the two outputs, and a joint time- and frequency-domain loss function is used to incorporate phase information. Experimental results on the VoiceBank+DEMAND speech dataset demonstrate that PMAN achieves better speech enhancement performance than TSTNN, a time-domain speech enhancement neural network based on a two-stage transformer, with improvements of 10.8% in Perceptual Evaluation of Speech Quality (PESQ), 1.1% in Short-Time Objective Intelligibility (STOI), and 11.8% in Segmental Signal-to-Noise Ratio (SSNR).
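The output fusion and the joint loss described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `fuse_outputs` and `joint_loss`, the weights `alpha` and `beta`, the reuse of the noisy phase in the masking branch, and the choice of squared-error and absolute-error terms are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def fuse_outputs(noisy_spec, mask, complex_est, alpha=0.5):
    """Weighted sum of a masking branch and a complex-spectrum branch.

    noisy_spec  : complex STFT of the noisy speech, shape (freq, time)
    mask        : real-valued amplitude mask, same shape
    complex_est : complex spectrum predicted directly by the decoder
    alpha       : fusion weight (assumed; not given in the abstract)
    """
    # Masking branch: scale the noisy magnitude and keep the noisy phase
    # (an assumption about how the mask is applied).
    masked = mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))
    # Weighted summation of the two branch outputs.
    return alpha * masked + (1.0 - alpha) * complex_est

def joint_loss(est_spec, clean_spec, est_wave, clean_wave, beta=0.5):
    """Joint time- and frequency-domain loss (illustrative form only)."""
    # Frequency-domain term: mean squared error on the complex spectrum,
    # which penalizes both magnitude and phase errors.
    freq_term = np.mean(np.abs(est_spec - clean_spec) ** 2)
    # Time-domain term: mean absolute error on the waveform.
    time_term = np.mean(np.abs(est_wave - clean_wave))
    return beta * freq_term + (1.0 - beta) * time_term
```

Because the frequency-domain term compares complex values and the time-domain term compares waveforms, both terms in this sketch are sensitive to phase, which is one plausible reading of how a joint loss brings phase information into training.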