Time-domain Speech Separation Based on a Fully Convolutional Neural Network with Multitask Learning
When speech separation is performed with a deep neural network based on a time-frequency mask, the phase spectrum of the mixed signal is commonly used as the phase of the target signal, and no special processing is applied to the gender combination, which results in poor quality of the separated speech. To address this problem, this study introduces a time-domain speech separation approach based on a fully convolutional network with gender combination detection (FCN-GCD) and multitask learning. The network consists primarily of a speech separation module and a gender combination detection module for the mixed speech. In the speech separation module, an FCN is constructed whose input is the time-domain mixed speech signal of two speakers and whose output is the clean speech signal of the target speaker. The FCN compresses features along the convolutional layers of the encoder and reconstructs them along the deconvolutional layers of the decoder, achieving end-to-end speech separation. In addition, through multitask learning, the GCD task for the mixed speech is integrated into the speech separation network. Under the joint constraint of the two tasks, auxiliary information and speech separation features are obtained simultaneously, and these deep features are then combined to enhance the model's ability to separate mixed speech of different gender combinations. By incorporating the GCD task as a secondary task in the speech separation network, parameters are shared between the primary and secondary tasks, thereby strengthening the separation capability of the primary task. Compared with frequency-domain methods, the proposed time-domain FCN-GCD method eliminates the need for phase recovery and frequency-to-time reconstruction, which simplifies processing and improves computational efficiency. Furthermore, it extracts effective auxiliary features from the GCD task, achieving more effective speech separation. Experimental results show that, compared with single-task speech separation methods, the proposed method improves speech quality for all three gender combinations (male-male, female-female, and male-female) and achieves better performance on evaluation metrics such as Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), Signal-to-Interference Ratio (SIR), Signal-to-Distortion Ratio (SDR), and Signal-to-Artifact Ratio (SAR).
deep neural network; speech separation; fully convolutional neural network; feature fusion; multitask learning
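To make the described architecture concrete, the following is a minimal PyTorch sketch of an FCN-GCD-style multitask model: a convolutional encoder compresses the time-domain mixture, a gender-combination detection (GCD) head classifies the mixture, the GCD features are fused back into the bottleneck, and a deconvolutional decoder reconstructs the target speaker's waveform. The layer sizes, kernel widths, fusion scheme, and loss weighting are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of an FCN-GCD-style multitask model (hypothetical hyperparameters).
import torch
import torch.nn as nn

class FCNGCDSketch(nn.Module):
    def __init__(self, channels=(1, 16, 32, 64), num_gender_classes=3):
        super().__init__()
        # Encoder: strided 1-D convolutions compress the time-domain mixture.
        enc = []
        for cin, cout in zip(channels[:-1], channels[1:]):
            enc += [nn.Conv1d(cin, cout, kernel_size=8, stride=2, padding=3), nn.ReLU()]
        self.encoder = nn.Sequential(*enc)

        # Auxiliary branch: gender-combination detection on pooled deep features.
        self.gcd_head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(channels[-1], num_gender_classes))

        # Fusion: project the GCD output and add it to every time step of the
        # bottleneck features before decoding (one simple way to combine the tasks).
        self.fuse = nn.Linear(num_gender_classes, channels[-1])

        # Decoder: transposed convolutions reconstruct the target speaker waveform.
        dec = []
        rev = channels[::-1]
        for cin, cout in zip(rev[:-1], rev[1:]):
            dec += [nn.ConvTranspose1d(cin, cout, kernel_size=8, stride=2, padding=3), nn.ReLU()]
        self.decoder = nn.Sequential(*dec[:-1])  # drop the final ReLU on the waveform output

    def forward(self, mixture):                 # mixture: (batch, 1, samples)
        z = self.encoder(mixture)               # (batch, C, T')
        gcd_logits = self.gcd_head(z)           # (batch, num_gender_classes)
        z = z + self.fuse(gcd_logits).unsqueeze(-1)
        return self.decoder(z), gcd_logits      # separated waveform + GCD prediction

# Joint objective: waveform reconstruction (primary task) + GCD classification (secondary task).
model = FCNGCDSketch()
mix = torch.randn(4, 1, 16384)
target = torch.randn(4, 1, 16384)
gender = torch.randint(0, 3, (4,))              # 0: male-male, 1: female-female, 2: male-female
est, logits = model(mix)
loss = nn.functional.l1_loss(est, target) + 0.5 * nn.functional.cross_entropy(logits, gender)
loss.backward()
```

Because both tasks share the encoder, the joint loss constrains the same parameters from two directions, which is the parameter-sharing effect the abstract attributes to the multitask design.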