Target Speech Extraction Based on Cross-modal Attention
Target speech extraction, a subtask of speech separation, aims to extract a target speaker's speech from mixed-speech data. Given the natural consistency between visual and auditory information, integrating visual information during model training can guide the model in extracting the target speech. The traditional method concatenates visual and audio features and then applies convolution for channel fusion; however, it fails to effectively exploit the correlation between cross-modal information. To address this problem, a two-stage cross-modal attention feature fusion module was proposed. First, dot-product attention was used to explore the shallow correlation between cross-modal information. Second, self-attention was employed to capture the global dependencies among the target speech features, thereby enhancing the representation of the target speech. The two fusion stages trained separate learnable parameters to adjust the attention weights. In addition, a Gated Recurrent Unit (GRU) was introduced into the Temporal Convolutional Network (TCN) to strengthen the modeling of long-term dependencies in sequential data, thereby improving visual feature extraction and the fusion of audio-visual features. Finally, experiments were conducted on the VoxCeleb2 and LRS2-BBC datasets. The proposed method compared favorably with the baseline methods, achieving improvements of 1.05 dB and 0.26 dB on the respective datasets under the Source-to-Distortion Ratio (SDR) evaluation metric.
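A minimal PyTorch sketch of the two-stage fusion idea described above is given below. The module name, tensor shapes, head count, residual/normalization layout, and the choice of audio features as queries against visual keys and values are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class TwoStageCrossModalFusion(nn.Module):
    """Sketch: stage 1 dot-product cross attention, stage 2 self-attention,
    each stage with its own learnable parameters (assumed design)."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Stage 1: dot-product (cross) attention between the two modalities.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: self-attention over the fused target-speech features,
        # trained with a separate set of parameters.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio:  (batch, T_audio, dim) mixture speech features
        # visual: (batch, T_video, dim) visual (e.g., lip) features, assumed
        #         already projected to the same embedding dimension
        # Stage 1: audio queries attend to visual keys/values, capturing the
        # shallow correlation with the target speaker's visual stream.
        fused, _ = self.cross_attn(query=audio, key=visual, value=visual)
        fused = self.norm1(audio + fused)
        # Stage 2: self-attention models global dependencies within the
        # target-oriented speech features.
        refined, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + refined)


# Hypothetical usage with dummy shapes:
# audio = torch.randn(2, 200, 256); visual = torch.randn(2, 50, 256)
# out = TwoStageCrossModalFusion()(audio, visual)  # -> (2, 200, 256)
```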