
Multi-Head Attention Time Domain Audiovisual Speech Separation Based on Dual-Path Recurrent Network and Conv-TasNet
Current audiovisual speech separation models mostly concatenate video features and audio features directly, without fully modeling the relationship between the modalities; as a result, visual information is underused and separation performance is unsatisfactory. This paper takes the interconnection between visual and audio features fully into account and proposes the Multi-Head Attention Time-Domain Audiovisual Speech Separation (MHATD-AVSS) model, which combines a multi-head attention mechanism with the Convolutional Time-domain audio separation Network (Conv-TasNet) and the Dual-Path Recurrent Neural Network (DPRNN). An audio encoder and a visual encoder extract the audio features and the lip features of the video, respectively; multi-head attention then fuses the two modalities into joint audiovisual features, which are passed through the DPRNN separation network to obtain the separated speech of each speaker. Experiments on the VoxCeleb2 dataset are evaluated with the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Signal-to-Noise Ratio (SNR) metrics. The results show that, when separating mixed speech of two, three, or four speakers, the proposed method improves SDR by at least 1.87 dB, and by up to 2.29 dB, over traditional separation networks. This indicates that the method takes the phase information of the audio signal into account, makes better use of the correlation between visual and audio information, extracts more accurate audiovisual features, and achieves better separation.
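The following is a minimal PyTorch sketch, not the authors' implementation, of the cross-modal fusion step the abstract describes: audio-encoder features serve as attention queries while visual lip features serve as keys and values, so each audio frame attends to the lip frames most relevant to it. The module name CrossModalFusion, the feature dimension of 256, the head count of 8, and the residual-plus-normalization design are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse audio and visual (lip) features with multi-head attention.

    Audio frames act as queries; visual frames act as keys/values,
    so each audio frame gathers the lip information relevant to it.
    Dimensions and head count are illustrative assumptions.
    """
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_audio, dim) from the audio encoder
        # visual: (batch, T_video, dim) lip embeddings from the visual encoder
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        # Residual connection keeps the original audio features available
        # alongside the attended visual context.
        return self.norm(audio + fused)

if __name__ == "__main__":
    batch, t_audio, t_video, dim = 2, 400, 100, 256
    audio = torch.randn(batch, t_audio, dim)
    visual = torch.randn(batch, t_video, dim)
    out = CrossModalFusion(dim)(audio, visual)
    print(out.shape)  # torch.Size([2, 400, 256])

In the full model described by the abstract, the fused features would then be segmented and processed by a DPRNN-style separator to estimate a mask or source for each speaker; that stage is omitted here for brevity.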

Speech separation; Audiovisual fusion; Cross-modal attention; Dual-path recurrent network; Conv-TasNet

Lan Chaofeng, Jiang Pengwei, Chen Huan, Zhao Shilong, Guo Xiaoxia, Han Yulan, Han Chuang


School of Measurement-Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China

HIT Satellite Technology Co., Ltd., Harbin 150023, China

China Ship Development and Design Center, Wuhan 430064, China


National Natural Science Foundation of China (11804068); Natural Science Foundation of Heilongjiang Province (LH2020F033)

2024

Journal of Electronics & Information Technology
Sponsors: Institute of Electronics, Chinese Academy of Sciences; Department of Information Sciences, National Natural Science Foundation of China

Indexed in CSTPCD and the Peking University Core Journals list
Impact factor: 1.302
ISSN: 1009-5896
Year, Volume (Issue): 2024, 46(3)