Speaker Verification Network Based on Multi-scale Convolutional Encoder
LIU Xiaohu 1, CHEN Defu 1, LI Jun 2, ZHOU Xuwen 1, HU Shan 1, ZHOU Hao 1
Author information
- 1. College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- 2. Zhejiang iFLYTEK Intelligent Technology Co., Ltd., Hangzhou 310000, China
Abstract
Speaker verification is an effective biometric authentication method, and the quality of speaker embedding features largely determines the performance of speaker verification systems. Recently, the Transformer model has shown great potential in the field of automatic speech recognition, but the traditional self-attention mechanism of the Transformer is weak at extracting local features, which makes it difficult to extract effective speaker embedding features; as a result, the performance of Transformer models in speaker verification has struggled to surpass that of previous convolutional network-based models. To improve the Transformer's ability to extract local features, this paper proposes a new self-attention mechanism for the Transformer encoder, called the Multi-scale Convolutional Self-Attention Encoder (MCAE). It uses convolution operations of different kernel sizes to extract multi-time-scale information and fuses features in the time and frequency domains, enabling the model to obtain a richer representation of local features; such an encoder design is more effective for speaker verification. Experiments show that the proposed method achieves better overall performance on three publicly available test sets. MCAE is also more lightweight than the conventional Transformer encoder, which is more favorable for model deployment in applications.
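The abstract describes extracting multi-time-scale information with convolutions of different kernel sizes and fusing the resulting branches. A minimal numpy sketch of that idea is shown below; it is an illustration only, not the paper's MCAE implementation. The averaging kernels, kernel sizes, and fusion-by-concatenation are all assumptions standing in for the learned convolution weights and fusion scheme of the actual model.

```python
import numpy as np

def conv1d_same(x, kernel):
    # "Same"-padded 1D convolution along the time axis of a
    # (feature_bins, frames) array, applied identically to every bin.
    pad = len(kernel) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="constant")
    out = np.empty_like(x, dtype=float)
    for t in range(x.shape[1]):
        out[:, t] = xp[:, t:t + len(kernel)] @ kernel
    return out

def multi_scale_features(x, kernel_sizes=(3, 5, 7)):
    # One branch per kernel size; each size captures a different
    # temporal context. Simple averaging kernels are placeholders
    # for learned convolution weights.
    branches = [conv1d_same(x, np.ones(k) / k) for k in kernel_sizes]
    # Fuse the multi-time-scale branches along the feature axis.
    return np.concatenate(branches, axis=0)

x = np.random.randn(80, 200)        # e.g. 80 mel bins, 200 frames
y = multi_scale_features(x)
print(y.shape)                      # (240, 200): three fused branches
```

In the actual encoder such branches would feed the self-attention computation; here concatenation merely illustrates how different temporal scales yield complementary local feature maps.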
Keywords
Speaker verification / Speaker embedding / Self-attention mechanism / Transformer encoder / Multi-scale convolution
Funding
Hangzhou Major Science and Technology Innovation Project (2022AIZD0055)
Publication year
2024