Speaker Verification Network Based on Multi-scale Convolutional Encoder
Speaker verification is an effective biometric authentication method, and the quality of speaker embedding features largely determines the performance of speaker verification systems. Recently, the Transformer model has shown great potential in automatic speech recognition, but it struggles to extract effective speaker embedding features because the traditional self-attention mechanism of the Transformer is weak at capturing local features. As a result, Transformer models have hardly surpassed the earlier convolutional network-based models in speaker verification. To improve the Transformer's ability to extract local features, this paper proposes a new self-attention mechanism for the Transformer encoder, called the multi-scale convolutional self-attention encoder (MCAE). By using convolution operations of different kernel sizes to extract multi-time-scale information and by fusing features in the time and frequency domains, the model obtains a richer representation of local features, and this encoder design is more effective for speaker verification. Experiments show that the proposed method achieves better overall performance on three publicly available test sets. The MCAE is also more lightweight than the conventional Transformer encoder, which is more favorable for deploying the model in applications.
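To make the idea concrete, the following is a minimal sketch of the kind of multi-scale convolutional block the abstract describes: parallel depthwise 1-D convolutions with different kernel sizes capture local patterns at several time scales, and a pointwise convolution mixes information across the feature (frequency) axis. All class names, kernel sizes, and dimensions here are illustrative assumptions, not the paper's actual MCAE implementation.

```python
import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    """Hypothetical multi-scale convolutional block (a sketch, not the
    paper's MCAE). Parallel depthwise convolutions of different kernel
    sizes extract multi-time-scale local features; a 1x1 convolution
    fuses them across the feature (frequency) axis."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise conv per time scale; odd kernels + padding keep length.
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        ])
        self.pointwise = nn.Conv1d(channels, channels, 1)  # cross-channel fusion
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (batch, time, channels), the usual Transformer layout
        y = x.transpose(1, 2)                           # (batch, channels, time)
        y = sum(branch(y) for branch in self.branches)  # fuse time-scale branches
        y = self.pointwise(y).transpose(1, 2)           # back to (batch, time, channels)
        return self.norm(x + y)                         # residual connection

# Example usage on a batch of 8 utterances, 200 frames, 256-dim features:
block = MultiScaleConvBlock(channels=256)
feats = torch.randn(8, 200, 256)
out = block(feats)   # same shape: (8, 200, 256)
```

In an encoder following the abstract's design, a block like this would replace or augment the standard self-attention sublayer; depthwise convolutions keep the parameter count low, which is consistent with the claim that the MCAE is more lightweight than a conventional Transformer encoder.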