Voiceprint recognition technology is widely used for human identity verification and has also made progress in animal species recognition. Existing models, however, suffer from insufficient feature expression ability, and their time complexity and inference speed still need to be optimized without sacrificing performance. In this paper, we propose a novel architecture, Enhanced Res2block connected Enhanced Context Aware Masking (ERes-ECAM), for vocal animal embedding learning. It adopts a Densely-connected Time Delay Neural Network (D-TDNN) as the backbone. To blur irrelevant noise while extracting more effective key information, an Enhanced Context Aware Masking (ECAM) module with a multi-granularity pooling method is used in the D-TDNN layers. The front end is connected to a residual module, and the features extracted within the residual block are fused by means of Local Feature Fusion (LFF) to capture local information, which improves the accuracy and robustness of the voiceprint verification system. Experiments were conducted on two test sets, Anim-Celeb and Pig-Celeb. The Equal Error Rate (EER) of the proposed architecture reached 6.88% and 7.24%, respectively, and the accuracies of recognizing the animal species and the pig species reached 93.12% and 92.76%, respectively.
deep learning; voiceprint recognition; context aware masking; LFF; animal species recognition
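The Equal Error Rate quoted in the abstract is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). The sketch below illustrates one common way to estimate it from verification trial scores. It is not the evaluation code used in this paper: the function name `equal_error_rate`, the threshold sweep over observed scores, and the example scores are all assumptions made for illustration. Note that the reported accuracies (93.12% and 92.76%) coincide with 100% minus the corresponding EER.

```python
def equal_error_rate(scores, labels):
    """Estimate (eer, threshold) from verification trials.

    scores: similarity scores; higher means "more likely the same individual".
    labels: 1 for target (same-individual) trials, 0 for non-target trials.
    Hypothetical helper for illustration only.
    """
    target = [s for s, y in zip(scores, labels) if y == 1]
    nontarget = [s for s, y in zip(scores, labels) if y == 0]
    if not target or not nontarget:
        raise ValueError("need at least one target and one non-target trial")

    best = None
    # Sweep every observed score as a candidate decision threshold.
    for t in sorted(set(scores)):
        # FAR: fraction of non-target trials accepted (score >= threshold).
        far = sum(s >= t for s in nontarget) / len(nontarget)
        # FRR: fraction of target trials rejected (score < threshold).
        frr = sum(s < t for s in target) / len(target)
        gap = abs(far - frr)
        # Keep the threshold where FAR and FRR are closest to each other.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Made-up scores, not data from the paper.
    demo_scores = [0.9, 0.4, 0.6, 0.1]
    demo_labels = [1, 1, 0, 0]
    eer, thr = equal_error_rate(demo_scores, demo_labels)
    print(f"EER = {eer:.2%} at threshold {thr:.2f}")
```

With a finite number of trials FAR and FRR rarely cross exactly, so this sketch reports the mean of the two rates at the threshold where they are closest. Evaluation toolkits often interpolate between adjacent thresholds instead.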