Spoofed Speech Detection Based on CNN-Transformer

Rapid advances in speech synthesis and voice conversion pose a serious threat to voiceprint recognition systems. Existing spoofed speech detection methods adapt poorly to multiple spoofing types and detect unknown spoofing attacks unreliably. To address this, a spoofed speech detection model combining a Convolutional Neural Network (CNN) with a Transformer is proposed. A position-aware feature sequence extraction network is designed, based on SE-ResNet18 with embedded Coordinate Attention (CA); it maps the local time-frequency representation of the speech signal to a high-dimensional feature sequence and introduces two-Dimensional Position Encoding (2D-PE) to preserve the relative positions of features. A multi-scale self-attention mechanism is proposed to model long-term dependencies within the feature sequence at multiple scales, addressing the difficulty the Transformer has in capturing local dependencies. Sequence Pooling (SeqPool) is introduced to extract utterance-level features while retaining the correlation information among the frame-level feature sequences output by the Transformer layers. Experiments on the official Logical Access (LA) dataset of the ASVspoof 2019 challenge show that, compared with current state-of-the-art spoofed speech detection systems, the proposed method reduces the Equal Error Rate (EER) by 12.83% on average and the tandem Detection Cost Function (t-DCF) by 7.81% on average.
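The abstract does not spell out how SeqPool is computed. As a rough illustration only, the sketch below assumes the commonly used attention-style formulation of sequence pooling: a learned linear map scores each frame, a softmax turns the scores into weights over time, and the utterance-level vector is the weighted sum of the frames. The names `seq_pool`, `weights`, and `bias` are hypothetical, and the numbers are toy values, not data from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def seq_pool(frames, weights, bias=0.0):
    """Pool T frame-level vectors (each of length d) into one utterance-level vector.

    `weights` and `bias` stand in for a learned linear layer d -> 1
    (hypothetical names); real values would come from training.
    """
    # One scalar relevance score per frame.
    scores = [sum(w * x for w, x in zip(weights, frame)) + bias for frame in frames]
    # Normalize the scores across the time axis.
    alphas = softmax(scores)
    dim = len(frames[0])
    # Weighted sum over frames: each frame contributes according to its weight.
    pooled = [sum(a * frame[j] for a, frame in zip(alphas, frames)) for j in range(dim)]
    return pooled, alphas


if __name__ == "__main__":
    toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]  # toy values, not from the paper
    pooled, alphas = seq_pool(toy_frames, weights=[0.5, -0.25])
    print("pooling weights:", alphas)
    print("utterance-level vector:", pooled)
```

With all-zero `weights` the pooling reduces to a plain average over frames, while non-zero weights let informative frames dominate. That is one way to read the abstract's claim that pooling keeps information about how the frame-level features relate to one another.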

spoofed speech detection; position-aware sequence; Transformer; feature sequence pooling (SeqPool)

Xu Tongxin (徐童心), Huang Jun (黄俊)

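School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China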

National Natural Science Foundation of China (61771085)

2024

Radio Engineering
The 54th Research Institute of China Electronics Technology Group Corporation

Impact factor: 0.667
ISSN: 1003-3106
Year, volume (issue): 2024, 54(5)