End-to-End Speech Keyword Spotting Training Method Based on Sample's Class Uncertainty
End-to-end deep learning is the mainstream technology for speech keyword spotting. Research has focused on exploring better network structures, modeling units, and search strategies, and has made considerable progress; however, less attention has been paid to training efficiency. In this paper, a novel class uncertainty sampling (CUS) strategy is proposed to select effective samples for each training epoch. Since only a subset is used, much training time is saved. The core idea of CUS is to measure the class uncertainty of samples from the forward information of the output layer during the middle and late training stages, and to select samples with a probability derived from their class uncertainty. More attention is therefore paid to samples near the decision boundary, which are prone to missed detection or false alarm. Furthermore, the proposed method can shield the interference of mislabeled samples. Experimental results on the AISHELL-1 Mandarin dataset show that fast convergence and better training performance are achieved. Compared with the conventional training strategy, the average training time and the average convergence time were relatively shortened by 60% and 47.5%, respectively. At a false alarm rate (FAR) of 0.5 FP/h, the false reject rate (FRR) was reduced from 4.75% to 3.65%, a relative reduction of 30.1%, and the maximum term weighted value (MTWV) increased from 0.8374 to 0.8531. Moreover, it was experimentally verified that the method shields most mislabeled samples. This conclusion was confirmed by experiments on the large-scale AISHELL-2 Mandarin dataset.
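The sampling rule summarized in the abstract — measuring each sample's class uncertainty from the output-layer posteriors and converting it into a selection probability for the next epoch — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the entropy-based uncertainty measure, the max-normalization, and the `floor` parameter (a minimum keep-probability so confident samples are not excluded entirely) are assumptions.

```python
import numpy as np

def class_uncertainty(probs):
    """Posterior entropy as an uncertainty measure (illustrative; the
    paper's exact measure may differ). High entropy = near the decision
    boundary; low entropy = an "easy", confidently classified sample."""
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def cus_select(posteriors, floor=0.05, rng=None):
    """Return indices of the training subset for the next epoch.

    Each sample is kept with a probability proportional to its
    normalized class uncertainty, clipped below by `floor` so that
    easy samples are rarely (but not never) revisited.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = class_uncertainty(posteriors)
    p = u / (u.max() + 1e-12)      # scale uncertainties to [0, 1]
    p = np.clip(p, floor, 1.0)     # per-sample keep-probability
    keep = rng.random(len(p)) < p  # Bernoulli draw per sample
    return np.flatnonzero(keep)
```

Under this rule the most uncertain sample is always kept (its normalized probability is 1), while confident samples survive only occasionally, which is how the subset shrinks without hurting the decision boundary.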

speech keyword spotting; deep learning; end-to-end; class uncertainty sampling

He Qianhua, Chen Yongqiang, Zheng Ruowei, Huang Jinxin


School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, Guangdong, China


Guangdong Province Science and Technology Plan Project (2023A0505050116, 2022A1515011687); National Natural Science Foundation of China (62371195)

2024

Acta Electronica Sinica
Chinese Institute of Electronics


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 1.237
ISSN: 0372-2112
Year, volume (issue): 2024, 52(10)