Acta Electronica Sinica, 2024, Vol. 52, Issue (10): 3482-3492. DOI: 10.12263/DZXB.20240048

End-to-End Speech Keyword Spotting Training Method Based on Sample's Class Uncertainty

He Qianhua (贺前华) 1, Chen Yongqiang (陈永强) 1, Zheng Ruowei (郑若伟) 1, Huang Jinxin (黄金鑫) 1

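Author information

  • 1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong 510641, China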

Abstract

End-to-end deep learning is the mainstream approach to speech keyword spotting. Research has concentrated on network structure optimization, the choice of modeling units, and search strategies, and has progressed quickly, but training efficiency has received comparatively little attention. To address the training efficiency of deep learning models, this paper proposes a class uncertainty sampling (CUS) strategy for selecting training samples that accelerates convergence. The core idea is that, in the middle and late stages of training, the forward output of the network's output layer is used to measure each sample's class uncertainty; this measure is converted into a selection probability, and a subset of training samples is drawn at random for subsequent training. Because only a subset is used in each epoch, much training time is saved. Easy samples have high class certainty, so they are less likely to take part in subsequent training; this does not impair the model's discriminative ability, and it directs more attention to samples near the decision boundary, which are prone to missed detections or false alarms. Experimental results on the AISHELL-1 Mandarin dataset show that, relative to the conventional training strategy, the average training time is shortened by 60% and the convergence time by 47.5%. At a false alarm rate (FAR) of 0.5 FP/h, the false reject rate (FRR) falls from 4.75% to 3.65%, a relative reduction of 30.1%, and the maximum term weighted value (MTWV) rises from 0.8374 to 0.8531. An analysis of how mislabeled samples participate in training confirms that the method screens out most mislabeled samples, reducing the harm they do to training. Experiments on the large-scale AISHELL-2 Mandarin dataset further confirm the effectiveness of the proposed method.
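
The sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the uncertainty formula or the mapping to a selection probability, so both are assumptions here: uncertainty is taken as the entropy of the softmax output normalized to [0, 1], and the keep probability equals that uncertainty, bounded below by a hypothetical `floor` parameter. The function names are illustrative and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one sample's output-layer logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_uncertainty(probs):
    """Assumed measure: entropy normalized by log(number of classes)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def sample_subset(all_logits, floor=0.05, rng=None):
    """Return indices of samples kept for the next training epoch.

    Each sample is kept with probability max(floor, uncertainty).
    `floor` is a hypothetical lower bound so that easy samples are
    down-weighted rather than excluded outright.
    """
    rng = rng or random.Random()
    kept = []
    for index, logits in enumerate(all_logits):
        keep_prob = max(floor, class_uncertainty(softmax(logits)))
        if rng.random() < keep_prob:
            kept.append(index)
    return kept


# Usage: the first sample is confidently classified, the second is ambiguous,
# so the second is far more likely to be kept.
logits = [[8.0, 0.0, 0.0], [0.2, 0.1, 0.0]]
subset = sample_subset(logits, rng=random.Random(0))
```

In a training loop, a routine like this would only be called after the early epochs, once the output layer carries useful information about each sample; the point at which that switch happens is not stated in the abstract.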

Key words

speech keyword spotting/deep learning/end-to-end/class uncertainty sampling

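Funding

Guangdong Provincial Science and Technology Program (2023A0505050116)

Guangdong Provincial Science and Technology Program (2022A1515011687)

National Natural Science Foundation of China (62371195)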

Publication year

2024
Acta Electronica Sinica
Chinese Institute of Electronics

Indexed in: CSTPCD, Peking University Core Journals
Impact factor: 1.237
ISSN: 0372-2112