Neural Networks, 2022, Vol. 150. DOI: 10.1016/j.neunet.2022.03.003

Two-stage streaming keyword detection and localization with multi-scale depthwise temporal convolution

Hou, Jingyong (1); Xie, Lei (1); Zhang, Shilei (2)

Author Information

  • 1. School of Computer Science, Northwestern Polytechnical University
  • 2. China Mobile Research Institute

Abstract

A keyword spotting (KWS) system running on smart devices should accurately detect the appearances and predict the locations of predefined keywords in audio streams, with a small footprint and high efficiency. To this end, this paper proposes a new two-stage KWS method that combines a novel multi-scale depthwise temporal convolution (MDTC) feature extractor with a two-stage keyword detection and localization module. The MDTC feature extractor efficiently learns multi-scale feature representations with dilated depthwise temporal convolutions, modeling both the temporal context and speech-rate variation. We use a region proposal network (RPN) as the first-stage KWS. At each frame, we design multiple time regions, all of which take the current frame as the end position but have different start positions. These time regions (formally, anchors) indicate rough keyword location candidates. With frame-level features from the MDTC feature extractor as inputs, the RPN learns to generate keyword region proposals based on the designed anchors. To alleviate the keyword/non-keyword class imbalance problem, we introduce a hard example mining algorithm to select effective negative anchors during RPN training. The keyword region proposals from the first-stage RPN contain keyword location information, which is subsequently used to explicitly extract keyword-related sequential features to train the second-stage KWS. The second-stage system learns to classify region proposals into keyword IDs and to regress them toward the ground-truth keyword regions. Experiments on the Google Speech Command dataset show that the proposed MDTC feature extractor surpasses several competitive feature extractors with a new state-of-the-art command classification error rate of 1.74%. With the MDTC feature extractor, we further conduct wake-up word (WuW) detection and localization experiments on a commercial WuW dataset.
Compared to a strong baseline, our proposed two-stage method achieves a 27%-32% relative reduction in false rejection rate at one false alarm per hour, while for keyword localization, the two-stage approach achieves a mean intersection-over-union ratio above 0.95, clearly better than the one-stage RPN method. (c) 2022 Elsevier Ltd. All rights reserved.
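The core MDTC idea from the abstract can be illustrated in a minimal plain-Python sketch: each feature channel is filtered independently over time (depthwise), taps are spaced by a dilation rate so that stacked layers cover progressively longer temporal scales, and the convolution is causal so it suits streaming. The function names, shapes, and dilation schedule below are illustrative assumptions, not the paper's implementation.

```python
def depthwise_temporal_conv(frames, weights, dilation):
    """Causal dilated depthwise 1-D convolution over a frame sequence.

    frames:   list of frames, each a list of C channel values
    weights:  per-channel kernels; weights[c] is a list of K taps
    dilation: spacing (in frames) between taps; dilation=1 is ordinary conv
    """
    num_channels = len(weights)
    kernel_size = len(weights[0])
    out = []
    for t in range(len(frames)):
        frame_out = []
        for c in range(num_channels):
            acc = 0.0
            for k in range(kernel_size):
                # Causal: tap k looks back k * dilation frames,
                # so the output at t never depends on future frames.
                src = t - k * dilation
                if src >= 0:
                    acc += weights[c][k] * frames[src][c]
            frame_out.append(acc)
        out.append(frame_out)
    return out


def receptive_field(kernel_size, dilations):
    """Frames of left context seen by a stack of causal dilated convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

For example, a stack with kernel size 5 and dilations (1, 2, 4, 8) covers `1 + 4 * (1 + 2 + 4 + 8) = 61` frames of context, which is how a few dilated layers model long temporal context and speech-rate variation without large kernels; multi-scale features are then obtained by reading out intermediate layers with different effective scales.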

Key words

Keyword spotting / Wake-up word detection and localization / Temporal convolution / Multi-scale / Two-stage


Publication Year

2022

Journal

Neural Networks (EI, SCI indexed)
ISSN: 0893-6080
Citations: 2
References: 74