Multi-model Fusion Speech Wake-up Word Detection Method Based on Ghost-SE-Res2Net
Speech Wake-up Word Detection (WWD) is a key technology in the field of voice interaction. The choice of detection window size significantly affects WWD performance. This study proposes a novel multi-model fusion method that improves WWD performance by fusing the detection results obtained with small and large detection windows. The method comprises two classification models, one using a small detection window and one using a large detection window, both based on a lightweight SE-Res2Net network named Ghost-SE-Res2Net. The multi-scale mechanism of the Squeeze-and-Excitation Res2Net (SE-Res2Net) structure significantly improves WWD performance. In Ghost-SE-Res2Net, Ghost convolution first replaces the ordinary convolution in SE-Res2Net to reduce the model parameter count. An attention pooling layer then replaces the global average pooling layer to further improve WWD performance. During detection, the maximum of the detection results from three consecutive small detection windows is fused with the detection result from one large detection window to determine whether the wake-up word is triggered. In addition, a hard sample mining algorithm is introduced during training so that the classification models selectively learn from difficult-to-detect wake-up word samples, which improves their detection performance. The system is evaluated on the Mobvoi dataset, which contains two wake-up words. The experimental results show that, at 0.5 false alarms per hour, the system achieves false rejection rates of 0.46% and 0.43% for the two wake-up words, respectively. This performance is on par with that of the state-of-the-art baseline, while the system's parameter count is 31% smaller than the baseline's.
Wake-up Word Detection (WWD); Ghost block; Res2Net structure; False Rejections (FR); multi-model fusion
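
The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract states only that the maximum of three consecutive small-window results is fused with one large-window result; the weighted average used as the fusion operator, the `weight` and `threshold` parameters, and the function names below are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def fused_score(small_scores: Sequence[float], large_score: float,
                weight: float = 0.5) -> float:
    """Fuse small-window and large-window wake-up word scores.

    small_scores: posterior scores from three consecutive small windows.
    large_score:  posterior score from one large window.
    weight:       hypothetical mixing weight (assumption, not from the paper).
    """
    if len(small_scores) != 3:
        raise ValueError("expected scores from exactly three small windows")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Step stated in the abstract: take the maximum over the three
    # consecutive small-window detection results.
    small_peak = max(small_scores)
    # Assumed fusion operator: a weighted average of the two scores.
    return weight * small_peak + (1.0 - weight) * large_score


def is_triggered(small_scores: Sequence[float], large_score: float,
                 threshold: float = 0.5, weight: float = 0.5) -> bool:
    """Return True if the fused score reaches the (assumed) threshold."""
    return fused_score(small_scores, large_score, weight) >= threshold


if __name__ == "__main__":
    # One small window fires strongly and the large window agrees.
    print(is_triggered([0.1, 0.9, 0.2], 0.8))
    # Neither the small windows nor the large window is confident.
    print(is_triggered([0.1, 0.2, 0.1], 0.3))
```

In practice the threshold would be tuned on held-out data to reach a target operating point, such as the 0.5 false alarms per hour reported above.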