Environmental Sound Classification Based on Multiple Groups of Multi-resolution Features and Wavelet Channel Attention
Current Environmental Sound Classification (ESC) methods extract audio features that capture too little information along the time and frequency dimensions. To address this problem, a classification method based on multiple groups of multi-resolution features and wavelet channel attention is proposed. The network input is a composite feature built from multiple groups of multi-resolution features. Using multiple groups of filters and multiple frequency resolutions provides data augmentation in both the time and frequency dimensions and makes the information in the features complementary. To better measure the importance of each channel, a wavelet channel attention module is designed for one-dimensional audio image features. The module uses the Discrete Wavelet Transform (DWT) to combine the low-frequency and high-frequency sub-bands of the signal into channel scalars, and applies Gram-Schmidt orthogonalization to diversify the information the network extracts during the compression (squeeze) stage of channel attention. A Long Short-Term Memory (LSTM) network retains information over long time spans, which improves the long-term reliability of learning. Experimental results show that the method reaches classification accuracies of 98.7% on the ESC-10 dataset and 93.6% on the ESC-50 dataset, providing a new research approach for audio feature processing.
Keywords: ESC; multiple groups of multi-resolution features; wavelet channel attention; LSTM network
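
The abstract names the ingredients of the wavelet channel attention module (a DWT that splits each channel into low- and high-frequency sub-bands, a channel scalar formed from those sub-bands, and Gram-Schmidt orthogonalization in the compression stage) but not its exact construction. The following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation: it assumes a single-level Haar wavelet, assumes the two sub-bands are combined by concatenation, assumes each channel is compressed by an inner product with its own orthonormal filter, and replaces any learned excitation layers with a plain sigmoid gate. The function names `haar_dwt`, `gram_schmidt`, and `wavelet_channel_attention` are hypothetical. The multi-resolution front end and the LSTM are omitted.

```python
import numpy as np


def haar_dwt(x):
    """Single-level Haar DWT along the last axis (length must be even)."""
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / np.sqrt(2.0)   # approximation (low-frequency) sub-band
    high = (even - odd) / np.sqrt(2.0)  # detail (high-frequency) sub-band
    return low, high


def gram_schmidt(vectors, eps=1e-12):
    """Orthonormalize the rows of `vectors` with Gram-Schmidt."""
    basis = []
    for v in np.asarray(vectors, dtype=float):
        for b in basis:
            v = v - np.dot(v, b) * b    # remove the component along b
        norm = np.linalg.norm(v)
        if norm > eps:                  # skip (near) linearly dependent rows
            basis.append(v / norm)
    return np.stack(basis)


def wavelet_channel_attention(features, filters):
    """Reweight channels using DWT-based channel scalars.

    features: array of shape (channels, length), length even.
    filters:  array of shape (channels, length) with orthonormal rows
              (assumption: one compression filter per channel).
    """
    low, high = haar_dwt(features)
    # Assumption: the sub-bands are combined by concatenation.
    combined = np.concatenate([low, high], axis=-1)
    # Compression stage: one scalar per channel.
    scalars = np.sum(combined * filters, axis=1)
    # Assumption: a plain sigmoid gate stands in for learned excitation layers.
    weights = 1.0 / (1.0 + np.exp(-scalars))
    return features * weights[:, None], weights


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 16))              # 4 channels, 16 samples
filters = gram_schmidt(rng.standard_normal((4, 16)))
scaled, weights = wavelet_channel_attention(features, filters)
```

Because the rows of `filters` are orthonormal, each channel is projected onto a different direction, which is one plausible reading of "diversifying" the compressed information. This only works when the number of channels does not exceed the feature length.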