Speech Emotion Recognition Using Focal Loss-Based ATCN-GRU
FAN Yonghong 1, HUANG Heming 1, ZHANG Huiyun 1
Author Information
- 1. College of Computer Science, Qinghai Normal University, Xining 810008, Qinghai, China; State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, Qinghai, China
Abstract
To address the loss of spatial information in RNNs and the neglect of temporal information in CNNs, the temporal convolutional network (TCN) is introduced and combined with a bidirectional gated recurrent unit (Bi-GRU) and an attention mechanism to construct the acoustic model ATCN-GRU, which further improves the performance of speech emotion recognition; focal loss is also introduced to mitigate the unbalanced recognition results caused by uneven training samples in the EMODB and IEMOCAP databases. First, TCN residual blocks select the most representative and robust features from the hand-crafted features. Second, the Bi-GRU learns context-dependent information from the speech samples, and the attention mechanism learns the degree of correlation between the model's input and output sequences, so that more attention is paid to the effective information. Finally, emotions are classified by a Softmax layer. Compared with previous work, ATCN-GRU achieves better recognition performance: average accuracies of 88.17%, 85.98%, and 65.83% on the CASIA, EMODB, and IEMOCAP databases, respectively. After focal loss is introduced, the average accuracies on the EMODB and IEMOCAP databases reach 86.26% and 66.30%, respectively.
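The abstract describes a fixed pipeline: TCN residual blocks refine hand-crafted acoustic features, a Bi-GRU models bidirectional context, an attention layer weights the Bi-GRU outputs over time, and a Softmax layer classifies the emotion. The PyTorch sketch below illustrates that pipeline only; the layer widths, kernel size, dilation schedule, and the simple learned-projection attention are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the ATCN-GRU pipeline described in the abstract.
# Hidden sizes, kernel width, dilations, and the attention form are
# assumptions for illustration, not the paper's exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNResidualBlock(nn.Module):
    """Two dilated causal 1-D convolutions with a residual connection."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1, dropout=0.2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # causal left-padding
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.dropout = nn.Dropout(dropout)
        self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):  # x: (batch, channels, time)
        y = F.relu(self.conv1(F.pad(x, (self.pad, 0))))
        y = self.dropout(y)
        y = F.relu(self.conv2(F.pad(y, (self.pad, 0))))
        y = self.dropout(y)
        return F.relu(y + self.downsample(x))  # residual connection

class ATCNGRU(nn.Module):
    def __init__(self, n_feats, n_classes, hidden=128):
        super().__init__()
        self.tcn = nn.Sequential(
            TCNResidualBlock(n_feats, hidden, dilation=1),
            TCNResidualBlock(hidden, hidden, dilation=2),
        )
        self.bigru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each timestep
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):  # x: (batch, time, n_feats) of hand-crafted features
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (B, C, T)
        h, _ = self.bigru(h)                             # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)           # attention weights over time
        ctx = (w * h).sum(dim=1)                         # attention-weighted context
        return self.fc(ctx)                              # logits; Softmax applied in the loss
```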
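Focal loss counters the class imbalance the abstract mentions by down-weighting examples the model already classifies confidently: the multi-class form replaces cross-entropy -log(p_t) with -(1 - p_t)^γ · log(p_t), where p_t is the predicted probability of the true class. A hedged sketch follows; γ = 2 is a common default, and the paper's exact γ (and any per-class weighting) is not reproduced here.

```python
# Multi-class focal loss: FL = -(1 - p_t)^gamma * log(p_t).
# gamma=2.0 is a common default, assumed here for illustration.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    log_p = F.log_softmax(logits, dim=-1)                      # (batch, n_classes)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()              # easy examples are down-weighted

# Usage: loss = focal_loss(model(features), labels)
```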
Keywords
Speech emotion recognition / Temporal convolutional network / Bi-directional gated recurrent unit / Attention mechanism / Focal loss
Publication Year
2024