Dual-Modal Emotion Recognition Method Combining Speech and Text Based on Deep Learning
Aiming at the problems of low emotion recognition accuracy in human-computer interaction and the failure to make full use of features from different modalities, a speech emotion recognition method based on deep-learning fusion of audio and text features was proposed. The STE-ER model was obtained by fusing the speech and text emotion recognition modules at the feature level. Experimental results on the public IEMOCAP dataset showed that using HuBERT to extract features in the SPEECH module improved the emotion recognition rate by 7.1% compared with the spectrogram method, and that using BERT in the TEXT module improved it by 5.1% compared with Word2Vec. Compared with either module alone, the emotion recognition accuracy was significantly improved after the SPEECH and TEXT modules were fused under different strategies, and the recognition rate of the feature-level-fusion STE-ER model was 5.2% higher than that of maximum-confidence decision-level fusion.
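To illustrate the feature-level fusion described above, the following is a minimal PyTorch sketch, not the authors' released STE-ER code: HuBERT utterance embeddings are concatenated with BERT sentence embeddings before a shared classifier. The checkpoint names, the mean/[CLS] pooling choices, and the four-class output (a common IEMOCAP setup) are assumptions.

```python
import torch
import torch.nn as nn
from transformers import HubertModel, BertModel

class SpeechTextFusionSketch(nn.Module):
    """Hypothetical sketch of feature-level speech-text fusion
    for emotion recognition (not the authors' STE-ER model)."""

    def __init__(self, num_emotions=4):  # 4 classes assumed for IEMOCAP
        super().__init__()
        # Pretrained encoders; these checkpoint names are assumptions
        self.speech_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = (self.speech_encoder.config.hidden_size
                     + self.text_encoder.config.hidden_size)
        self.classifier = nn.Linear(fused_dim, num_emotions)

    def forward(self, input_values, input_ids, attention_mask):
        # HuBERT: mean-pool frame-level features into one utterance vector
        speech_feat = self.speech_encoder(input_values).last_hidden_state.mean(dim=1)
        # BERT: use the pooled [CLS] output as the sentence vector
        text_feat = self.text_encoder(input_ids=input_ids,
                                      attention_mask=attention_mask).pooler_output
        # Feature-level fusion: concatenate the two modality vectors
        fused = torch.cat([speech_feat, text_feat], dim=-1)
        return self.classifier(fused)  # emotion-class logits
```

Concatenation before the classifier is what distinguishes this feature-level strategy from decision-level fusion, where each module would produce its own class probabilities and only the outputs (e.g., the maximum-confidence prediction) would be combined.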