电脑与信息技术 (Computer and Information Technology), 2024, Vol. 32, Issue 6: 38-42.

Dual-modal Recognition Method of Emotion Combining Speech with Text Based on Deep Learning

刘泽昊 1, 董胡 2, 赵新民 1, 钱盛友 1

Author Information

  • 1. School of Physics and Electronic Science, Hunan Normal University, Changsha 410081, Hunan, China
  • 2. Changsha Normal University, Changsha 410100, Hunan, China

Abstract

To address the low accuracy of emotion recognition in human-computer interaction and the failure to fully exploit features from different modalities, a speech emotion recognition method that fuses audio and text features based on deep learning is proposed. The speech and text emotion recognition modules are fused at the feature level to obtain the STE-ER model. Experimental results on the public IEMOCAP dataset show that using HuBERT to extract features in the SPEECH module improves the emotion recognition rate by 7.1% over the spectrogram method, and that BERT in the TEXT module improves the recognition rate by 5.1% over Word2Vec. Fusing the SPEECH and TEXT modules under different strategies clearly improves recognition accuracy over either module alone; in particular, the feature-level-fusion STE-ER model achieves a recognition rate 5.2% higher than maximum-confidence decision-level fusion.
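The two fusion strategies compared in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes utterance-level feature vectors have already been extracted by HuBERT (speech) and BERT (text), and the 768-dimensional vectors and four-class emotion set are illustrative assumptions.

```python
import numpy as np

def feature_level_fusion(speech_feat, text_feat):
    """Feature-level fusion: concatenate the utterance-level speech and
    text feature vectors into a single vector, which a downstream
    classifier would then consume."""
    return np.concatenate([speech_feat, text_feat], axis=-1)

def max_confidence_fusion(speech_probs, text_probs):
    """Decision-level fusion by maximum confidence: each module outputs
    class probabilities; the final label comes from whichever module is
    more confident in its top class."""
    if speech_probs.max() >= text_probs.max():
        return int(np.argmax(speech_probs))
    return int(np.argmax(text_probs))

# Hypothetical 768-d HuBERT and BERT vectors -> one 1536-d fused vector.
speech_vec = np.random.rand(768)
text_vec = np.random.rand(768)
fused = feature_level_fusion(speech_vec, text_vec)
print(fused.shape)  # (1536,)

# Decision-level example with four emotion classes: the text module is
# more confident (0.7 > 0.6), so its predicted class (index 0) wins.
speech_probs = np.array([0.1, 0.6, 0.2, 0.1])
text_probs = np.array([0.7, 0.1, 0.1, 0.1])
print(max_confidence_fusion(speech_probs, text_probs))  # 0
```

Feature-level fusion lets a single classifier learn cross-modal interactions from the joint vector, whereas decision-level fusion keeps the modules independent and only combines their outputs, which is consistent with the abstract's finding that the feature-level STE-ER model performs better.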

Key words

emotion recognition / speech / text / feature-level fusion / deep learning


Publication year: 2024
Journal: 电脑与信息技术 (Computer and Information Technology)
Publisher: Chinese Institute of Electronics; Hunan Electronics Research Institute
Impact factor: 0.256
ISSN: 1005-1228