End-to-End Speech Recognition in the News Domain Based on an Improved Conformer

张济民¹, 早克热·卡德尔¹, 艾山·吾买尔¹, 申云飞², 汪烈军¹

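Author information

1. School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830001; Xinjiang Multilingual Information Technology Laboratory, Xinjiang University, Urumqi, Xinjiang 830001
2. Xinjiang Multilingual Information Technology Laboratory, Xinjiang University, Urumqi, Xinjiang 830001; School of Software, Xinjiang University, Urumqi, Xinjiang 830001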

Abstract

Most open-source Chinese speech recognition datasets target the general domain, and open-source corpora for the news domain are lacking. This paper therefore constructs CH_NEWS_ASR, a Chinese speech recognition dataset for the news domain, and verifies its validity with the RNN, Transformer, and Conformer models of the ESPNET-0.9.6 framework. Experiments show that the best model reaches a CER of 4.8% and an SER of 39.4% on this corpus. Because news broadcasters speak relatively fast, the average text length in the dataset is 28 characters, twice the average text length of the Aishell_1 dataset. In addition, the training objectives used in earlier work are usually defined at the character or word level and lack an explicit sentence-level relationship. This paper therefore proposes a sentence-level consistency module that is combined with the Conformer model to directly reduce the representation difference between the source speech and the target text. On the open-source Aishell_1 dataset, the CER is reduced by 0.4% and the SER by 2%; on the CH_NEWS_ASR dataset, the CER is reduced by 0.9% and the SER by 3%. The results show that the method effectively improves speech recognition quality without increasing the number of model parameters.
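The abstract reports results as CER (character error rate) and SER (sentence error rate). As a minimal sketch of how these two metrics are commonly computed, assuming the standard edit-distance definition rather than the exact scoring script used by the authors, the following Python code may help. The function names `edit_distance` and `cer_and_ser` and the sample sentences are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer_and_ser(refs, hyps):
    """Return (CER, SER) over paired reference and hypothesis transcripts.

    CER = total character edits / total reference characters.
    SER = sentences with at least one error / number of sentences.
    """
    if len(refs) != len(hyps) or not refs:
        raise ValueError("refs and hyps must be non-empty and of equal length")
    total_edits = 0
    total_chars = 0
    wrong_sentences = 0
    for ref, hyp in zip(refs, hyps):
        distance = edit_distance(ref, hyp)
        total_edits += distance
        total_chars += len(ref)
        if distance > 0:
            wrong_sentences += 1
    return total_edits / total_chars, wrong_sentences / len(refs)


# Hypothetical example: one substituted character in the first sentence.
refs = ["今天天气很好", "新闻联播"]
hyps = ["今天天气真好", "新闻联播"]
cer, ser = cer_and_ser(refs, hyps)
print(f"CER={cer:.3f} SER={ser:.3f}")
```

Under this definition a single wrong character marks the whole sentence as wrong, which is one plausible reason why SER is much higher than CER on a corpus with long utterances.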

Key words

end-to-end speech recognition / Conformer / sentence-level agreement

Funding

Xinjiang Uygur Autonomous Region Science and Technology Innovation Leading Talent Project, High-Level Leading Talent (2022TSYCLJ0036)

Publication year

2024

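Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences

Indexed in: CSTPCD, CSCD, CHSSCD, Peking University Core Journals (北大核心)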

Impact factor: 0.8

ISSN: 1003-0077

Number of references: 18