Transfer learning of self-supervised models for the Minnan dialect
[Objective] The Minnan dialect, spoken by over 70 million people worldwide, is recognized as one of the seven major Chinese dialects. However, speech recognition technology has focused predominantly on Mandarin, and research on Minnan remains limited. Such research faces significant challenges owing to scarce training data: suitable recordings are few, sourced mainly from a handful of local TV stations, and hindered by copyright issues. In addition, the shortage of proficient Minnan annotators makes annotation both difficult and costly. Developing an efficient and accurate speech recognition system for the Minnan dialect under limited data resources is therefore an urgent and critical issue.

[Methods] This paper presents transfer learning approaches that apply Chinese self-supervised (SSL) models (e.g., Wav2vec 2.0 and HuBERT) to the Minnan dialect speech recognition task. During fine-tuning, an encoder-decoder framework is constructed by connecting generative pre-trained Transformer models, as decoders, to the encoders of Wav2vec 2.0 and HuBERT. Training employs a hybrid CTC/Attention loss function and freezes the parameters of the self-supervised feature encoder during a specified portion of the training steps.

[Results] Experimental results demonstrate that the HuBERT_CN_MTL model outperforms the other models on the Minnan dialect speech recognition task, and the Wav2vec2_CN model performs comparably to the multilingual Wav2vec2_XLSR53 model. Experiments further indicate that, in real telephone-channel scenarios, the SSL models' results remain stable even when the sampling rate is reduced or noise interference is introduced. During transfer learning, the HuBERT_CN_MTL model performs best with a partially frozen encoder strategy: the encoder parameters are frozen for the first 20% of the training period, during which only the decoder is trained, and are then unfrozen for the remaining 80%, during which both the encoder and decoder are trained. This strategy yields significantly better results than keeping the encoder frozen throughout training and reduces the character error rate (CER) by 0.86 percentage points compared with not freezing the encoder at all.

[Conclusions] In this paper, the Chinese self-supervised models Wav2vec 2.0 and HuBERT are applied to the speech recognition task of the Minnan dialect in order to address data scarcity and to improve dialect recognition performance. Experimental results demonstrate that selecting a high-resource language more similar to the target task for transfer learning significantly improves low-resource speech recognition. By optimizing the hybrid CTC/Attention objective function, the model makes full use of the temporal and contextual information in the speech signal, thereby improving recognition accuracy. Furthermore, freezing the encoder for a portion of the fine-tuning steps enhances the model's generalizability and mitigates issues such as overfitting and interference. Although the Chinese self-supervised models perform well on the Minnan dialect recognition task, room for improvement remains, because differences in speech features persist between the Minnan dialect and Mandarin. Future research should therefore further explore self-supervised models specialized for the Minnan dialect as well as more effective transfer learning strategies.
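The hybrid CTC/Attention objective interpolates a frame-level CTC loss computed on the encoder outputs with a token-level cross-entropy loss computed on the Transformer decoder outputs. The following PyTorch fragment is a minimal sketch of such an objective; the function name, tensor shapes, padding conventions, and the interpolation weight `ctc_weight` are illustrative assumptions, not details reported in the abstract.

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs,   # (T, B, V) log-probs from the CTC head
                              att_logits,       # (B, L, V) logits from the Transformer decoder
                              targets,          # (B, L) target token ids, padded with pad_id
                              input_lengths,    # (B,) valid encoder frames per utterance
                              target_lengths,   # (B,) valid target tokens per utterance
                              blank_id=0, pad_id=0, ctc_weight=0.3):
    """Hybrid objective: L = ctc_weight * L_CTC + (1 - ctc_weight) * L_attention.

    ctc_weight is an assumed interpolation weight; the abstract does not state
    the value used in the paper.
    """
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=blank_id, zero_infinity=True)
    att = F.cross_entropy(att_logits.transpose(1, 2), targets,
                          ignore_index=pad_id)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```

The CTC branch encourages monotonic frame-to-token alignment, while the attention branch exploits contextual dependencies in the transcript; weighting the two is the standard way such hybrid systems balance alignment stability against language-modeling capacity.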
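The partially frozen encoder strategy described in the results can likewise be sketched as a simple fine-tuning loop that keeps the SSL encoder frozen for the first 20% of the training steps and unfreezes it afterwards. The interfaces below (`model.encoder`, `model.compute_loss`) are hypothetical placeholders for whatever encoder-decoder wrapper is used; they are not the paper's actual API.

```python
def finetune_with_partial_freeze(model, optimizer, batches,
                                 total_steps, freeze_ratio=0.2):
    """Freeze the SSL encoder for the first `freeze_ratio` of training steps
    (decoder-only training), then unfreeze it and train encoder and decoder."""
    freeze_steps = int(freeze_ratio * total_steps)
    for step, batch in enumerate(batches):
        if step >= total_steps:
            break
        frozen = step < freeze_steps
        for p in model.encoder.parameters():     # hypothetical attribute
            p.requires_grad = not frozen
        loss = model.compute_loss(batch)         # hypothetical hybrid CTC/Attention loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Training the decoder first while the pre-trained encoder is held fixed lets the randomly initialized decoder stabilize before gradients are allowed to perturb the self-supervised representations, which is consistent with the reduced overfitting and interference reported in the conclusions.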