Transfer learning of self-supervised models for the Minnan dialect
[Objective] The Minnan dialect, spoken by over 70 million people worldwide, is recognized as one of the seven major Chinese dialects. However, speech recognition technology has focused predominantly on Mandarin, and research on Minnan remains limited. Such research faces significant challenges owing to scarce training data: suitable recordings are few, sourced mainly from a handful of local TV stations, and hindered by copyright issues. In addition, the shortage of proficient Minnan annotators makes annotation both difficult and costly. Developing an efficient and accurate speech recognition system for the Minnan dialect under limited data resources is therefore an urgent and critical issue.

[Methods] This paper presents transfer learning approaches that apply Chinese self-supervised (SSL) models (e.g., Wav2vec 2.0 and HuBERT) to the Minnan dialect speech recognition task. During fine-tuning, an encoder-decoder framework is constructed by connecting generative pre-trained Transformer models, as decoders, to the encoders of Wav2vec 2.0 and HuBERT. Training employs a hybrid CTC/Attention loss function and freezes the parameters of the self-supervised feature encoder during a specified portion of the training steps.

[Results] Experimental results demonstrate that the HuBERT_CN_MTL model outperforms the other models on the Minnan dialect speech recognition task, and the Wav2vec2_CN model performs comparably to the multilingual Wav2vec2_XLSR53 model. Experiments further indicate that, in real telephone-channel scenarios, the SSL models' results remain stable even when the sampling rate is reduced or noise interference is introduced. During transfer learning, the HuBERT_CN_MTL model performs best with a partially frozen encoder strategy: the encoder parameters are frozen for the first 20% of the training period, during which only the decoder is trained, and are then unfrozen for the remaining 80%, during which both the encoder and decoder are trained. This strategy yields significantly better results than keeping the encoder frozen throughout training and reduces the character error rate (CER) by 0.86 percentage points compared with not freezing the encoder at all.

[Conclusions] In this paper, the Chinese self-supervised models Wav2vec 2.0 and HuBERT are applied to the speech recognition task of the Minnan dialect in order to address data scarcity and to improve dialect recognition performance. Experimental results demonstrate that selecting a high-resource language more similar to the target task for transfer learning significantly improves low-resource speech recognition. By optimizing the hybrid CTC/Attention objective function, the model makes full use of the temporal and contextual information in the speech signal, thereby improving recognition accuracy. Furthermore, freezing the encoder for a portion of the fine-tuning steps enhances the model's generalizability and mitigates issues such as overfitting and interference. Although the Chinese self-supervised models perform well on the Minnan dialect recognition task, room for improvement remains, because differences in speech features persist between the Minnan dialect and Mandarin. Future research should therefore further explore self-supervised models specialized for the Minnan dialect as well as more effective transfer learning strategies.
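The hybrid CTC/Attention objective interpolates a frame-level CTC loss computed on the encoder outputs with a token-level cross-entropy loss computed on the Transformer decoder outputs. The following PyTorch fragment is a minimal sketch of such an objective; the function name, tensor shapes, padding conventions, and the interpolation weight `ctc_weight` are illustrative assumptions, not details reported in the abstract.

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs,   # (T, B, V) log-probs from the CTC head
                              att_logits,       # (B, L, V) logits from the Transformer decoder
                              targets,          # (B, L) target token ids, padded with pad_id
                              input_lengths,    # (B,) valid encoder frames per utterance
                              target_lengths,   # (B,) valid target tokens per utterance
                              blank_id=0, pad_id=0, ctc_weight=0.3):
    """Hybrid objective: L = ctc_weight * L_CTC + (1 - ctc_weight) * L_attention.

    ctc_weight is an assumed interpolation weight; the abstract does not state
    the value used in the paper.
    """
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=blank_id, zero_infinity=True)
    att = F.cross_entropy(att_logits.transpose(1, 2), targets,
                          ignore_index=pad_id)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```

The CTC branch encourages monotonic frame-to-token alignment, while the attention branch exploits contextual dependencies in the transcript; weighting the two is the standard way such hybrid systems balance alignment stability against language-modeling capacity.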
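The partially frozen encoder strategy described in the results can likewise be sketched as a simple fine-tuning loop that keeps the SSL encoder frozen for the first 20% of the training steps and unfreezes it afterwards. The interfaces below (`model.encoder`, `model.compute_loss`) are hypothetical placeholders for whatever encoder-decoder wrapper is used; they are not the paper's actual API.

```python
def finetune_with_partial_freeze(model, optimizer, batches,
                                 total_steps, freeze_ratio=0.2):
    """Freeze the SSL encoder for the first `freeze_ratio` of training steps
    (decoder-only training), then unfreeze it and train encoder and decoder."""
    freeze_steps = int(freeze_ratio * total_steps)
    for step, batch in enumerate(batches):
        if step >= total_steps:
            break
        frozen = step < freeze_steps
        for p in model.encoder.parameters():     # hypothetical attribute
            p.requires_grad = not frozen
        loss = model.compute_loss(batch)         # hypothetical hybrid CTC/Attention loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Training the decoder first while the pre-trained encoder is held fixed lets the randomly initialized decoder stabilize before gradients are allowed to perturb the self-supervised representations, which is consistent with the reduced overfitting and interference reported in the conclusions.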