Optimizing Whisper models for Amazigh ASR: a comparative analysis
Abstract

Recent breakthroughs in Natural Language Processing have significantly enhanced the presence of Automatic Speech Recognition (ASR) systems in daily life. The advent of transformer-based architectures has revolutionized the field, providing substantial advances in model performance and capabilities. Despite this progress, ASR models for low-resource languages remain underdeveloped due to the scarcity of pertinent data. This research addresses that gap by focusing on Amazigh, a low-resource language with limited digital tools for ASR. We collected a dataset of 11,644 audio files spanning 300 classes, consisting of 200 isolated words and 100 short sentences, from speakers of the Tarifit accent of Amazigh in the Al Hoceima region of northern Morocco. Using the pretrained Whisper encoder with a classification head of five fully connected layers, we aimed to enhance ASR performance through fine-tuning. Our best system achieved an accuracy of 98.51% when fine-tuning the encoder of the multilingual small Whisper model on our collected dataset. Our results demonstrate the superior performance of the fine-tuned Whisper model over baseline models such as a 2DCNN, a 2DCNN-LSTM, and an encoder-only architecture trained from scratch. This study not only underscores the effectiveness of fine-tuning strategies in improving ASR performance but also highlights the potential of the Whisper model for multilingual ASR applications, particularly in under-resourced languages like Amazigh.
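To make the described architecture concrete, below is a minimal PyTorch sketch of a pretrained Whisper encoder topped with a classification head of five fully connected layers over 300 classes, matching the setup the abstract outlines. The abstract specifies only the number of head layers, the class count, and the multilingual small Whisper model; the hidden-layer widths, the mean pooling over time, and the use of the Hugging Face `transformers` checkpoint `openai/whisper-small` are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel


class WhisperEncoderClassifier(nn.Module):
    """Pretrained Whisper encoder with a five-layer fully connected head."""

    def __init__(self, num_classes: int = 300,
                 model_name: str = "openai/whisper-small"):
        super().__init__()
        # Reuse only the encoder of the pretrained multilingual Whisper model.
        self.encoder = WhisperModel.from_pretrained(model_name).encoder
        d_model = self.encoder.config.d_model  # 768 for whisper-small
        # Five fully connected layers; the hidden widths here are assumptions.
        self.head = nn.Sequential(
            nn.Linear(d_model, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        # input_features: (batch, 80, 3000) log-mel spectrogram frames.
        hidden = self.encoder(input_features).last_hidden_state
        pooled = hidden.mean(dim=1)  # mean-pool over time (an assumption)
        return self.head(pooled)


# Usage: featurize a 16 kHz waveform and classify it into one of 300 classes.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
waveform = torch.randn(16000)  # placeholder 1-second clip
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000,
                           return_tensors="pt")
model = WhisperEncoderClassifier(num_classes=300)
logits = model(inputs.input_features)  # shape: (1, 300)
```

Framing the task as 300-way classification over isolated words and short sentences, rather than open-vocabulary transcription, is what allows the encoder alone (without Whisper's decoder) to be fine-tuned with a standard cross-entropy objective.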