Journal Information
International Journal of Speech Technology
Kluwer Academic Publishers

Quarterly

ISSN: 1381-2416

Indexed in: EI, ESCI
Officially published
Coverage years

    Unsupervised phoneme segmentation of continuous Arabic speech

    Hind Ait Mait, Noureddine Aboutabit
    pp. 1-12
    Abstract: The development of a speech recognition system for the Arabic language presents a significant challenge, mainly due to the limited availability of digital resources specific to this language. To achieve vocabulary-independent speech recognition, it is essential to split a given speech signal into smaller units known as phonemes or syllables. This process, called speech segmentation, plays a crucial role in accurately recognizing and understanding speech patterns. Many speech segmentation techniques rely on linguistic information such as phonetic transcription. However, for real-time systems, phonetic transcription is not always available, especially for low-resource languages like Arabic. In this paper, we address the problem of unsupervised segmentation of continuous Arabic speech using two distinct approaches: spectral contrast and the first derivative of Mel Frequency Cepstrum Coefficients (Δ-MFCCs), incorporating an adaptive threshold to determine phoneme boundaries without needing any external knowledge. Intersection-over-union (IoU) was used to match the reference boundaries with the generated boundaries. As far as we know, most prior work applies this technique to image processing; we adapted it to speech processing and achieved a noteworthy level of matching success. The Arabic Speech Corpus was used to evaluate the efficacy of the proposed methods, and comparison with other approaches allowed us to assess their strengths. Notably, our method based on Δ-MFCCs showed a substantial performance improvement of 14%.
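
    As an illustration of the kind of pipeline the abstract describes, the sketch below derives boundary candidates from frame-to-frame changes in Δ-MFCCs with a simple adaptive threshold (mean plus a multiple of the standard deviation) and matches boundary intervals by IoU. It is a hedged sketch only: the paper's exact threshold rule, its spectral-contrast variant, and its evaluation protocol are not reproduced, and all parameter values are illustrative.

```python
# Hypothetical sketch of Δ-MFCC-based boundary detection with an adaptive
# threshold; parameter values and the threshold rule are assumptions, not
# the paper's exact settings.
import numpy as np
import librosa

def delta_mfcc_boundaries(wav_path, sr=16000, n_mfcc=13, k=1.5, hop_length=512):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    d_mfcc = librosa.feature.delta(mfcc)                  # Δ-MFCCs, shape (n_mfcc, T)
    # Frame-level change: Euclidean distance between consecutive Δ-MFCC frames.
    change = np.linalg.norm(np.diff(d_mfcc, axis=1), axis=0)
    threshold = change.mean() + k * change.std()          # adaptive threshold
    peaks = np.where(change > threshold)[0]
    return librosa.frames_to_time(peaks, sr=sr, hop_length=hop_length)

def interval_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```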

    Speaker diarization based on X vector extracted from time-delay neural networks (TDNN) using agglomerative hierarchical clustering in noisy environment

    K. V. Aljinu Khadar, R. K. Sunil Kumar, V. V. Sameer
    pp. 13-26
    Abstract: This paper introduces a speaker diarization system based on speaker embedding parameters, specifically the x-vector. By incorporating auto-correlated MFCC features for x-vector extraction with a pre-trained time delay neural network, the system exhibits enhanced adaptability to noise variations. Speaker clustering is accomplished through agglomerative clustering with PLDA scoring as the distance metric, making the system particularly valuable for speaker identification, especially in forensic applications. The system's noise adaptability is thoroughly evaluated by adding various types of noise, such as red, pink, and white noise, across a wide range of signal-to-noise ratios (20 dB to −20 dB). Additionally, the system's performance is comprehensively assessed by varying the speech duration and the number of speakers, highlighting its robustness and effectiveness in real-world scenarios.
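
    For readers who want to see the clustering step in code, the following sketch groups pre-computed x-vector segment embeddings with agglomerative hierarchical clustering. It is illustrative only: cosine distance stands in for the PLDA scoring used in the paper, and x-vector extraction from the pre-trained TDNN is assumed to have been done beforehand.

```python
# Illustrative sketch: agglomerative hierarchical clustering of x-vectors.
# Cosine distance is a stand-in for the paper's PLDA scoring.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(xvectors: np.ndarray, distance_threshold: float = 0.6):
    """xvectors: (n_segments, dim) array, one embedding per speech segment."""
    clustering = AgglomerativeClustering(
        n_clusters=None,                     # let the threshold decide the speaker count
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    return clustering.fit_predict(xvectors)  # speaker label per segment

# Example with placeholder 512-dimensional embeddings for four segments.
labels = cluster_segments(np.random.randn(4, 512))
```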

    Optimizing Whisper models for Amazigh ASR: a comparative analysis

    Mohamed Daouad, Fadoua Ataa Allah, El Wardani Dadi
    pp. 27-37
    Abstract: Recent breakthroughs in Natural Language Processing have significantly enhanced the presence of Automatic Speech Recognition (ASR) systems in daily life. The advent of transformer-based architectures has revolutionized the field, providing substantial advances in model performance and capabilities. Despite these advances, ASR models for low-resource languages remain underdeveloped due to the scarcity of pertinent data. This research addresses this gap by focusing on the Amazigh language, a low-resource language with limited digital tools for ASR. We collected a dataset of 11,644 audio files comprising 300 classes, including 200 isolated words and 100 short sentences, from speakers of the Amazigh Tarifit accent in the Al Hoceima region of northern Morocco. Using the pre-trained Whisper encoder with a classification head of five fully connected layers, we aimed to enhance ASR performance through fine-tuning. Our best system achieved an accuracy of 98.51% when fine-tuning the encoder of the multilingual small Whisper model on our collected dataset. Our results demonstrate the superior performance of the fine-tuned Whisper model compared to baseline models such as 2DCNN, 2DCNN-LSTM, and encoder-only architectures trained from scratch. This study not only underscores the effectiveness of fine-tuning strategies in improving ASR performance but also highlights the potential of the Whisper model for multilingual ASR applications, particularly for under-resourced languages like Amazigh.
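
    A minimal sketch of the described architecture, assuming the Hugging Face transformers implementation of Whisper (the openai/whisper-small multilingual checkpoint): the pre-trained encoder feeds a classification head of five fully connected layers over 300 classes. The hidden-layer widths and the mean pooling over time are assumptions; the paper's exact head configuration is not given here.

```python
# Hedged sketch of a Whisper-encoder classifier; layer widths and pooling are
# illustrative assumptions, not the paper's exact configuration.
import torch.nn as nn
from transformers import WhisperModel

class WhisperEncoderClassifier(nn.Module):
    def __init__(self, num_classes: int = 300):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
        d_model = self.encoder.config.d_model            # 768 for whisper-small
        self.head = nn.Sequential(                       # five fully connected layers
            nn.Linear(d_model, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, input_features):                   # (batch, 80, 3000) log-Mel features
        hidden = self.encoder(input_features).last_hidden_state
        pooled = hidden.mean(dim=1)                      # average over time frames
        return self.head(pooled)                         # (batch, num_classes) logits
```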

    Deep learning countermeasures for detecting replay speech attacks: a review

    Suresh Veesa, Madhusudan Singh
    pp. 39-51
    Abstract: Automatic speaker verification (ASV) systems are widely accepted for biometric authentication in real-time applications. However, such ASV systems need robust protection against well-known replay attacks, especially in highly sensitive applications. This study provides a detailed overview of existing deep learning based solutions developed for the replay attack detection task. Convolutional neural network architectures are the most widely explored for countermeasure development. In this study, existing deep learning frameworks are categorized into four groups: spectrogram-based deep neural networks (DNN), handcrafted-feature-based DNN, source-feature-based DNN, and end-to-end DNN frameworks. These have demonstrated notable performance in the replay speech detection context. However, their generalisation remains questionable due to the potential challenges posed by unknown types of replay attacks in the future. The study highlights that explorations of excitation source features in the DNN framework are limited. It also discusses a few existing Gaussian mixture model (GMM) based excitation source explorations, indicating that better results may be obtained over existing solutions by using alternative source-feature-based GMM/DNN methods in the future. Hence, excitation source information (in either implicit or explicit form) may be further explored with GMM/DNN methods. The study concludes with a discussion of the potential of excitation source information in the replay detection context, followed by possible future directions using source information.
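
    To make the most common category in the review concrete, the toy model below is a spectrogram-based CNN countermeasure: a small binary classifier (genuine vs. replay) over log-power spectrograms. The architecture and layer sizes are purely illustrative and are not taken from any system surveyed in the review.

```python
# Toy spectrogram-based CNN countermeasure (genuine vs. replay); the layer
# sizes are illustrative, not drawn from any surveyed system.
import torch.nn as nn

class SpectrogramCountermeasure(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 2),                 # two logits: genuine vs. replay
        )

    def forward(self, spec):                  # spec: (batch, 1, freq_bins, time_frames)
        return self.classifier(self.features(spec))
```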

    Exploring data augmentation for Amazigh speech recognition with convolutional neural networks

    Hossam Boulal, Farida Bouroumane, Mohamed Hamidi, Jamal Barkani, ...
    pp. 53-65
    Abstract: In the field of speech recognition, enhancing accuracy is paramount for diverse linguistic communities. Our study addresses this necessity, focusing on improving Amazigh speech recognition through three distinct data augmentation methods: Audio Augmentation, FilterBank Augmentation, and SpecAugment. Leveraging Convolutional Neural Networks (CNNs) for speech recognition, we utilize Mel spectrograms extracted from audio files. The study specifically targets the recognition of the first ten Amazigh digits. We conducted experiments with a speaker-independent approach involving 42 participants. A total of 27 experiments were carried out, using both original and augmented data. Among the different CNN models employed, the VGG19 model showed the most promise. Our results demonstrate a maximum accuracy of 95.66%, and the largest improvement achieved through data augmentation was 4.67%. These findings signify a substantial enhancement in speech recognition accuracy, indicating the efficacy of the proposed methods.
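
    As a sketch of the SpecAugment-style masking applied to Mel spectrograms, the snippet below uses torchaudio's masking transforms. The mask widths are illustrative; the study's exact augmentation parameters, as well as its audio and filter-bank augmentations, are not reproduced here.

```python
# Hedged sketch of SpecAugment-style masking on a Mel spectrogram; mask
# widths are illustrative, not the study's actual settings.
import torch
import torchaudio.transforms as T

def spec_augment(mel_spec: torch.Tensor,
                 freq_mask_param: int = 8,
                 time_mask_param: int = 20) -> torch.Tensor:
    """mel_spec: (channel, n_mels, time) Mel spectrogram."""
    masked = T.FrequencyMasking(freq_mask_param)(mel_spec)  # zero out a band of Mel bins
    masked = T.TimeMasking(time_mask_param)(masked)         # zero out a span of frames
    return masked

# Example: augment a random placeholder spectrogram before CNN training.
augmented = spec_augment(torch.randn(1, 64, 200))
```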
