Speech segmentation method based on speech semantic guidance
[Objective] Speech segmentation splits an audio stream or a long recording into shorter segments and is a crucial step in speech translation. Proper segmentation keeps each segment semantically complete, so that the speech translation model can attend to the full context of each sentence and produce better translations. [Methods] We propose a speech segmentation method based on speech semantic guidance. A HuBERT-based frame classifier estimates, for each audio frame, the probability that it is speech or non-speech, and the ipDAC algorithm then uses these probabilities to recursively partition the audio into the desired segments. [Results] Compared with existing methods, the proposed method improves the BLEU score by 0.6 points on the MuST-C En-Vi translation dataset. [Conclusions] A comparative analysis of various segmentation techniques shows that the proposed approach effectively reduces the performance degradation of the speech translation model during decoding.
Keywords: speech translation; speech segmentation; HuBERT pre-trained model
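
The recursive partitioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' ipDAC algorithm, whose details are not given here. It is a generic divide-and-conquer split that assumes per-frame speech probabilities are already available (for example from a frame classifier) and cuts each over-long span at its least speech-like frame. The function name `split_segments` and the parameters `max_len` and `min_len` (both measured in frames) are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_segments(
    probs: Sequence[float], max_len: int, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Split frames [0, len(probs)) into half-open segments of at most max_len frames.

    probs[i] is the assumed probability that frame i is speech. Any span
    longer than max_len is cut at the frame with the lowest speech
    probability, and both halves are processed recursively.
    """
    if min_len < 1 or max_len < 2 * min_len:
        raise ValueError("need min_len >= 1 and max_len >= 2 * min_len")

    segments: List[Tuple[int, int]] = []

    def recurse(start: int, end: int) -> None:
        if end - start <= max_len:
            segments.append((start, end))
            return
        # Candidate cut points leave at least min_len frames on each side.
        lo, hi = start + min_len, end - min_len
        cut = min(range(lo, hi), key=lambda i: probs[i])
        recurse(start, cut)
        recurse(cut, end)

    recurse(0, len(probs))
    return segments


if __name__ == "__main__":
    # Toy probabilities: frames 2 and 6 look like pauses.
    toy = [0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.05, 0.9, 0.9, 0.9]
    print(split_segments(toy, max_len=4))
```

Cutting at the lowest-probability frame tends to place boundaries in pauses rather than inside words, which is the intuition behind probability-guided segmentation. A real system would also convert frame indices to timestamps and might trim non-speech frames at segment edges.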