Minimizing Speech Datasets Based on Word Coverage Rate
To address the high collection and training costs of building high-performance automatic speech recognition models, a method based on word coverage is proposed to minimize the size of the training set. The method introduces the vector space model, mapping all corpus texts into a high-dimensional space and selecting the text data with the lowest similarity by computing the cosine distance between vectors. Audio is then collected for the selected texts, so that the best recognition performance is achieved with as little audio data as possible. Finally, the Hamming overlap is used to calculate the amount of newly added vocabulary and evaluate each text's contribution, in order to optimize the cosine-distance selection method. Experiments show that, compared with random selection of the speech training set, the proposed method achieves the same word coverage while saving 21.31% of the training data, and that there is a strong positive correlation between the word coverage of the training set and the inference performance of the model trained on it. This demonstrates that the method can effectively reduce the collection and training costs of speech training sets while maintaining comparable inference performance, thereby promoting the further development of automatic speech recognition technology.
Keywords: automatic speech recognition; vector space model; cosine distance; Hamming weight; training set minimization
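The selection procedure described in the abstract can be sketched as follows. This is a minimal illustrative reading, not the paper's exact implementation: texts are mapped to bag-of-words vectors, a greedy loop picks the text least similar (by cosine) to those already selected, word coverage tracks how much of the corpus vocabulary the selection reaches, and a Hamming-style count of not-yet-covered words estimates a candidate's contribution. All function names and the tokenization scheme here are assumptions.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector as a word -> count mapping (assumed tokenization)."""
    return Counter(text.lower().split())

def cosine_sim(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_texts(corpus, k):
    """Greedy low-similarity selection: start from the text with the
    largest vocabulary, then repeatedly add the candidate whose maximum
    cosine similarity to the selected set is lowest (the most novel text)."""
    vecs = [vectorize(t) for t in corpus]
    remaining = list(range(len(corpus)))
    start = max(remaining, key=lambda i: len(vecs[i]))
    chosen = [start]
    remaining.remove(start)
    while remaining and len(chosen) < k:
        best = min(remaining,
                   key=lambda i: max(cosine_sim(vecs[i], vecs[j]) for j in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen

def word_coverage(chosen, corpus):
    """Fraction of the corpus vocabulary covered by the selected texts."""
    vocab = set(w for t in corpus for w in t.lower().split())
    covered = set(w for i in chosen for w in corpus[i].lower().split())
    return len(covered) / len(vocab)

def new_word_contribution(candidate, selected_vocab):
    """Hamming-style contribution: number of candidate words not yet
    present in the selected vocabulary (one possible reading of the
    paper's Hamming-overlap criterion)."""
    return len(set(candidate.lower().split()) - selected_vocab)
```

On a toy corpus, the greedy loop favors texts that share few words with those already chosen, which is exactly what drives word coverage up faster than random selection.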