Choosing a corpus rich in phonetic phenomena is key to improving the performance of speech recognition. To construct the text corpus for a Kyrgyz speech recognition system, noise in the text is first removed by pre-processing, and the Kyrgyz text is converted into Latin script by a text conversion algorithm. Second, based on the syllable structure and rules of the Kyrgyz language, a heuristic function and two optimization algorithms for automatically selecting sentences are proposed. Finally, to verify the effectiveness of the algorithms, two sentence sets of different sizes are used as experimental corpora, both algorithms are used to generate optimized sentence sets, and statistics are computed on the corpora that each algorithm generates. The experimental results show that the tri-phone coverage of the text selected by Algorithm 2 reaches 78.70%, which can meet the needs of a speech recognition system and verifies the effectiveness of the proposed algorithm.
Key words
Tri-phone/Speech recognition/Corpus/Kyrgyz language
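
The sentence-selection idea summarized in the abstract can be illustrated with a minimal greedy sketch. It is not the paper's Algorithm 1 or Algorithm 2, whose details are not given here. It rests on several assumptions: character trigrams within each Latin-script word stand in for tri-phones, the heuristic score is the number of not-yet-covered units divided by sentence length, and the names `triphones`, `select_sentences`, and `budget` are hypothetical.

```python
from typing import List, Set, Tuple


def triphones(sentence: str) -> Set[str]:
    """Return within-word character trigrams, used here as a stand-in for tri-phones."""
    units: Set[str] = set()
    for word in sentence.lower().split():
        # Keep letters only, so punctuation does not create spurious units.
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - 2):
            units.add(letters[i:i + 3])
    return units


def select_sentences(sentences: List[str], budget: int) -> Tuple[List[str], float]:
    """Greedily pick up to `budget` sentences and return them with the coverage rate.

    Coverage is measured against the units found in the input sentences,
    not against a full phonetic inventory of the language.
    """
    candidates = [(s, triphones(s)) for s in sentences]
    all_units: Set[str] = set()
    for _, units in candidates:
        all_units |= units

    covered: Set[str] = set()
    chosen: List[str] = []
    while candidates and len(chosen) < budget:
        # Assumed heuristic: new units gained per character of text.
        best = max(candidates, key=lambda item: len(item[1] - covered) / max(len(item[0]), 1))
        if not best[1] - covered:
            break  # No remaining sentence adds a new unit.
        chosen.append(best[0])
        covered |= best[1]
        candidates.remove(best)

    coverage = len(covered) / len(all_units) if all_units else 0.0
    return chosen, coverage


if __name__ == "__main__":
    corpus = ["salam dostor", "salam", "kitep okuu"]
    picked, rate = select_sentences(corpus, budget=2)
    print(picked, f"{rate:.2%}")
```

Dividing by sentence length favors short sentences that still add many new units, which keeps the recording script compact. A real system would replace the trigram proxy with a phone-level transcription derived from the language's syllable rules.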