Research on the Construction of Mongolian Speech Recognition Corpus Based on Speech-Text Alignment Technology

The speech recognition data currently available for Mongolian is far smaller in scale than the training data available for English and Chinese. A low-cost method for building datasets is therefore needed to close this gap. Everyday communication has already produced a large amount of Mongolian language material, much of it in the form of speech loosely paired with text. This study extracts training-ready data from such raw corpora, taking TV drama dubbing scripts and the corresponding finished productions as a sample, and treats the extraction as a speech-text alignment problem. A series of automated processing steps converts the scripts and the corresponding audio into a form suitable for speech-text alignment, and an iterative alignment method then produces the alignment results, from which sentence-level aligned "speech-text pairs" for Mongolian speech recognition are generated. A sampling check shows that the generated data is of good quality and largely consistent with manual annotation, which reduces the cost of data production.

Keywords: speech recognition; Mongolian; raw corpus; speech-text alignment

Zhen Zhaobo, Zhang Hui

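National and Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, Hohhot, Inner Mongolia 010020

Inner Mongolia Key Laboratory of Mongolian Information Processing Technology, Hohhot, Inner Mongolia 010020

College of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia 010021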

2024

Journal of Minzu University of China (Natural Sciences Edition)
Minzu University of China

Impact factor: 0.462
ISSN: 1005-8036
Year, volume (issue): 2024, 33(1)