Research on Automatic Generation of a High-Quality Error Correction Corpus for Deep Learning

Zhang Mei (张梅)¹, Ji Tianxiao (纪天啸)¹

Author information

1. School of Information, North China University of Technology, Beijing 100144, China

Abstract

Learning models driven by massive amounts of data are the current mainstream approach, but they usually need large amounts of labeled data to produce accurate results. Obtaining high-quality labeled data, however, is tedious and time-consuming, and a shortage of labeled data seriously degrades algorithm performance. Taking query strings in information retrieval as an example, this paper explores automatic data labeling. For the Chinese query error correction task, it proposes a novel method for automatically generating a training corpus that combines the strengths of rule-based filtering and recurrent neural networks. The phoneme resource database is constructed in two ways: rule-based and statistics-based. Taking users' input habits and pronunciation characteristics into account, rules are used to simulate a wider variety of input errors, and a neural network language model converts Pinyin into candidate text. In this way a broader range of error categories can be generated, which helps improve the performance of machine learning algorithms. Experimental results with a benchmark sequence-to-sequence (Seq2Seq) error correction model show that the automatically generated corpus trains the model effectively and simulates the errors that occur in real corpora. The work uses query error correction as its example and has not yet been extended to other fields.
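The rule-based part of the pipeline described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the confusion rules in `INITIAL_SWAPS` and `FINAL_SWAPS`, the helper names, and the tiny `TOY_DECODER` table, which merely stands in for the neural network language model that converts Pinyin into text. The statistics-based phoneme resource is not covered.

```python
# Illustrative confusion rules (assumed, not the paper's actual rule set):
# initials and finals that are often mixed up when typing Pinyin.
INITIAL_SWAPS = {"zh": "z", "z": "zh", "ch": "c", "c": "ch",
                 "sh": "s", "s": "sh", "l": "n", "n": "l"}
FINAL_SWAPS = {"an": "ang", "ang": "an", "en": "eng", "eng": "en",
               "in": "ing", "ing": "in"}

# Toy syllable -> character table. It is a stand-in for the neural
# Pinyin-to-text model; a real system would decode with a language model.
TOY_DECODER = {"shan": "山", "san": "三", "shang": "上", "xi": "西"}


def split_syllable(syllable):
    """Split a Pinyin syllable into (initial, final)."""
    for initial in ("zh", "ch", "sh"):
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    if syllable and syllable[0] in "bpmfdtnlgkhjqxrzcsyw":
        return syllable[0], syllable[1:]
    return "", syllable  # syllable with no initial, e.g. "an"


def confusable_syllables(syllable):
    """Return the syllables reachable by one initial or final swap."""
    initial, final = split_syllable(syllable)
    variants = []
    if initial in INITIAL_SWAPS:
        variants.append(INITIAL_SWAPS[initial] + final)
    if final in FINAL_SWAPS:
        variants.append(initial + FINAL_SWAPS[final])
    return variants


def generate_pairs(correct_text, syllables):
    """Build (erroneous, correct) pairs by corrupting one syllable at a time."""
    pairs = []
    for i, syllable in enumerate(syllables):
        for variant in confusable_syllables(syllable):
            noisy = syllables[:i] + [variant] + syllables[i + 1:]
            # Keep only variants the toy decoder can turn back into text.
            if all(s in TOY_DECODER for s in noisy):
                wrong = "".join(TOY_DECODER[s] for s in noisy)
                if wrong != correct_text:
                    pairs.append((wrong, correct_text))
    return pairs


print(generate_pairs("山西", ["shan", "xi"]))
```

Each pair couples a simulated erroneous query with its correct form, which is the shape of training data a Seq2Seq correction model consumes. Corrupting one syllable at a time keeps the sketch deterministic; sampling several corruptions per query would be a natural extension.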

Keywords

automatic corpus construction / query error correction / data labeling / deep learning

Publication year

2024

Journal of North China University of Technology

North China University of Technology

Impact factor: 0.368

ISSN: 1001-5477