Automatic Generation of High Quality Error Correction Corpus for Deep Learning
Data-driven learning algorithms trained on large datasets are currently a research hotspot, but they often require large amounts of labeled data, which are difficult to find. Obtaining high-quality annotated data is a tedious and time-consuming process, and a lack of sufficient annotated data can seriously degrade an algorithm's performance. Taking as an example the automatic generation of training data for query error correction, which is crucial for information retrieval, this paper proposes a training data generation method based on rules, statistics, and a recurrent neural network. A phoneme resource database is built by combining rule-based and statistics-based phoneme resource construction. User input errors are then simulated according to users' input habits and pronunciation characteristics. A neural network language model converts Pinyin into possible text, which yields a wider range of error categories. Experimental results show that the automatically generated corpus trains the benchmark Seq2Seq error correction model well: the generated corpus can simulate a real corpus and can effectively train the model. The method is demonstrated only on query error correction and has not yet been extended to other fields.
automatic construction of corpus; query error correction; data labeling; deep learning
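
The error simulation step described in the abstract can be illustrated with a minimal sketch. It assumes a small, hand-written table of commonly confused Pinyin initials and finals in place of the phoneme resource database; the table contents and the names `FUZZY_INITIALS`, `FUZZY_FINALS`, `split_syllable`, `confuse_syllable`, and `generate_pair` are hypothetical and are not taken from the paper. The neural language model that converts Pinyin back into text is omitted.

```python
import random

# Hypothetical confusion tables (illustrative only, not the paper's resource):
# pairs of Pinyin initials and finals that are commonly confused.
FUZZY_INITIALS = {"zh": "z", "z": "zh", "ch": "c", "c": "ch",
                  "sh": "s", "s": "sh", "n": "l", "l": "n"}
FUZZY_FINALS = {"in": "ing", "ing": "in", "en": "eng", "eng": "en",
                "an": "ang", "ang": "an"}


def split_syllable(syllable):
    """Split a toneless Pinyin syllable into (initial, final)."""
    for initial in ("zh", "ch", "sh"):
        if syllable.startswith(initial):
            return initial, syllable[2:]
    if syllable and syllable[0] in "bpmfdtnlgkhjqxrzcsyw":
        return syllable[0], syllable[1:]
    return "", syllable  # zero-initial syllable such as "an"


def confuse_syllable(syllable):
    """Return every variant that differs by one confusable initial or final."""
    initial, final = split_syllable(syllable)
    variants = []
    if initial in FUZZY_INITIALS:
        variants.append(FUZZY_INITIALS[initial] + final)
    if final in FUZZY_FINALS:
        variants.append(initial + FUZZY_FINALS[final])
    return variants


def generate_pair(syllables, rng):
    """Corrupt one randomly chosen syllable; return (noisy, clean) or None."""
    candidates = [(i, v) for i, s in enumerate(syllables)
                  for v in confuse_syllable(s)]
    if not candidates:
        return None  # nothing in this query is confusable under the tables
    index, variant = rng.choice(candidates)
    noisy = list(syllables)
    noisy[index] = variant
    return noisy, list(syllables)


if __name__ == "__main__":
    rng = random.Random(0)
    print(generate_pair(["zhang", "san"], rng))
```

Each returned pair holds a corrupted Pinyin sequence and its clean source, which is the shape of training example a sequence-to-sequence correction model consumes. In the pipeline the abstract describes, the corrupted Pinyin would additionally be converted into text by a language model before being paired with the original query.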