Chinese Text Data Augmentation Technology Based on Online Translation
Data augmentation is a common method in few-shot learning. For text data, a common augmentation approach is back translation: a neural translator translates the data into an intermediate language and then back into the original language. However, because open parallel corpora are limited in quantity and quality, it is difficult for individual researchers to train adequate neural translators. To remove back translation's dependence on parallel corpora, this paper proposes a text data augmentation technique based on online translation. Taking Baidu Translate as an example, the paper studies the benefits of different intermediate languages and the most suitable augmentation multiple under different data scenarios, and it examines the label validity of the augmented data through visualization. Experiments show that Chinese text data augmentation based on online translation achieves consistent improvement across four Chinese classification tasks, and that the improvement is more pronounced on small data sets. On average, augmentation increases F1 by more than 5%. The paper also points out that the way previous work evaluated the benefits of data augmentation is unreasonable, and it proposes an improved evaluation setup.
data augmentation; natural language processing; back translation; text classification
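The back-translation procedure summarized in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Translator` signature, the `back_translate` and `augment` helpers, and the `toy_translate` lookup table are all hypothetical names introduced here, and the toy table stands in for an online service such as Baidu Translate, whose actual API is not shown. The `multiple` parameter is one possible reading of the "augmentation multiple": the maximum number of copies kept per original sample, counting the original.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: translate(text, source_lang, target_lang) -> str.
# In practice this would wrap a call to an online translation service.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator,
                   pivot: str, source: str = "zh") -> str:
    """Translate text into the pivot language and back into the source."""
    intermediate = translate(text, source, pivot)
    return translate(intermediate, pivot, source)


def augment(dataset: Sequence[Tuple[str, int]], translate: Translator,
            pivots: Sequence[str], multiple: int) -> List[Tuple[str, int]]:
    """Return the dataset plus back-translated variants.

    Each variant inherits the label of its original sample. At most
    `multiple` copies (original included) are kept per sample, and
    variants identical to an earlier copy are dropped.
    """
    augmented: List[Tuple[str, int]] = []
    for text, label in dataset:
        augmented.append((text, label))
        seen = {text}
        for pivot in pivots:
            if len(seen) >= multiple:
                break
            candidate = back_translate(text, translate, pivot)
            if candidate not in seen:
                seen.add(candidate)
                augmented.append((candidate, label))
    return augmented


# Toy stand-in for an online translator (assumption, for illustration only).
_TOY_TABLE = {
    ("这部电影很好看", "zh", "en"): "This movie is great",
    ("This movie is great", "en", "zh"): "这部电影很棒",
    ("这部电影很好看", "zh", "ja"): "この映画は面白い",
    ("この映画は面白い", "ja", "zh"): "这部电影很有趣",
}


def toy_translate(text: str, source: str, target: str) -> str:
    # Unknown inputs are returned unchanged.
    return _TOY_TABLE.get((text, source, target), text)


if __name__ == "__main__":
    data = [("这部电影很好看", 1)]
    for sample in augment(data, toy_translate, pivots=["en", "ja"], multiple=3):
        print(sample)
```

Passing the translator as a callable keeps the augmentation logic independent of any particular service, so different intermediate languages and augmentation multiples can be compared by changing only `pivots` and `multiple`.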