
Chinese Text Data Augmentation Technology Based on Online Translation

Data augmentation is a common method in few-shot learning. For text data, a widely used augmentation approach is back translation: a neural machine translation model translates the data into an intermediate language and then back into the original language. However, because publicly available parallel corpora are limited in quantity and quality, it is difficult for individual researchers to train a neural translation model that meets their requirements. To remove back translation's dependence on parallel corpora, this paper proposes a text data augmentation technique based on online translation. Taking Baidu Translate as an example, the paper studies the gains brought by different intermediate languages and the most suitable augmentation multiple for different data sizes, and uses visualization to examine whether the augmented data keep valid labels. Experiments show that the technique yields consistent improvements on four Chinese classification tasks, with larger improvements on small datasets. On average, augmentation raises the F1 score by more than 5%. The paper also points out shortcomings in how the benefits of data augmentation have previously been evaluated and proposes an improved evaluation setup.
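The back-translation loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Translator` signature, the function names `back_translate` and `augment`, and the removal of duplicate outputs are all assumptions made for illustration. The call to an online service such as Baidu Translate is deliberately replaced by a toy lookup table, since the abstract does not describe the request format.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: translate(text, src_lang, dst_lang) -> translated text.
# In practice this would wrap a call to an online translation service.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, pivot: str, translate: Translator,
                   src: str = "zh") -> str:
    """Translate src -> pivot -> src and return the round-tripped text."""
    intermediate = translate(text, src, pivot)
    return translate(intermediate, pivot, src)


def augment(samples: Iterable[Tuple[str, str]], pivots: List[str],
            translate: Translator, src: str = "zh") -> List[Tuple[str, str]]:
    """Return each original sample followed by its back-translated copies.

    One copy is attempted per pivot language, so len(pivots) bounds the
    augmentation multiple. Each copy inherits the label of its source
    sample. Copies identical to the original or to an earlier copy are
    dropped (an assumption of this sketch, not a step from the paper).
    """
    out: List[Tuple[str, str]] = []
    for text, label in samples:
        out.append((text, label))
        seen = {text}
        for pivot in pivots:
            candidate = back_translate(text, pivot, translate, src)
            if candidate not in seen:
                seen.add(candidate)
                out.append((candidate, label))
    return out


# Toy stand-in for an online translator: a lookup table that falls back
# to returning the input unchanged when no entry exists.
_TOY_TABLE = {
    ("这部电影很好看", "zh", "en"): "This movie is very good",
    ("This movie is very good", "en", "zh"): "这部电影非常好",
}


def toy_translate(text: str, src: str, dst: str) -> str:
    return _TOY_TABLE.get((text, src, dst), text)


if __name__ == "__main__":
    data = [("这部电影很好看", "pos")]
    for text, label in augment(data, ["en", "ja"], toy_translate):
        print(label, text)
```

Passing the translator as a parameter keeps the augmentation logic independent of any particular service, so different intermediate languages and augmentation multiples can be compared by changing only the `pivots` list.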

data augmentation; natural language processing; back translation; text classification

Wang Xiaotian (王小天), Xi Caiping (奚彩萍)


School of Electronics and Information, Jiangsu University of Science and Technology, Zhenjiang 212000


2024

Computer & Digital Engineering
No. 709 Research Institute, China Shipbuilding Industry Corporation


CSTPCD
Impact factor: 0.355
ISSN:1672-9722
Year, volume (issue): 2024, 52(3)