
A data augmentation method built on the GPT-2 model

Abstract: Sentence classification tasks often face insufficient training data; moreover, because text is discrete, it is difficult to perform data augmentation while preserving semantics, and semantic consistency and diversity are hard to balance. To address these issues, this paper proposes a data augmentation method based on a penalized generative pre-trained language model (punishing generative pre-trained transformer for data augmentation, PunishGPT-DA). A penalty term and a hyperparameter α are designed to act together with the negative log-likelihood loss to fine-tune GPT-2 (generative pre-training 2.0), encouraging the model to attend to outputs that have small predicted probabilities but are still reasonable. A filter based on BERT (bidirectional encoder representation from transformers) is then used to discard generated samples with significant semantic deviation. The method achieves a 16-fold expansion of the training set and, compared with GPT-2, improves accuracy by 1.1%, 4.9%, and 8.7% on intent recognition, question classification, and sentiment analysis, respectively. Experimental results demonstrate that the proposed method can effectively balance the requirements of semantic consistency and diversity, improving the training performance of downstream task models.
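The abstract does not spell out the exact form of the penalized objective or of the BERT filter, so the sketch below is only one plausible reading, not the paper's actual implementation: the entropy-style confidence penalty scaled by α, the mean-pooled bert-base-chinese embeddings, and the 0.8 similarity threshold are all illustrative assumptions.

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

def punished_lm_loss(logits, labels, alpha=0.1):
    # Standard causal-LM shift: positions < n predict token n.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    nll = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                          shift_labels.view(-1))
    # Assumed penalty form: punish over-confident (low-entropy) predictions
    # so probability mass also flows to rarer but still plausible tokens;
    # alpha trades semantic consistency against diversity.
    log_probs = F.log_softmax(shift_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return nll - alpha * entropy

def bert_filter(seed, candidates, threshold=0.8):
    # Keep generated sentences whose BERT embedding stays close to the
    # seed sentence; mean pooling and the threshold are illustrative.
    tok = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese").eval()
    def embed(text):
        inputs = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state
        return hidden.mean(dim=1).squeeze(0)
    ref = embed(seed)
    return [c for c in candidates
            if F.cosine_similarity(ref, embed(c), dim=0).item() >= threshold]

Under this reading, α directly controls the consistency/diversity trade-off the abstract describes: a larger α flattens the output distribution and yields more varied generations, while the filter bounds how far the kept samples may drift semantically from the seed sentence.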

Keywords: natural language processing; artificial intelligence; data augmentation; sentence classification; few samples; sequence to sequence; generative pre-trained language model; bidirectional encoder representation from Transformers

Authors: ZHANG Xiaochuan, CHEN Panpan, XING Xinlai, YANG Changmeng, TENG Da

Affiliation: Liangjiang Artificial Intelligence College, Chongqing University of Technology, Chongqing 401135, China

Funding: National Natural Science Foundation of China (61702063); Chongqing Technology Innovation and Application Development Special Project (cstc2021jscxdxwtBX0019)

Journal: CAAI Transactions on Intelligent Systems (智能系统学报)

Sponsors: Chinese Association for Artificial Intelligence; Harbin Engineering University

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.672
ISSN: 1673-4785
Year, volume (issue): 2024, 19(1)