
A data augmentation method built on the GPT-2 model

Abstract: Sentence classification tasks often face insufficient training data; moreover, because text is discrete, it is difficult to perform data augmentation while preserving semantics, and semantic consistency and diversity are hard to balance. To address these issues, this paper proposes a data augmentation method based on a penalized generative pre-trained language model (punishing generative pre-trained transformer for data augmentation, PunishGPT-DA). A penalty term and a hyperparameter α are designed to act together with the negative log-likelihood loss to fine-tune GPT-2 (generative pre-training 2.0), encouraging the model to attend to outputs that have small predicted probabilities but are still reasonable. A filter based on BERT (bidirectional encoder representation from transformers) is then used to discard generated samples with significant semantic deviation. The method achieves a 16-fold expansion of the training set and, compared with GPT-2, improves accuracy by 1.1%, 4.9%, and 8.7% on intent recognition, question classification, and sentiment analysis, respectively. Experimental results demonstrate that the proposed method can effectively balance the requirements of semantic consistency and diversity, improving the training performance of downstream task models.
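The abstract does not spell out the exact form of the penalized objective or of the BERT filter, so the sketch below is only one plausible reading, not the paper's actual implementation: the entropy-style confidence penalty scaled by α, the mean-pooled bert-base-chinese embeddings, and the 0.8 similarity threshold are all illustrative assumptions.

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

def punished_lm_loss(logits, labels, alpha=0.1):
    # Standard causal-LM shift: positions < n predict token n.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    nll = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                          shift_labels.view(-1))
    # Assumed penalty form: punish over-confident (low-entropy) predictions
    # so probability mass also flows to rarer but still plausible tokens;
    # alpha trades semantic consistency against diversity.
    log_probs = F.log_softmax(shift_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return nll - alpha * entropy

def bert_filter(seed, candidates, threshold=0.8):
    # Keep generated sentences whose BERT embedding stays close to the
    # seed sentence; mean pooling and the threshold are illustrative.
    tok = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese").eval()
    def embed(text):
        inputs = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state
        return hidden.mean(dim=1).squeeze(0)
    ref = embed(seed)
    return [c for c in candidates
            if F.cosine_similarity(ref, embed(c), dim=0).item() >= threshold]

Under this reading, α directly controls the consistency/diversity trade-off the abstract describes: a larger α flattens the output distribution and yields more varied generations, while the filter bounds how far the kept samples may drift semantically from the seed sentence.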

Keywords: natural language processing; artificial intelligence; data augmentation; sentence classification; few samples; sequence to sequence; generative pre-trained language model; bidirectional encoder representation from Transformers

Authors: ZHANG Xiaochuan, CHEN Panpan, XING Xinlai, YANG Changmeng, TENG Da

Affiliation: Liangjiang Artificial Intelligence College, Chongqing University of Technology, Chongqing 401135, China

Funding: National Natural Science Foundation of China (61702063); Chongqing Technology Innovation and Application Development Special Project (cstc2021jscxdxwtBX0019)

Journal: CAAI Transactions on Intelligent Systems (智能系统学报)

Sponsors: Chinese Association for Artificial Intelligence; Harbin Engineering University

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.672
ISSN: 1673-4785
Year, volume (issue): 2024, 19(1)