A Dataset for Automatically Detecting ChatGPT-Generated Open-Domain Texts

Xu Kang (徐康)¹, Hui Zhilei (惠志磊)¹, Dong Zhenjiang (董振江)¹, Cai Peihan (蔡霈涵)¹, Lu Liqun (陆立群)¹

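Author information

1. School of Computer Science, School of Software, and School of Cyberspace Security, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210042, China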

Abstract

In recent years, large language models such as ChatGPT have exhibited impressive capabilities in language comprehension, generation, and knowledge reasoning, but they also suffer from problems such as hallucination and content plagiarism. Automatically detecting ChatGPT-generated open-domain texts requires high-quality datasets. Currently available datasets for this task are relatively small in scale and cover a limited range of linguistic styles. This paper introduces a diverse ChatGPT detection dataset with the following characteristics: (1) large scale, mainly comprising nearly 180,000 human-written texts and an equal number of ChatGPT-generated texts; (2) bilingual, including both English and Chinese texts; (3) diverse text styles, with open-domain texts ranging from formal to colloquial, including news, social media texts, and user comments; (4) variable text length, from very short texts of a few characters to long texts of thousands of characters. Finally, the paper performs a linguistic analysis of the proposed dataset and evaluates current mainstream baseline methods.

Key words

ChatGPT/text generation/text classification/dataset/open-domain

Publication year

2024

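Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences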

Indexed in: CSTPCD, CSCD, CHSSCD, Peking University Core Journals (北大核心)

Impact factor: 0.8

ISSN: 1003-0077
