A Dataset for Automatically Detecting ChatGPT-Generated Open-domain Texts
Large language models such as ChatGPT have exhibited impressive capabilities in language comprehension, generation, and knowledge reasoning. Improving the automatic detection of ChatGPT-generated open-domain texts requires a high-quality dataset. Currently available ChatGPT detection datasets are relatively small in scale and cover a limited range of linguistic styles. This paper introduces a multi-style ChatGPT detection dataset with the following characteristics: (1) Large-scale, comprising nearly 180,000 human-written texts and an equivalent number of ChatGPT-generated texts; (2) Bilingual, including both English and Chinese texts; (3) Diverse text styles, spanning open-domain texts from formal to informal, i.e., news texts, social media texts, and user comments; (4) Variable text length, ranging from short texts of a few characters to long texts of thousands of characters. Finally, this paper performs a linguistic analysis of the proposed dataset and evaluates current baseline methods.