Research on the Differences and Recognition between Scholar Writing and AI-Generated Content: Taking the Research Field of Library Health Services as an Example
Pan Xuefeng (潘雪峰) 1, Wang Chao (王超) 1
Author information
- 1. Library of Liaoning University of Technology, Jinzhou, Liaoning 121000, China
Abstract
To empirically analyze the characteristics of, and differences between, abstracts written by scholars and abstracts generated by GPT-4 in library health services research, this study selected 185 published academic papers on library health services. For each paper title, GPT-4 was prompted to generate a corresponding abstract, and a dataset was constructed. The abstracts were segmented with HanLP 2.1 and vectorized with TF-IDF; the data were then cleaned using six feature-selection methods and six dimensionality-reduction methods. Thirteen machine learning methods were evaluated in turn, and the results were analyzed, including at the level of text content. The findings are as follows. With dimensionality reduction applied, the LightGBM classifier could completely distinguish whether an abstract was written by a scholar or generated by GPT-4. In character count, word count, and sentence count, the scholar-written and GPT-4-generated abstracts were basically consistent. In the topic model analysis, the similarity between the two reached 50%, indicating a certain degree of similarity between scholar writing and GPT-4 generation. Machine learning algorithms show potential for distinguishing AI-generated content from scholar-written content, but the two clearly resemble each other in form rather than in spirit. Scholars should pay attention to the accuracy, authenticity, and logical soundness of the language of AI-generated content, and use AI tools with caution.
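The TF-IDF vectorization step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which the record does not describe in detail. It assumes the abstracts are already segmented into tokens (the paper uses HanLP 2.1 for this; the English tokens below are hypothetical stand-ins), and the function names `tfidf_vectors` and `cosine` are illustrative. The feature-selection, dimensionality-reduction, and classification steps are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of pre-tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, so no term gets zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical token lists standing in for segmented abstracts.
docs = [
    ["library", "health", "service", "survey", "readers"],      # scholar-written (assumed)
    ["library", "health", "service", "framework", "strategy"],  # GPT-4-generated (assumed)
]
vocab, vectors = tfidf_vectors(docs)
print(len(vocab), cosine(vectors[0], vectors[1]))
```

In a pipeline like the one the abstract describes, vectors of this kind would then be passed to feature selection, dimensionality reduction, and a classifier such as LightGBM.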
Keywords
GPT-4 / abstract / text classification / text features / library
Funding
Youth Project of the Library Society of Liaoning Province (2023tsgxhqnkt-003)
Publication year
2024