"论文工厂"的自动检测特征模型研究

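Abstract (translated from Chinese): [Objective] To explore a feature model for automatically detecting "paper mill" papers and to build a tool that automatically identifies such papers from multiple dimensions, providing important support for research integrity governance and quality control of academic publishing in China. [Methods] Retraction records and associated data resources for "paper mill" papers were collected from websites such as Retraction Watch to build the first public dataset for training and evaluating automated "paper mill" detection models. A "paper mill" paper classification model (RWTA-Model) was built on a text random walk strategy and a text attention mechanism, 33 "paper mill" grammatical features were modeled, and the SHAP method was used to mine significant features automatically. [Results] The F1 scores based on title structure features, abstract structure features, and main text structure features reached 0.7669, 0.8423, and 0.8480, respectively. On all three types of article structure data, the proposed method achieved the best results against multiple baseline methods, and 12 significant grammatical features were identified. [Limitations] The dataset supporting feature construction is concentrated in the biomedical field, which poses a potential risk of domain bias. [Conclusions] The classification models built on the three dimensions of title, abstract, and main text structure, together with the 33-dimensional automatic detection feature model, can effectively identify "paper mill" papers and mine multidimensional features, supporting the automated detection of "paper mill" papers.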

English title: Automatic Detection Model for "Paper Mills"

English abstract: [Objective] This study explores feature models for the automated detection of articles by "paper mills" across multiple dimensions. It aims to provide critical support for the governance of research integrity and quality control of academic publishing in China. [Methods] We retrieved retraction records and associated data resources of "paper mill" articles from websites like Retraction Watch to construct the first open dataset for training and evaluating automated detection models for paper mills. We developed a classification model for "paper mill" papers (RWTA-Model) using a text random walk strategy and a text attention mechanism. We modeled 33 grammatical features of "paper mills". Finally, we used the SHAP method to identify significant features automatically. [Results] The F1 scores based on title structure features, abstract structure features, and main text structure features reached 0.7669, 0.8423, and 0.8480, respectively. For the three types of article structure data, the proposed method achieved the best results when compared to various baseline methods and identified 12 significant grammatical features. [Limitations] The dataset supporting feature construction primarily covers the biomedical field, presenting a potential risk of domain bias. [Conclusions] The constructed classification models based on title, abstract, and main text structures, and the 33-dimensional automatic detection feature model, can effectively identify "paper mill" papers and uncover multidimensional features, supporting the automated detection of papers from paper mills.
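
For readers unfamiliar with the metric, the F1 scores reported in the abstract are the harmonic mean of precision and recall for a binary classifier. The following is a minimal sketch of that calculation only; the function name and the example counts are hypothetical and are not taken from the paper or its dataset.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives, and false negatives.

    Here a "positive" would be a paper flagged as a paper-mill product.
    Returns 0.0 when precision and recall are both undefined or zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper).
example = f1_score(tp=80, fp=20, fn=20)
print(round(example, 4))
```

Because F1 penalizes an imbalance between precision and recall, it is a common choice when the positive class (here, retracted paper-mill papers) is the one of interest.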

English keywords:

Paper Mills; Research Integrity; Deep Learning; Natural Language Processing

Authors:

Hu Tianyi (胡天翼), Liu Jianhua (刘建华), E Haihong (鄂海红), Ding Junpeng (丁峻鹏), Qiao Xiaodong (乔晓东)

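Author affiliations:

School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876

Beijing Wanfang Data Co., Ltd., Beijing 100080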

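Keywords (translated from Chinese):

Paper mills (论文工厂); research integrity (科研诚信); deep learning (深度学习); natural language processing (自然语言处理)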

Publication year:

2024

DOI:

10.11925/infotech.2096-3467.2023.0937

Data Analysis and Knowledge Discovery (数据分析与知识发现)

National Science Library, Chinese Academy of Sciences (中国科学院文献情报中心)

CSTPCD; CSSCI; CHSSCD; Peking University Core Journals (北大核心); EI

Impact factor: 1.452

ISSN: 2096-3467

Year, volume (issue): 2024, 8(10)
