[Objective] This study explores multidimensional feature models for the automated detection of articles produced by "paper mills". It aims to support research integrity governance and quality control in academic publishing in China. [Methods] We retrieved retraction records and associated data on "paper mill" articles from websites such as Retraction Watch to construct the first open dataset for training and evaluating automated paper mill detection models. We developed a classification model for "paper mill" papers (RWTA-Model) using a text random walk strategy and a text attention mechanism. We modeled 33 grammatical features of "paper mill" papers. Finally, we used the SHAP method to identify significant features automatically. [Results] The F1 scores based on title structure features, abstract structure features, and main text structure features reached 0.7669, 0.8423, and 0.8480, respectively. On all three types of article structure data, the proposed method outperformed the baseline methods it was compared against, and the analysis identified 12 significant grammatical features. [Limitations] The dataset supporting feature construction primarily covers the biomedical field, which presents a potential risk of domain bias. [Conclusions] The classification model built on title, abstract, and main text structures, together with the 33-dimensional automatic detection feature model, can effectively identify "paper mill" papers and uncover their multidimensional features, supporting the automated detection of papers from paper mills.
Paper Mills; Research Integrity; Deep Learning; Natural Language Processing