Chinese Physics B (English edition), 2024, Vol. 33, Issue 5: 117-124. DOI: 10.1088/1674-1056/ad3c30

Literature classification and its applications in condensed matter physics and materials science by natural language processing

吴思远 (1), 朱天念 (1), 涂思佳 (2), 肖睿娟 (3), 袁洁 (3), 吴泉生 (3), 李泓 (4), 翁红明 (3)

Author information

  • 1. Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China; School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100190, China; Condensed Matter Physics Data Center of Chinese Academy of Sciences, Beijing 100190, China
  • 2. Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China; College of Materials Science and Optoelectronic Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • 3. Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China; Condensed Matter Physics Data Center of Chinese Academy of Sciences, Beijing 100190, China
  • 4. Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China

Abstract

The exponential growth of the literature is constraining researchers' access to comprehensive information in related fields. While natural language processing (NLP) may offer an effective solution to literature classification, it remains hindered by the lack of labelled datasets. In this article, we introduce a novel method for generating literature classification models through semi-supervised learning, which can generate labelled datasets iteratively with limited human input. We apply this method to train NLP models for classifying literature related to several research directions, i.e., battery, superconductor, topological material, and artificial intelligence (AI) in materials science. The trained NLP 'battery' model, applied to a larger dataset different from the training and testing datasets, achieves an F1 score of 0.738, which indicates the accuracy and reliability of this scheme. Furthermore, our approach demonstrates that, even with insufficient data, the not-yet-well-trained models in the first few cycles can identify the relationships among different research fields and facilitate the discovery and understanding of interdisciplinary directions.

Key words

natural language processing; text mining; materials science

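Funding

Informatization Plan of Chinese Academy of Sciences (CAS-WX2021SF-0102)

National Key Research and Development Program of China (2022YFA1603903)

National Key Research and Development Program of China (2022YFA1403800)

National Key Research and Development Program of China (2021YFA0718700)

National Natural Science Foundation of China (11925408)

National Natural Science Foundation of China (11921004)

National Natural Science Foundation of China (12188101)

Chinese Academy of Sciences project (XDB33000000)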

Publication year

2024
Chinese Physics B (English edition)
Chinese Physical Society and Institute of Physics, Chinese Academy of Sciences

CSTPCD, EI
Impact factor: 0.995
ISSN: 1674-1056
Reference count: 36