The Large Language Model for Scientific Literature: Method, Framework, and Application

Qian Li (钱力)¹, Zhang Zhixiong (张智雄)², Wu Dayong (伍大勇)³, Chang Zhijun (常志军)⁴, Yu Qianqian (于倩倩)⁵, Hu Maodi (胡懋地)⁶, Liu Yi (刘熠)⁶

Author information

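1. Department of Data Resources, National Science Library, Chinese Academy of Sciences; Department of Information Resources Management, University of Chinese Academy of Sciences; Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, National Press and Publication Administration, Beijing 100190
2. National Science Library, Chinese Academy of Sciences; Department of Information Resources Management, University of Chinese Academy of Sciences, Beijing 100190
3. iFLYTEK AI Research Institute; iFLYTEK Beijing Research Institute, Beijing 102629
4. Department of Data Resources, National Science Library, Chinese Academy of Sciences; Department of Information Resources Management, University of Chinese Academy of Sciences, Beijing 100190
5. National Science Library, Chinese Academy of Sciences, Beijing 100190
6. Key Laboratory of Intelligent Information Support for Strategic Decision-Making, National Science Library, Chinese Academy of Sciences, Beijing 100190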
Abstract

The emergence of large language models (LLMs) has profoundly transformed knowledge production, the ways in which users acquire knowledge and intelligence, and the analysis and service of scientific literature. This paper systematically reviews progress in domain-specific LLMs, summarizes their technical approaches and application scenarios, and analyzes the practical needs and value of an LLM for scientific literature. It designs a technical framework for constructing such a model, addressing two key issues: the standardized construction of a scientific literature corpus and multi-round incremental fine-tuning. It then pre-trains an LLM for scientific literature and, based on this model, develops an intelligent knowledge service platform called Spark Science Research Assistant. The paper makes two main contributions. First, it explores a method for constructing a scientific literature corpus from large-scale raw scientific literature data: a pre-training corpus and fine-tuning instruction datasets are built at the levels of full-text paragraphs, rhetorical move sentences, and reading comprehension question-answer pairs, enabling pre-training and fine-tuning of the LLM for scientific literature. Second, based on the pre-trained LLM, the Spark Science Research Assistant platform was developed. The platform demonstrates the model's effectiveness in several typical research scenarios, including literature review, literature knowledge extraction, literature comparative analysis, academic writing and polishing, multilingual translation, paper proofreading, and intelligent pre-review of full papers. Moreover, the model shows strong interdisciplinary capabilities, facilitating cross-domain knowledge transfer and integration. The paper provides technical and scenario-based references for building a smart scientific research environment. Future directions include improving the model's performance, expanding its knowledge base, and integrating it with other AI technologies, paving the way for more advanced and comprehensive AI-assisted research tools. 6 figs. 1 tab. 33 refs.

Key words

Large language model for scientific literature/Scientific literature corpus/AI/Knowledge service/Spark Science Research Assistant

Publication year

2024

Journal of Library Science in China

National Library of China; Library Society of China

Indexed in: CSTPCD, CSSCI, CHSSCD, PKU Core

Impact factor: 4.605

ISSN: 1001-8867
