
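Building an AI Data System of Scientific and Technical Literature for Thematic Scenarios: Technical Framework Research and Practice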

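[Purpose/Significance] Artificial intelligence for scientific research has become an important driver of scientific discovery. High-quality data resources for thematic scenarios are key to training high-performance AI models. Given the complexity of scientific and technical literature data and the limits on using it directly for large-model training, a systematic technical framework for data construction is urgently needed, one that processes, refines, and integrates scientific literature resources to build a high-quality training corpus for AI applications. [Method/Process] This study proposes a "3+5 technical framework" for building an AI data system from scientific literature. Covering the full construction workflow, it defines three levels of data content and five stages of data governance. With big data technology and intelligent mining technology as the key elements of data governance, it describes the architecture and functions of the data governance tool chain in detail. [Results/Conclusions] To validate the framework, this study applied it to the construction of an AI data system for rice breeding. The results show that the framework processes scientific literature data effectively and yields a high-quality domain dataset, providing data support for applying AI models to rice breeding research and confirming the framework's effectiveness and practicality.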
Construction of a Scientific Literature AI Data System for the Thematic Scenario: Technical Framework Research and Practice
[Purpose/Significance] Artificial intelligence is empowering scientific research and has become a major driver of scientific discovery. High-quality data resources for thematic scenarios are the key to training high-performance AI models. Given the complexity of scientific and technological (S&T) literature data and the limitations of its direct use for large-scale model training, there is an urgent need for a systematic technical framework for data construction that processes, refines, and curates S&T literature resources and ultimately builds a high-quality training corpus for AI applications. Related studies exist, but research on S&T literature AI data systems for thematic scenarios is still lacking. [Method/Process] This article proposes a "3+5 technical framework" for constructing an AI data system for thematic scenarios. Covering the whole construction process, it defines three levels of data content and five stages of data governance. The three data levels are the multi-type basic database, the multi-modal deconstruction database, and the fine-grained semantic mining knowledge base. The five construction stages are multi-channel data source scanning, multi-type basic data construction, multi-modal deconstruction data construction, fine-grained semantic mining knowledge construction, and multi-scenario data application. With big data technology and intelligent mining technology as the key elements of data governance, the system architecture and functions of the data governance tool chain are described in detail. The core components of the tool chain are a multi-source data aggregation tool, a multi-format data parsing tool, a data cleaning tool, an associated file identification and acquisition tool, a data fusion tool, a multi-modal deconstruction and reorganization tool, and a fine-grained knowledge identification tool. Working together, these tools ensure the efficiency and integrity of the process from raw data to the AI data system. [Results/Conclusions] To verify the effectiveness of the proposed technical framework, this study built a knowledge base for the field of rice breeding. The AI data system for the thematic scenario of intelligent rice breeding comprises a multi-type basic knowledge layer, a multi-modal deconstruction and recombination knowledge layer, and a fine-grained semantic mining knowledge layer. The basic knowledge layer contains general scientific papers and patent data; the multi-modal knowledge layer contains the multi-modal deconstruction of paper content; the domain semantic mining knowledge layer focuses on professional knowledge in intelligent rice breeding, such as rice variety validation data, phenotypic characteristics data, and the rice lineage network. The results show that the framework can effectively process S&T literature data and build a high-quality domain knowledge base, providing data support for the application of AI models in rice breeding research and confirming the effectiveness and practicality of the framework.
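As a reading aid, the "3+5" structure described above can be sketched as a small data model. This is only an illustrative sketch, not the authors' implementation: the names `Layer`, `STAGES`, `STAGE_TARGET`, `DataSystem`, and `build_data_system` are hypothetical, and the mapping from stages to the layers they populate is an assumption inferred from the matching names in the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    """The three data levels named in the abstract."""
    BASIC = "multi-type basic database"
    DECONSTRUCTED = "multi-modal deconstruction database"
    SEMANTIC = "fine-grained semantic mining knowledge base"


# The five construction stages, in the order the abstract lists them.
STAGES = (
    "multi-channel data source scanning",
    "multi-type basic data construction",
    "multi-modal deconstruction data construction",
    "fine-grained semantic mining knowledge construction",
    "multi-scenario data application",
)

# ASSUMPTION: which layer each stage writes to. The abstract does not state
# this mapping; it is inferred here from the matching stage and layer names.
STAGE_TARGET = {
    STAGES[1]: Layer.BASIC,
    STAGES[2]: Layer.DECONSTRUCTED,
    STAGES[3]: Layer.SEMANTIC,
}


@dataclass
class DataSystem:
    # One list of record identifiers per layer.
    layers: dict = field(default_factory=lambda: {layer: [] for layer in Layer})
    # Stages that have been run, in order.
    completed: list = field(default_factory=list)


def build_data_system(source_ids):
    """Run the five stages in order over a list of source identifiers."""
    system = DataSystem()
    for stage in STAGES:
        target = STAGE_TARGET.get(stage)
        if target is not None:
            # Placeholder for real processing: only record which sources
            # reached this layer.
            system.layers[target].extend(source_ids)
        system.completed.append(stage)
    return system


system = build_data_system(["paper-001", "patent-001"])
for layer, items in system.layers.items():
    print(f"{layer.value}: {len(items)} records")
```

In a real tool chain each stage would call the corresponding tools (aggregation, parsing, cleaning, fusion, and so on) rather than copying identifiers; the sketch only fixes the ordering of stages and the layer each one feeds.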

Keywords: AI data system; multi-modal deconstruction; semantically annotated data; data governance tool chain; data feature quantization

Authors: Chang Zhijun (常志军), Qian Li (钱力), Wu Yaoting (吴垚葶), Qu Yunpeng (曲云鹏), Gong Yue (巩玥), Zhang Zhixiong (张智雄)


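National Science Library, Chinese Academy of Sciences, Beijing 100190

Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190

Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, National Press and Publication Administration, Beijing 100190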

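Keywords (from the Chinese abstract): AI data system; multi-modal deconstruction; semantically annotated data; data governance tool chain; data feature vectorization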

2024

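Journal of Library and Information Science in Agriculture (农业图书情报学报)
Agricultural Information Institute, Chinese Academy of Agricultural Sciences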


Impact factor: 0.48
ISSN: 1002-1248
Year, volume (issue): 2024, 36(9)