Construction of a Scientific Literature AI Data System for the Thematic Scenario: Technical Framework Research and Practice
[Purpose/Significance] Artificial intelligence is empowering scientific research and has become a major driver of scientific discovery. High-quality data resources for thematic scenarios are key to training high-performance AI models. Given the complexity of scientific and technological (S&T) literature data and the limitations of using it directly for large-scale model training, there is an urgent need for a systematic data construction framework that processes, refines, and curates S&T literature resources into a high-quality training corpus for AI applications. A number of studies have addressed related problems, but research on S&T literature AI data systems for thematic scenarios is still lacking.
[Method/Process] This article proposes a "3+5 technical framework" for constructing an AI data system for thematic scenarios. Covering the whole process of AI data system construction, the framework defines three levels of data content and five stages of data governance. The three levels of data content are the multi-type basic database, the multi-modal deconstruction database, and the fine-grained semantic mining knowledge base. The five stages are multi-channel data source scanning, multi-type basic data construction, multi-modal deconstruction data construction, fine-grained semantic mining knowledge construction, and multi-scenario data application. With big data technology and intelligent mining technology as the key elements of data governance, the system architecture and functions of the data governance tool chain are described in detail. The core components of the tool chain are a multi-source data aggregation tool, a multi-format data parsing tool, a data cleaning tool, an associated file identification and acquisition tool, a data fusion tool, a multi-modal deconstruction and reorganization tool, and a fine-grained knowledge identification tool. Working together, these tools ensure the efficiency and integrity of the construction process from raw data to the AI data system.
[Results/Conclusions] To verify the effectiveness of the proposed technical framework, this study built a knowledge base in the field of rice breeding. The AI data system for the thematic scenario of intelligent rice breeding comprises a multi-type basic knowledge layer, a multi-modal deconstruction and recombination knowledge layer, and a fine-grained semantic mining knowledge layer. The basic knowledge layer contains general scientific papers and patent data; the multi-modal knowledge layer contains the multi-modal deconstruction of paper content; the domain semantic mining knowledge layer focuses on professional knowledge in intelligent rice breeding, such as rice variety validation data, phenotypic characteristics data, and the rice lineage network. The results show that the framework can effectively process S&T literature data and build a high-quality domain knowledge base, providing data support for applying AI models in rice breeding research and verifying the framework's effectiveness and practicality.
AI data system; multi-modal deconstruction; semantically annotated data; data governance tool chain; data feature quantization
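
The "3+5" structure described in the abstract can be sketched as a small data model. This is a minimal illustration, not the authors' implementation: the names `DataLayer`, `Stage`, `STAGE_OUTPUT`, and `layers_built_by` are hypothetical, and the mapping from stages to the data levels they produce is an assumption inferred from the matching names in the abstract rather than something the source states.

```python
from enum import Enum
from typing import Dict, List, Optional


class DataLayer(Enum):
    """The three levels of data content named in the abstract."""
    BASIC = "multi-type basic database"
    MULTIMODAL = "multi-modal deconstruction database"
    SEMANTIC = "fine-grained semantic mining knowledge base"


class Stage(Enum):
    """The five stages of data governance, in the order the abstract lists them."""
    SOURCE_SCANNING = 1            # multi-channel data source scanning
    BASIC_DATA_CONSTRUCTION = 2    # multi-type basic data construction
    MULTIMODAL_DECONSTRUCTION = 3  # multi-modal deconstruction data construction
    SEMANTIC_MINING = 4            # fine-grained semantic mining knowledge construction
    DATA_APPLICATION = 5           # multi-scenario data application


# ASSUMPTION: which data level each stage produces. Inferred from the
# parallel naming of stages and levels; the source does not state it.
STAGE_OUTPUT: Dict[Stage, Optional[DataLayer]] = {
    Stage.SOURCE_SCANNING: None,
    Stage.BASIC_DATA_CONSTRUCTION: DataLayer.BASIC,
    Stage.MULTIMODAL_DECONSTRUCTION: DataLayer.MULTIMODAL,
    Stage.SEMANTIC_MINING: DataLayer.SEMANTIC,
    Stage.DATA_APPLICATION: None,
}


def layers_built_by(last_stage: Stage) -> List[DataLayer]:
    """Return the data levels available once the stages up to and
    including `last_stage` have run, in construction order."""
    built: List[DataLayer] = []
    for stage in Stage:  # Enum iteration follows definition order
        if stage.value > last_stage.value:
            break
        layer = STAGE_OUTPUT[stage]
        if layer is not None:
            built.append(layer)
    return built


if __name__ == "__main__":
    for stage in Stage:
        names = [layer.value for layer in layers_built_by(stage)]
        print(f"after stage {stage.value} ({stage.name}): {names}")
```

Under this reading, the first and last stages produce no data level of their own: scanning only locates sources, and application consumes the three levels built in between.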