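Automatic Generative Information Science Term Extraction and Multidimensional Linked Knowledge Mining (生成式情报学术语自动抽取与多维关联知识挖掘研究)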

Abstract: Information science terminology carries the basic knowledge and core concepts of the discipline. Sorting out and analyzing information science terms at the concept level is therefore important for advancing the discipline and for supporting downstream knowledge mining tasks. As the volume of scientific and technical literature grows rapidly, automatic term extraction has replaced manual screening, but existing methods rely heavily on large-scale labeled datasets and are difficult to transfer to low-resource scenarios. This study designs a Generative Term eXtraction for Information Science (GTX-IS) method, which transforms the traditional extractive task based on sequence labeling into a sequence-to-sequence generative task. By combining few-shot learning strategies with supervised fine-tuning, the method improves task-specific text generation and can extract information science terms fairly accurately when little labeled data is available. Building on the extraction results, the study further carries out term discovery and multidimensional knowledge mining in information science. It applies full-text scientometric and informetric methods to statistically analyze and mine the occurrence frequency, life cycle, and co-occurrence of terms along three dimensions: the terms themselves, the relationships between terms, and time. Using social network analysis combined with temporal features, it enriches the dynamic profiles of journals from a term perspective and explores the research hotspots, evolution, and future trends of information science. In the term extraction experiments, the proposed method outperforms all 13 mainstream generative and extractive baseline models, shows strong few-shot learning ability, and offers a new approach to domain information extraction.
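The abstract's central move, recasting sequence labeling as sequence-to-sequence generation, can be illustrated with a small data-preparation sketch. It is not the GTX-IS implementation: the prompt wording, the `; ` separator, the plain `B`/`I`/`O` tag set, and the function names `bio_to_terms`, `to_seq2seq_pair`, and `parse_generated` are all assumptions made for illustration. The sketch only shows how a tagged sentence might become a (source, target) text pair for fine-tuning a generative model, and how generated output might be parsed back into a term list.

```python
from typing import List, Tuple


def bio_to_terms(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect term strings from BIO tags (B = begin, I = inside, O = outside)."""
    terms: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new term starts; close any term that was still open.
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O" (or a stray "I" with no open term) ends the current term.
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


def to_seq2seq_pair(
    tokens: List[str],
    tags: List[str],
    prompt: str = "Extract the information science terms: ",  # hypothetical prompt
    sep: str = "; ",  # hypothetical separator
) -> Tuple[str, str]:
    """Turn one labeled sentence into a (source, target) pair of plain strings."""
    source = prompt + " ".join(tokens)
    target = sep.join(bio_to_terms(tokens, tags))
    return source, target


def parse_generated(output: str, sep: str = ";") -> List[str]:
    """Split generated text back into a de-duplicated, order-preserving term list."""
    seen = set()
    terms: List[str] = []
    for piece in output.split(sep):
        term = piece.strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


tokens = ["We", "study", "automatic", "term", "extraction", "and", "scientometrics", "."]
tags = ["O", "O", "B", "I", "I", "O", "B", "O"]
source, target = to_seq2seq_pair(tokens, tags)
print(source)
print(target)
print(parse_generated(target))
```

With pairs in this form, a handful of labeled sentences can serve either as few-shot demonstrations in a prompt or as supervised fine-tuning examples, which is the low-resource setting the abstract describes.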

Keywords:

information science term; automatic term extraction; text generation; scientometrics; hotspot analysis

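Authors:

Hu Haotian (胡昊天), Deng Sanhong (邓三鸿), Kong Ling (孔玲), Yan Xiaohui (闫晓慧), Yang Wenxia (杨文霞), Wang Dongbo (王东波), Shen Si (沈思)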

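Author affiliations:

Jiangsu Academy of Agricultural Sciences, Nanjing 210014

School of Information Management, Nanjing University, Nanjing 210023

Provincial University Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210023

School of Information Management, Shandong University of Technology, Zibo 255049

College of Information Management, Nanjing Agricultural University, Nanjing 210095

School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094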

Keywords (Chinese):

情报学术语; 术语自动抽取; 文本生成; 科学计量; 热点分析

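Funding:

Major Program of the National Social Science Fund of China; General Program of the National Natural Science Foundation of China; Fundamental Research Funds for the Central Universities (Nanjing University project)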

Project numbers:

20&ZD332; 71974094; 0108-14370317

Publication year:

2024

DOI:

10.3772/j.issn.1000-0135.2024.05.008

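Journal: Journal of the China Society for Scientific and Technical Information (情报学报)

China Society for Scientific and Technical Information; Institute of Scientific and Technical Information of China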

Indexed in: CSTPCD; CSSCI; CHSSCD; Peking University Core Journals (北大核心)

Impact factor: 1.296

ISSN: 1000-0135

Year, volume (issue): 2024, 43(5)

Number of references: 34