Automatic Generative Information Science Term Extraction and Multidimensional Linked Knowledge Mining
Information science terminology carries the basic knowledge and core concepts of the discipline. Sorting and analyzing information science terms at the concept level is therefore important for advancing the discipline and for supporting downstream knowledge mining tasks. As the volume of scientific and technological literature grows rapidly, automatic term extraction has replaced manual screening, but existing methods rely heavily on large-scale labeled datasets and are difficult to transfer to low-resource scenarios. This study designs a Generative Term eXtraction for Information Science (GTX-IS) method, which transforms the traditional extractive task based on sequence labeling into a sequence-to-sequence generative task. By combining few-shot learning strategies with supervised fine-tuning, the method improves task-specific text generation and extracts information science terms with relatively high accuracy when little labeled data is available. Based on the extraction results, this study further carries out term discovery and multidimensional knowledge mining in the field of information science. Full-text scientometric and informetric methods are applied to the occurrence frequency, life cycle, and co-occurrence information of terms, covering three dimensions: the terms themselves, the relationships between terms, and time information. Using social network analysis combined with time-dimension features, the study takes a term-based perspective to enrich the dynamic profiles of journals and to explore the research hotspots, evolution, and future development trends of information science. In the term extraction experiments, the proposed method outperforms all 13 mainstream generative and extractive baseline models and shows strong few-shot learning ability, offering a new approach to domain information extraction.
Keywords: information science term; automatic term extraction; text generation; scientometrics; hotspot analysis
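The two stages summarized in the abstract can be illustrated with a minimal sketch. It is not the GTX-IS implementation: the abstract does not specify the target format, so the semicolon-separated term list, the plain `B`/`I`/`O` tags, and the function names `bio_to_target`, `parse_generated`, and `term_statistics` are all assumptions made for illustration. The first two functions show how a sequence-labeling example might be recast as a text-to-text pair for a generative model; the third computes simple per-term document frequency, pairwise co-occurrence, and a first-to-last-year span as a rough stand-in for life cycle.

```python
from collections import Counter
from itertools import combinations


def bio_to_target(tokens, tags, sep="; "):
    """Turn a BIO-tagged sentence into a (source, target) text pair.

    The target format (terms joined by `sep`) is an assumption, not the paper's.
    """
    terms, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new term starts; close any term still open.
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O" (or a stray "I") ends the current term, if any.
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return " ".join(tokens), sep.join(terms)


def parse_generated(text, sep=";"):
    """Split a generated target string back into a list of terms."""
    return [part.strip() for part in text.split(sep) if part.strip()]


def term_statistics(docs):
    """Compute term statistics from (year, terms) pairs, one pair per article.

    Returns document frequency, co-occurrence counts over sorted term pairs,
    and a (first_year, last_year) span per term.
    """
    freq, cooc, span = Counter(), Counter(), {}
    for year, terms in docs:
        unique = sorted(set(terms))  # count each term once per article
        freq.update(unique)
        cooc.update(combinations(unique, 2))
        for term in unique:
            first, last = span.get(term, (year, year))
            span[term] = (min(first, year), max(last, year))
    return freq, cooc, span
```

In such a setup, the (source, target) pairs would serve as fine-tuning examples for a sequence-to-sequence model, and `parse_generated` would recover terms from its output before they are passed to `term_statistics`. The co-occurrence counts could then be used as edge weights in a term network for social network analysis.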