Construction of Scholar Profile Based on Generative Pre-Trained Language Model

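Abstract (translated from the Chinese): In the era of big data, scholar information on the Internet exists in multi-source, heterogeneous, and unstructured forms, and extracting entities from it runs into problems such as attribute confusion and long entities, which seriously reduce the accuracy of scholar profile construction. At the same time, the scholar attribute entity extraction model, a key model in the profile construction process, still carries a high technical barrier in practical use, which hinders the wider application of scholar profiles. To address this, the paper builds on open resources to construct an attribute entity extraction framework based on a generative pre-trained language model, using guide-sentence modelling, autoregressive generation, and fine-tuning on a training corpus. The method is validated from four aspects: overall model performance, per-category entity extraction performance, case analysis of the main influencing factors, and analysis of the effect of sample fine-tuning. Compared with the baseline models, the proposed method achieves the best results on all 12 categories of scholar attribute entities, with an overall F1 score of 99.34%. It distinguishes mutually confusable attribute entities well and also improves extraction accuracy for "research interests", a typical long attribute entity, by 6.11%, providing a faster and more effective method for the engineering application of scholar profiles.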

English title: Construction of Scholar Profile Based on Generative Pre-Trained Language Model

English abstract: In the era of big data, scholar information on the Internet exists in a multi-source, heterogeneous, and unstructured form, and its extraction is accompanied by problems such as attribute confusion and long entities, which seriously affect the accuracy of scholar profile construction. Meanwhile, the scholar attribute entity extraction model, as a key model in the construction of scholar profiles, still presents significant technical barriers in practical applications, which pose certain obstacles to the widespread application of scholar profiles. Therefore, based on open resources, we construct an attribute entity extraction method based on generative pre-trained language models through guided sentence modelling, an autoregressive generation approach, and training corpus fine-tuning, and validate the method from four aspects: overall model effect, entity category extraction effect, instance analysis of the main influencing factors, and analysis of sample fine-tuning impact. Compared with the baseline models, the method proposed in this paper achieves optimal performance across 12 categories of scholar attribute entities, with a comprehensive F1 score of 99.34%. It not only effectively identifies and differentiates mutually confusing attribute entities, but also enhances the extraction precision of typical long attribute entities such as "research interests" by 6.11%. This method provides more expedient and effective methodological support for the engineering application of scholar profiles.

English keywords:

Generative Pre-Trained Language Model; Sample Fine-Tuning; Scholar Profile; GPT-3

Authors:

柳涛 (Liu Tao), 丁陈君 (Ding Chenjun), 姜恩波 (Jiang Enbo), 许睿 (Xu Rui), 陈方 (Chen Fang)

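Author affiliations:

Chengdu Library and Information Center, Chinese Academy of Sciences (中国科学院成都文献情报中心), Chengdu 610299

Department of Information Resources Management, University of Chinese Academy of Sciences, Beijing 100190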

Keywords (translated from the Chinese):

generative pre-trained language model; sample fine-tuning; scholar profile; GPT-3

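Funding:

"West Light" ("西部之光") Talent Training Program; Innovation Fund of the Chengdu Library and Information Center, Chinese Academy of Sciences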

Project numbers:

E1C0000401; E1Z0000101

Publication year:

2024

DOI:

10.3772/j.issn.1673-2286.2024.03.001

Journal: Digital Library Forum (数字图书馆论坛)

Institute of Scientific and Technical Information of China (ISTIC); Beijing Wanfang Data Co., Ltd.

CSTPCD

Impact factor: 0.337

ISSN: 1673-2286

Year, volume (issue): 2024, 20(3)

Number of references: 21