Research on a Paper and Patent Data Fusion Method Based on Deep Text Clustering

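Abstract (translated from the Chinese): [Objective] To overcome the barrier posed by differences in linguistic features between papers and patents, and to integrate paper and patent data by research topic. [Methods] Taking Wikipedia as the basic classification system, a small labeled set is constructed semi-automatically; a semi-supervised deep text clustering model is designed to cluster and fuse papers and patents on similar topics; and indicators are designed to evaluate the quality of the fusion results. [Results] On two datasets, the proposed model improves clustering accuracy by 2.4 to 11.9 percentage points over the other baseline models; the quality evaluation score of the fusion results exceeds 0.9, outperforming the baseline models; and the model can supplement research topics on the basis of the known topics. [Limitations] No empirical analysis was carried out with the fused data, and the number of clusters must be set manually. [Conclusions] The proposed model can extract topic-related features from the differing texts of papers and patents and effectively achieve data fusion.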

English title: Paper and Patent Data Fusion Based on Deep Text Clustering

English abstract: [Objective] This study integrates papers and patents by research topic, bridging the differences in their language. [Methods] Using Wikipedia as the basic classification system, we semi-automatically constructed a small labeled set. We then designed a semi-supervised deep text clustering model to fuse papers and patents on similar topics. Finally, we designed indicators to evaluate the quality of the data fusion. [Results] On two datasets, the model's clustering accuracy was 2.4 to 11.9 percentage points higher than that of the other baseline models. Its data fusion quality score exceeded 0.9, outperforming the baseline models, and the model can supplement research topics on the basis of the known topics. [Limitations] We did not conduct an empirical analysis using the fused data, and the number of clusters must be determined manually. [Conclusions] The proposed model can extract topic-related features from the differing texts of papers and patents and effectively achieve data fusion.

English keywords:

Deep Text Clustering; Data Fusion; Papers; Patents; Research Topic Identification

Authors:

Xie Shiyao (谢士尧), Wang Xiaomei (王小梅)

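Author affiliations:

Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190

School of Public Policy and Management, University of Chinese Academy of Sciences, Beijing 100049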

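Keywords (Chinese):

深度文本聚类 (deep text clustering); 数据融合 (data fusion); 论文 (papers); 专利 (patents); 研究主题识别 (research topic identification)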

Funding:

Strategic Research Special Project of the Chinese Academy of Sciences (中国科学院战略研究专项)

Project number:

GHJ-ZLZX-2022-09

Publication year:

2024

DOI:

10.11925/infotech.2096-3467.2023.0232

Data Analysis and Knowledge Discovery (数据分析与知识发现)

National Science Library, Chinese Academy of Sciences (中国科学院文献情报中心)

Indexed in: CSTPCD, CSSCI, CHSSCD, PKU Core (北大核心), EI

Impact factor: 1.452

ISSN: 2096-3467

Year, volume (issue): 2024, 8(4)

Number of references: 51