Research on Patent Technology Topic Clustering Based on Sentence-BERT: Taking the Field of Artificial Intelligence as an Example

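Chinese abstract (translated): [Research purpose] To apply the Sentence-BERT model to patent technology topic clustering, addressing the sparsity of semantic features in word vectors that arises because patent documents often use distinctive technical terms to emphasize novelty. [Research method] The experimental data are 22,370 patents in the field of artificial intelligence from 2015 to 2019. First, the Sentence-BERT algorithm is used to produce vector representations of the patent abstract texts; second, the dimensionality of the resulting vector matrix is reduced and HDBSCAN is used to find high-density clusters in the original data; finally, the topic features of each cluster's text set are identified and the topics are presented. [Research conclusion] Compared with methods such as the LDA topic model, K-means, and doc2vec, the experimental results improve the granularity and precision of topic partitioning and achieve good topic coherence. How to use a fine-tuning strategy to further improve the model is a direction for future exploration of this method.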

English title: Research on Patent Technology Subject Clustering Based on Sentence-BERT: Taking the Field of Artificial Intelligence as an Example

English abstract: [Research purpose] The Sentence-BERT model is applied to patent technology topic clustering to solve the problem of sparse semantic features in lexical vectors, which is caused by the frequent use of unique technical terms in patent documents to highlight novelty. [Research method] The study takes 22,370 patents in the field of artificial intelligence from 2015 to 2019 as experimental data. First, the Sentence-BERT algorithm is used to vectorize the abstract text of each patent document. Second, the dimensionality of the vectorization matrix is reduced, and the HDBSCAN method is used to find the high-density clusters in the original data. Finally, the topic features in each cluster's text collection are identified and the topics are presented. [Research conclusion] Compared with the LDA topic model, K-means, doc2vec, and other methods, the experimental results of this study improve the granularity and accuracy of topic division and obtain better topic consistency. How to use a fine-tuning strategy to further improve the model is a direction for further exploration of this method.
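The three steps named in the abstract (vectorize, reduce and cluster, describe each cluster) can be illustrated with a minimal sketch. Nothing below is taken from the paper beyond the use of Sentence-BERT and HDBSCAN: the checkpoint name, the use of PCA as the reduction step, the hyperparameter values, the whitespace tokenization, and the TF-IDF-style term scoring are all assumptions chosen for illustration, and the function names `embed_and_cluster` and `top_terms_per_cluster` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def embed_and_cluster(abstracts,
                      model_name="paraphrase-multilingual-MiniLM-L12-v2",
                      n_components=5, min_cluster_size=15):
    """Embed abstracts, reduce dimensionality, and find dense clusters.

    Assumptions (not from the paper): the checkpoint name, PCA as the
    reduction step, and both hyperparameter values.
    """
    # Imported lazily so the pure-Python helper below works without them.
    from sentence_transformers import SentenceTransformer
    from sklearn.decomposition import PCA
    from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3

    embeddings = SentenceTransformer(model_name).encode(list(abstracts))
    reduced = PCA(n_components=n_components).fit_transform(embeddings)
    # HDBSCAN marks low-density points with the label -1 (noise).
    return HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)


def top_terms_per_cluster(docs, labels, k=10):
    """Rank terms per cluster with a simple cluster-level TF-IDF score.

    Assumes whitespace-separated tokens; Chinese text would need a word
    segmenter first. The scoring rule is illustrative only.
    """
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        label = int(label)
        if label == -1:  # skip noise points
            continue
        term_counts[label].update(doc.lower().split())

    # Number of clusters in which each term appears at least once.
    cluster_freq = Counter()
    for counts in term_counts.values():
        cluster_freq.update(counts.keys())

    n_clusters = len(term_counts)
    result = {}
    for label, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(1 + n_clusters / cluster_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically for stable ties.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[label] = ranked[:k]
    return result
```

In use, the labels returned by `embed_and_cluster(abstracts)` would be passed together with the tokenized abstracts to `top_terms_per_cluster`. Terms shared by every cluster receive a lower weight, so the ranked terms tend to be the ones that distinguish a cluster from the others.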

English keywords:

Sentence-BERT; patent text; subject identification; text clustering

Authors:

Ruan Guangce (阮光册), Zhou Mengwei (周萌葳)

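Affiliation:

Department of Information Management, Faculty of Economics and Management, East China Normal University, Shanghai 200241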

Keywords (translated from Chinese):

Sentence-BERT; patent text; topic identification; text clustering

Publication year:

2024

DOI:

10.3969/j.issn.1002-1965.2024.02.016

Journal of Intelligence (情报杂志)

Shaanxi Provincial Institute of Scientific and Technical Information (陕西省科学技术信息研究所)

Indexed in: CSTPCD; CSSCI; CHSSCD; PKU Core (北大核心)

Impact factor: 1.502

ISSN: 1002-1965

Year, volume (issue): 2024, 43(2)

Number of references: 36