Research on Patent Technology Topic Clustering Based on Sentence-BERT: Taking the Field of Artificial Intelligence as an Example

[Research purpose] This study applies the Sentence-BERT model to patent technology topic clustering. It addresses the sparsity of semantic features in word vectors, which arises because patent documents often use unique technical terms to emphasize novelty.

[Research method] The experimental data are 22,370 patents in the field of artificial intelligence from 2015 to 2019. First, the Sentence-BERT algorithm is used to produce vector representations of the patent abstract texts. Second, the dimensionality of the resulting vector matrix is reduced, and HDBSCAN is used to find high-density clusters in the original data. Finally, the topic features of each cluster's text collection are identified and the topics are presented.

[Research conclusion] Compared with the LDA topic model, K-means, doc2vec, and other methods, the approach yields finer-grained and more precise topic partitions and achieves better topic coherence. How to use a fine-tuning strategy to further improve the model is a direction for future work.
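The last step of the method, identifying topic features within each cluster, can be illustrated with a small sketch. The abstract does not say how the topic features are extracted, so the class-based TF-IDF weighting used here is an assumption, not the authors' procedure. The function name `top_terms_per_cluster` and the toy tokens are hypothetical. The cluster labels are assumed to come from an HDBSCAN run, where the label -1 marks noise points.

```python
import math
from collections import Counter, defaultdict


def top_terms_per_cluster(docs, labels, k=3):
    """Rank the terms of each cluster with a class-based TF-IDF score.

    docs   : list of token lists, one per patent abstract
    labels : cluster id per document (-1 = noise, as in HDBSCAN output)
    k      : number of terms to keep per cluster
    """
    # Merge all documents of a cluster into one bag of words.
    cluster_tf = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        if label == -1:  # skip documents left unclustered
            continue
        cluster_tf[label].update(tokens)

    n_clusters = len(cluster_tf)

    # Count in how many clusters each term occurs.
    cluster_freq = Counter()
    for tf in cluster_tf.values():
        cluster_freq.update(tf.keys())

    topics = {}
    for label, tf in cluster_tf.items():
        total = sum(tf.values())
        # Term frequency inside the cluster, damped for terms
        # that appear in many clusters.
        scores = {
            term: (count / total) * math.log(1 + n_clusters / cluster_freq[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        topics[label] = ranked[:k]
    return topics


# Toy, made-up token lists standing in for tokenized abstracts.
docs = [
    ["image", "recognition", "camera"],
    ["image", "camera", "network"],
    ["speech", "recognition", "audio"],
    ["speech", "audio", "network"],
    ["battery", "charging"],
]
labels = [0, 0, 1, 1, -1]

topics = top_terms_per_cluster(docs, labels, k=2)
print(topics)  # {0: ['camera', 'image'], 1: ['audio', 'speech']}
```

Terms shared by both toy clusters ("recognition", "network") score lower than cluster-specific ones, which is the behavior a topic label needs. The noise document is ignored.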

Sentence-BERT; patent text; subject identification; text clustering

Ruan Guangce (阮光册), Zhou Mengwei (周萌葳)

Department of Information Management, Faculty of Economics and Management, East China Normal University, Shanghai 200241

2024

Journal of Intelligence (情报杂志)
Shaanxi Institute of Scientific and Technical Information

Indexed in: CSTPCD; CSSCI; CHSSCD; Peking University Core Journals
Impact factor: 1.502
ISSN: 1002-1965
Year, volume (issue): 2024, 43(2)