Intelligent Patent Classification Technology Research Based on SimBERT+CNN Model

Hong Qunye (洪群业)¹, Liu Qi (刘琦)², Liu Chunyan (刘春燕)², Zheng Lu (郑路)¹, Li Yehui (李烨辉)³, Yang Shenxue (杨申学)²

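Author information

1. Zhengzhou Tobacco Research Institute of China National Tobacco Corporation, Zhengzhou 450001
2. Intellectual Property Publishing House Co., Ltd., Beijing 100081
3. School of Intellectual Property, Nanjing University of Science and Technology, Nanjing 210094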

Abstract

Using tobacco-industry patents as an example, this paper studies intelligent classification of tobacco-related patent documents with a SimBERT+CNN deep learning model, for either automatic technology classification of patent data or assistance to manual classification. The method is as follows. Tobacco-related patent documents are manually annotated with a two-level technology classification, and both tobacco-technology and non-tobacco-technology patents serve as sample data for deep learning. From patents that have X-type citations, the claims and the corresponding text passages of the cited patents are extracted as sentence pairs, which are used to optimize the training of a semantic model built on SimBERT. Using the optimized SimBERT model, the text feature vector and the IPC classification-number feature vector of each patent classification sample are concatenated and fed into a CNN model. In training and testing on more than 150,000 tobacco-technology patents and more than 20,000 non-tobacco-technology patents, the SimBERT+CNN model optimized in this way achieved higher test accuracy than BERT+CNN on both the first-level and the second-level tobacco technology classification.
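The data-preparation steps named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Citation` record layout, the function names, and the sample texts are hypothetical, and encoding IPC codes as a multi-hot vector is an assumption, since the abstract does not say how the IPC feature vector is built. The SimBERT embedding is replaced by a placeholder list of numbers, and the CNN classifier is left out. In patent search reports, category "X" marks a cited document that on its own challenges the novelty or inventive step of a claim, which is why such claim and passage pairs are plausible semantic-similarity training data.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """One search-report citation (hypothetical record layout)."""
    category: str        # citation category, e.g. "X", "Y", "A"
    claim_text: str      # claim of the citing patent
    cited_passage: str   # corresponding passage of the cited patent


def build_sentence_pairs(citations):
    """Keep only X-type citations and return (claim, cited passage) pairs."""
    return [
        (c.claim_text, c.cited_passage)
        for c in citations
        if c.category == "X"
    ]


def ipc_multi_hot(ipc_codes, vocabulary):
    """Encode IPC codes as a multi-hot vector (assumed encoding, not from the paper)."""
    present = set(ipc_codes)
    return [1.0 if code in present else 0.0 for code in vocabulary]


def concat_features(text_vector, ipc_vector):
    """Concatenate the text feature vector with the IPC feature vector."""
    return list(text_vector) + list(ipc_vector)


# Made-up sample records for illustration only.
citations = [
    Citation("X", "A cigarette filter with a hollow core.", "The filter rod has a hollow section."),
    Citation("A", "A packaging machine for cartons.", "Background art on carton folding."),
]
pairs = build_sentence_pairs(citations)

ipc_vocabulary = ["A24B", "A24C", "A24D", "A24F"]  # illustrative subset of IPC subclasses
text_vector = [0.12, -0.40, 0.33]                  # placeholder for a SimBERT sentence embedding
features = concat_features(text_vector, ipc_multi_hot(["A24D", "A24F"], ipc_vocabulary))
print(len(pairs), features)
```

In a full pipeline, `pairs` would feed the SimBERT training step and `features` would be the per-patent input to the CNN classifier; both steps are beyond this sketch.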

Key words

intelligent patent classification / tobacco industry / SimBERT / CNN / deep learning

Funding

Major Science and Technology Project of China National Tobacco Corporation (110202101082)

Publication year

2024

China Invention & Patent (中国发明与专利)

Intellectual Property Publishing House; China Association of Inventions

CSTPCD

Impact factor: 0.15

ISSN: 1672-6081

Number of references: 11
