Research on the Construction of a Chinese Patent Key Information Corpus

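Patent documents are an important type of technical literature and an important part of the work of building a country strong in intellectual property. Existing patent corpora mostly target information retrieval, machine translation, and text classification; they lack finer-grained annotation and are insufficient to support the development of newer forms of artificial intelligence technology such as question answering and reading comprehension. To meet the needs of intelligent patent analysis, this paper proposes annotating invention patents from three perspectives (the problem solved, the technical means, and the effect) and constructs a Chinese patent key information corpus containing 313 patents. Named entity recognition techniques are used to identify and validate the key information in the corpus, showing that recognizing patent key information is an information extraction problem of larger granularity, distinct from domain named entity recognition.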
As an important kind of technical document, the patent is of substantial significance to China's national intellectual property strategy. Existing patent corpora are mostly built for information retrieval and machine translation tasks, leaving fine-grained annotated patents largely untouched. To facilitate the forthcoming development of intelligent patent technology, this paper constructs a Patent Key Information Corpus consisting of 313 patents annotated with the issues, methods and effects described in their texts. SOTA named entity recognition models are then applied to the corpus, and the sharp decrease in performance indicates that automatic identification of the key information in a patent is a challenging IE task.

patent; corpus; key information

Wenting Zhang (张文婷), Meihan Zhao (赵美含), Yixuan Ma (马翊轩), Wenrui Wang (王文瑞), Yuzhe Liu (刘宇哲), Muyun Yang (杨沐昀)


Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001

Harbin Yangguang Huiyuan Intellectual Property Agency Co., Ltd. (哈尔滨市阳光惠远知识产权代理有限公司), Harbin, Heilongjiang 150000


Chinese National Conference on Computational Linguistics

Nanchang(CN)

The 21st Chinese National Conference on Computational Linguistics

455-463

2022