Professional Terms Extracting Method of Books in Chinese Based on Deep Learning
Nie Yaoxin (聂耀鑫) 1, Jiang Donglai (蒋东来) 1, Cheng Guojun (程国军) 1
Author information
- 1. Taiji Computer Co., Ltd. (太极计算机股份有限公司), Beijing 100012
Abstract
Chinese text lacks explicit word boundaries, so identifying Chinese professional terms in unstructured text is highly challenging, while the application scenarios for term recognition technology are very diverse. This paper designs a new method for extracting professional terms from Chinese text in any domain. The method first obtains word segmentation results for the text, then derives word vectors with an improved BERT-based word representation method, and finally extracts the Chinese professional terms with an autoencoder-based deep clustering method. Comparative experiments were conducted on public datasets and on data from self-selected professional books. Compared with other methods, the improved algorithm shows clear gains on three metrics: precision, recall, and F1 score.
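The three evaluation metrics named in the abstract (precision, recall, F1) can be illustrated with a minimal sketch. It assumes that extracted terms are compared with a gold-standard term list by exact string match; the abstract does not state the matching criterion, so this is an assumption. The function name `evaluate_terms` and the toy term sets are hypothetical and are not taken from the paper's experiments.

```python
def evaluate_terms(predicted, gold):
    """Return (precision, recall, f1) for two collections of term strings.

    Assumption: a predicted term counts as correct only if it exactly
    matches a gold term. The paper may use a different criterion.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data, for illustration only.
gold_terms = {"深度学习", "自动编码器", "词向量", "聚类"}
predicted_terms = {"深度学习", "词向量", "聚类", "方法", "研究"}

precision, recall, f1 = evaluate_terms(predicted_terms, gold_terms)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case three of the five predicted terms are correct and three of the four gold terms are found, so precision is lower than recall and F1, their harmonic mean, falls between the two.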
Key words
professional terms/deep learning/deep clustering/named entity recognition/machine learning
Publication year
2024