
ChpoBERT: A Pre-trained Model for Chinese Policy Texts
With the rapid development of deep learning and the accumulation of domain data, domain-specific pre-trained models play an increasingly important supporting role in knowledge organization and mining. For the massive body of Chinese policy texts, building a pre-trained model with appropriate pre-training strategies not only helps improve the intelligent processing of Chinese policy texts, but also lays a solid foundation for fine-grained, multi-dimensional, data-driven analysis of policy texts. From national, provincial, and municipal platforms, 131,390 policy texts totaling 305,648,206 Chinese characters were collected through a combination of automatic crawling and manual assistance, after filtering out non-policy texts. On this corpus, this study builds a pre-trained model for Chinese policy texts (ChpoBERT) based on BERT-base-Chinese and Chinese-RoBERTa-wwm-ext, using the MLM (masked language model) and WWM (whole word masking) objectives. The model is open-sourced on GitHub. On the perplexity metric and on the downstream tasks of automatic word segmentation, part-of-speech tagging, and named entity recognition for policy texts, the ChpoBERT models outperformed the baselines, providing domain-specific foundational computing resources for intelligent knowledge mining of policy texts.
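To illustrate the two ideas the abstract relies on: under WWM, when a character of a segmented Chinese word is chosen for masking, all characters of that word are masked together (BERT-base-Chinese tokenizes Chinese at the character level, so plain MLM would mask individual characters); and perplexity is the exponential of the average negative log-likelihood per token. The following is a minimal toy sketch of both, not the authors' implementation; the function names and the simplified masking scheme (no 80/10/10 replacement split) are illustrative assumptions.

```python
import math
import random

def whole_word_mask(words, mask_rate=0.15, mask_token="[MASK]"):
    """Toy whole-word-masking step over pre-segmented Chinese words.

    Illustrative only: each word is either fully masked (all of its
    characters at once, the point of WWM) or left intact; real BERT
    pretraining also applies the 80/10/10 mask/replace/keep split.
    """
    tokens, labels = [], []
    for word in words:
        chars = list(word)  # character-level tokens, as in BERT-base-Chinese
        if random.random() < mask_rate:
            tokens.extend([mask_token] * len(chars))  # mask the whole word
            labels.extend(chars)                      # prediction targets
        else:
            tokens.extend(chars)
            labels.extend([None] * len(chars))        # no loss on these
    return tokens, labels

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Masking every word of the segmented phrase "中文 / 政策 / 文本":
tokens, labels = whole_word_mask(["中文", "政策", "文本"], mask_rate=1.0)
# tokens == ["[MASK]"] * 6, labels == ["中", "文", "政", "策", "文", "本"]

# A model assigning probability 0.25 to every target token:
perplexity([math.log(0.25)] * 4)  # ≈ 4.0
```

Lower perplexity on held-out policy text is how the abstract's comparison against the general-domain base models is scored: a domain-adapted model should be less "surprised" by in-domain text.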

Keywords: BERT; pre-trained model; policy text; deep learning; perplexity

沈思、陈猛、冯暑阳、许乾坤、刘江峰、王飞、王东波


School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China

College of Information Management, Nanjing Agricultural University, Nanjing 210095, China

Jiangsu Institute of Scientific and Technical Information, Nanjing 210042, China


Funding: General Program of the National Natural Science Foundation of China (Grant No. 71974094)

2023

情报学报 (Journal of the China Society for Scientific and Technical Information)
China Society for Scientific and Technical Information; Institute of Scientific and Technical Information of China

Indexed in: CSTPCD; CSSCI; CSCD; CHSSCD; PKU Core (北大核心)
Impact factor: 1.296
ISSN: 1000-0135
Year, volume (issue): 2023, 42(12)