Domain-specific Pretrained Language Model in Public Health Domain
The suddenness, variability, and unpredictability of public health emergencies have increased the difficulty of processing and monitoring public health information. Constructing a domain-specific pretrained model can enhance the performance of downstream tasks. Several enhanced pre-trained models already exist in the public health field, focusing on social media and medical domains. However, these models have small training corpora, limited data sources, and short text lengths, and they lack long-text unsupervised corpora with rich semantic information. To address this issue, we adopt the BERT model and perform adaptive pre-training on a large-scale corpus of public health news to construct a domain-specific pretrained model, called PHD-News-BERT, which is suitable for deep semantic learning and facilitates learning tasks in the field. Experiments on eight datasets across five downstream tasks, compared against five baseline models, show that PHD-News-BERT achieves strong performance on most tasks, indicating excellent generalization and robustness. It is expected to provide new benchmarks for future work in the field of public health.
public health; domain-specific pretrained language model; natural language processing; BERT; adaptive pre-training
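The adaptive pre-training described above can be implemented as continued masked language modeling (MLM) on an in-domain corpus. The sketch below illustrates this setup with the Hugging Face Transformers library; the base checkpoint, corpus path, output name, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# A minimal sketch of domain-adaptive pre-training (continued MLM training).
# Base checkpoint, file paths, and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Plain-text corpus of public health news, one document per line (hypothetical path).
corpus = load_dataset("text", data_files={"train": "public_health_news.txt"})

def tokenize(batch):
    # Truncate to BERT's maximum input length; long articles could also be chunked.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic token masking with the standard 15% masking probability.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="phd-news-bert",       # illustrative output name
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("phd-news-bert")
```

The resulting checkpoint can then be fine-tuned on downstream public health tasks in the usual way, e.g. by loading it with a task-specific head such as `AutoModelForSequenceClassification`.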