
Domain-specific Pretrained Language Model in Public Health Domain

The suddenness, variability, and unpredictability of public health emergencies increase the difficulty of processing and monitoring public health information, while constructing a domain-specific pretrained model can improve the performance of downstream tasks. Some enhanced pretrained models for the public health domain already exist, focusing on social media and medical texts, but their training corpora are small in scale, drawn from a single source, and composed of short texts, lacking long unsupervised texts with rich semantic information. To address this problem, this paper performs adaptive pre-training of a base BERT model on a large-scale corpus of public health news and builds PHD-News-BERT, a domain-specific pretrained model suited to deep semantic learning that better supports tasks in this domain. Experiments on eight datasets covering five downstream tasks, compared against five baseline models, show that PHD-News-BERT achieves strong performance on most tasks, indicating good generalization and robustness. It is expected to provide a new benchmark for future work in the public health domain.
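To make the adaptive pre-training step concrete, the sketch below shows continued masked-language-model pre-training of a base BERT checkpoint on a plain-text news corpus using the Hugging Face Transformers and Datasets libraries. The checkpoint name (bert-base-chinese), the corpus file public_health_news.txt, and all hyperparameters are illustrative assumptions rather than the authors' actual configuration.

```python
# Minimal sketch of domain-adaptive pre-training: continue training a base
# BERT checkpoint with the masked language modeling (MLM) objective on a
# domain corpus.  Checkpoint, file path, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-chinese"  # assumed base checkpoint, not necessarily the paper's
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One news article per line in a plain-text file (hypothetical path).
raw = load_dataset("text", data_files={"train": "public_health_news.txt"})

def tokenize(batch):
    # Long news articles are truncated to BERT's 512-token limit.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic token masking for the MLM objective (15% of tokens masked).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="phd-news-bert",      # hypothetical output directory
    num_train_epochs=3,              # illustrative values only
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

The adapted checkpoint saved under output_dir could then be loaded with a task head such as AutoModelForSequenceClassification and fine-tuned on each downstream public health dataset in the usual way.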

Keywords: public health; domain-specific pretrained language model; natural language processing; BERT; adaptive pre-training

WANG Lianxi (王连喜), HU Guanfeng (胡冠锋)


School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, Guangdong, China


2024

Computer Technology and Development (计算机技术与发展)
Shaanxi Computer Society

CSTPCD
Impact factor: 0.621
ISSN: 1673-629X
Year, Volume (Issue): 2024, 34(12)