Domain-specific Pretrained Language Model in Public Health Domain
The suddenness, variability, and unpredictability of public health emergencies have increased the difficulty of processing and monitoring public health information. Constructing a domain-specific pretrained model can enhance the performance of downstream tasks. Several enhanced pre-trained models already exist in the public health field, focusing on social media and medical domains. However, these models have small training corpora, limited data sources, and short text lengths, and they lack long-text unsupervised corpora with rich semantic information. To address this issue, we adopt the BERT model and perform adaptive pre-training on a large-scale corpus of public health news to construct a domain-specific pretrained model, called PHD-News-BERT, which is suitable for deep semantic learning and facilitates learning tasks in the field. Experiments on eight datasets across five downstream tasks, compared against five baseline models, show that PHD-News-BERT achieves strong performance on most tasks, indicating excellent generalization and robustness. It is expected to provide new benchmarks for future work in the field of public health.
public health; domain-specific pretrained language model; natural language processing; BERT; adaptive pre-training
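The adaptive pre-training described above can be implemented as continued masked language modeling (MLM) on an in-domain corpus. The sketch below illustrates this setup with the Hugging Face Transformers library; the base checkpoint, corpus path, output name, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# A minimal sketch of domain-adaptive pre-training (continued MLM training).
# Base checkpoint, file paths, and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Plain-text corpus of public health news, one document per line (hypothetical path).
corpus = load_dataset("text", data_files={"train": "public_health_news.txt"})

def tokenize(batch):
    # Truncate to BERT's maximum input length; long articles could also be chunked.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic token masking with the standard 15% masking probability.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="phd-news-bert",       # illustrative output name
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("phd-news-bert")
```

The resulting checkpoint can then be fine-tuned on downstream public health tasks in the usual way, e.g. by loading it with a task-specific head such as `AutoModelForSequenceClassification`.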