Plateau Science Research (高原科学研究), 2024, Vol. 8, Issue 4: 108-114. DOI: 10.16249/j.cnki.2096-4617.2024.04.012

Study on Identification of Tibetan News Element Based on RoBERTa-BiLSTM-CRF

香前¹, 才藏太¹, 李措¹

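Author information

  • 1. School of Computer Science, Qinghai Normal University, Xining, Qinghai 810016; Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining, Qinghai 810008; State Key Laboratory of Tibetan Intelligent Information Processing and Application (province-ministry co-constructed), Xining, Qinghai 810008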

Abstract

News element recognition is the process of extracting key information entities such as time, location, people, organizations, and events from news texts, and it serves as the foundation for news content analysis. While significant progress has been made on Chinese news element recognition, few studies address Tibetan news, and the existing element classification schemes are rather coarse, making it difficult to cover the key information in Tibetan news reports comprehensively. This paper therefore refines the element classification of Tibetan news into 10 categories. To address the challenges of Tibetan news text, such as unclear word boundaries, numerous out-of-vocabulary words, and polysemy, we propose a Tibetan news element recognition method based on RoBERTa-BiLSTM-CRF. The method first encodes Tibetan news text with the RoBERTa pre-trained language model, then extracts features with a BiLSTM and a self-attention mechanism, and finally applies a conditional random field for sequence labeling to recognize and classify the news elements. Experiments on our self-built dataset (Tibetan news) demonstrate the effectiveness of the method, which achieves an F1 score of 88.8%.
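The final sequence-labeling step can be illustrated with a minimal sketch of Viterbi decoding for a linear-chain conditional random field. This is not the authors' implementation. The tag set (`O`, `B-LOC`, `I-LOC`), the emission scores, and the transition scores below are hypothetical values chosen for illustration; the abstract does not list the ten categories, and in a trained model the emissions would come from the RoBERTa, BiLSTM, and self-attention layers while the transition scores would be learned.

```python
from typing import Dict, List

NEG_INF = float("-inf")

# Hypothetical BIO tag set for a single element type (location).
TAGS = ["O", "B-LOC", "I-LOC"]

# Hypothetical transition scores: everything is neutral (0.0) except
# O -> I-LOC, which is forbidden because an entity must start with B-.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    prev: {cur: 0.0 for cur in TAGS} for prev in TAGS
}
TRANSITIONS["O"]["I-LOC"] = NEG_INF

# Hypothetical per-token emission scores for a three-token sentence.
EMISSIONS: List[Dict[str, float]] = [
    {"O": 2.0, "B-LOC": 1.6, "I-LOC": 0.0},
    {"O": 0.5, "B-LOC": 0.8, "I-LOC": 1.5},
    {"O": 1.0, "B-LOC": 0.0, "I-LOC": 0.2},
]


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence."""
    if not emissions:
        return []
    # best[t][tag]: score of the best path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            # Pick the predecessor that maximizes path score + transition.
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Per-token argmax ignores transitions and yields an invalid sequence here.
greedy = [max(TAGS, key=lambda tag: e[tag]) for e in EMISSIONS]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(greedy)   # ['O', 'I-LOC', 'O']
print(decoded)  # ['B-LOC', 'I-LOC', 'O']
```

With these scores, picking the best tag per token independently produces `O, I-LOC, O`, which breaks the BIO scheme. Decoding over whole sequences with transition scores returns the valid `B-LOC, I-LOC, O`, which is the usual motivation for placing a CRF layer on top of per-token features.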

Key words

Tibetan/news elements/identify/deep learning/RoBERTa

Publication year

2024

Indexed in: CSCD