Tibetan News Element Recognition Based on RoBERTa-BiLSTM-CRF

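News element recognition is the process of extracting key information entities such as time, location, people, organizations, and events from news text, and it is the foundation of news content analysis. This article refines the classification of Tibetan news elements into 10 categories and proposes a Tibetan news element recognition method based on RoBERTa-BiLSTM-CRF. The method first encodes Tibetan news text with the RoBERTa pre-trained language model, then extracts features with a BiLSTM and a self-attention mechanism, and finally uses a conditional random field for sequence labeling to recognize and classify the news elements. In experiments on a self-built dataset (Tibetan news), the F1 score reaches 88.8%.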
Study on Identification of Tibetan News Element Based on RoBERTa-BiLSTM-CRF
News element recognition is the process of extracting key information entities such as time, location, people, organizations, and events from news texts, and it serves as the foundation for news content analysis. While significant progress has been made on Chinese news element recognition, few studies have addressed Tibetan news, and the existing element classification systems are rather coarse, making it difficult to comprehensively cover the key information in Tibetan news reports. Therefore, this paper refines the element classification of Tibetan news into 10 categories. To address the challenges of Tibetan news texts, such as unclear word boundaries, numerous out-of-vocabulary words, and polysemy, we propose a Tibetan news element recognition method based on RoBERTa-BiLSTM-CRF. The method first encodes Tibetan news texts with the RoBERTa pre-trained language model, then extracts features through a BiLSTM and a self-attention mechanism, and finally applies a conditional random field for sequence labeling to recognize and classify the news elements. Experiments on our self-built dataset (Tibetan news) demonstrate the effectiveness of the method, which achieves an F1 score of 88.8%.
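The final step of the pipeline described above, sequence labeling with a conditional random field, can be illustrated with a minimal sketch of Viterbi decoding, the standard way to pick the best tag sequence under a linear-chain CRF. This is not the authors' implementation: the function name `viterbi_decode`, the three BIO tags, and all scores are hypothetical values chosen for illustration, and the abstract does not list the paper's 10 element categories or state how decoding is done. In a real model the emission scores would come from the RoBERTa and BiLSTM layers and the transition scores would be learned.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence under a linear-chain CRF.

    emissions[t][tag] is the per-token score of `tag` at position t;
    transitions[prev][cur] is the score of moving from `prev` to `cur`.
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag]: best previous tag for `tag` at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tags and scores (assumptions, not values from the paper).
TAGS = ["O", "B-LOC", "I-LOC"]
EMISSIONS = [
    {"O": 1.0, "B-LOC": 0.9, "I-LOC": 0.0},
    {"O": 0.2, "B-LOC": 0.1, "I-LOC": 1.0},
    {"O": 1.0, "B-LOC": 0.0, "I-LOC": 0.1},
]
TRANSITIONS = {
    "O": {"O": 0.0, "B-LOC": 0.0, "I-LOC": -10.0},  # O -> I-LOC is invalid in BIO
    "B-LOC": {"O": 0.0, "B-LOC": -1.0, "I-LOC": 0.5},
    "I-LOC": {"O": 0.0, "B-LOC": -1.0, "I-LOC": 0.0},
}

# Per-token argmax ignores transitions and yields the invalid O, I-LOC, O.
greedy = [max(TAGS, key=lambda tag: scores[tag]) for scores in EMISSIONS]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(greedy)
print(decoded)
```

With these toy scores, picking the best tag for each token independently produces a sequence in which `I-LOC` follows `O`, while decoding with transition scores recovers a well-formed span. That difference is the usual motivation for placing a CRF layer on top of a neural encoder in sequence labeling.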

Tibetan; news elements; identify; deep learning; RoBERTa

香前、才藏太、李措

School of Computer Science, Qinghai Normal University, Xining, Qinghai 810016

Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining, Qinghai 810008

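State Key Laboratory of Tibetan Intelligent Information Processing and Application (jointly established by the province and the ministry), Xining, Qinghai 810008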

Tibetan; news elements; recognition; deep learning; RoBERTa

2024

Plateau Science Research (高原科学研究)

Year, volume (issue): 2024, 8(4)