Research on the Construction and Automatic Annotation of a Tibetan Sentiment Corpus

Jianyangcuo (尖羊措)¹, Anjiancairang (安见才让)¹

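Author information

1. School of Computer Science, Qinghai Minzu University, Xining, Qinghai 810007, China; State Key Laboratory of Tibetan Intelligent Information Processing and Application (jointly established by the province and the ministry); Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation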

Abstract

Tibetan sentiment analysis lacks basic training corpora, while models need large amounts of supporting data and traditional manual annotation is labor-intensive, costly, and hard to generalize. To address this, a fine-grained Tibetan sentiment corpus and sentiment dictionary are constructed. First, three annotators each label the sentiment intensity of every word. Next, the corpus is matched against the dictionary according to rules. Finally, the average sentiment intensity score is used to represent the sentiment category of the text. The fine-grained sentiment resources constructed in this paper can, to some extent, shorten the development cycle of large-scale annotated corpora and reduce the manual cost of corpus annotation.
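The three steps in the abstract (three-annotator word scoring, dictionary matching, and labeling by average intensity) can be illustrated with a minimal Python sketch. The abstract does not give the intensity scale, the matching rules, or the decision thresholds, so everything specific here is an assumption: the signed scale, exact-token matching, the `threshold` value, the function names `build_lexicon` and `label_text`, and the placeholder English tokens standing in for segmented Tibetan words.

```python
from statistics import mean

# Hypothetical annotations: each word maps to the intensity scores given by
# three annotators. The signed scale and the words are illustrative only.
RAW_ANNOTATIONS = {
    "good": [2, 3, 2],
    "bad": [-2, -3, -3],
    "fine": [1, 1, 0],
}


def build_lexicon(raw):
    """Average the three annotators' scores into one intensity per word."""
    return {word: mean(scores) for word, scores in raw.items()}


def label_text(tokens, lexicon, threshold=0.5):
    """Label a tokenized text by the mean intensity of its matched words.

    Matching is exact token lookup, a stand-in for the paper's matching
    rules, which the abstract does not spell out. The threshold is assumed.
    """
    matched = [lexicon[t] for t in tokens if t in lexicon]
    if not matched:
        # No dictionary word found: treat the text as neutral.
        return "neutral", 0.0
    score = mean(matched)
    if score > threshold:
        return "positive", score
    if score < -threshold:
        return "negative", score
    return "neutral", score


lexicon = build_lexicon(RAW_ANNOTATIONS)
label, score = label_text(["good", "fine", "unknown"], lexicon)
print(label, round(score, 2))
```

Averaging over matched words only, rather than over all tokens, keeps long texts with few sentiment words from drifting toward neutral; whether the paper makes the same choice is not stated in the abstract.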

Key words

Tibetan sentiment corpus / fine-grained sentiment / sentiment intensity / automatic annotation

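Funding

Open Project of the State Key Laboratory of Tibetan Intelligent Information Processing and Application / Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation (2021-Z-001)

Graduate Innovation Project of the School of Computer Science, Qinghai Minzu University (09M2022004)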

Publication year

2023

Computer Era (计算机时代)

Zhejiang Institute of Computing Technology; Zhejiang Computer Society

Impact factor: 0.411

ISSN: 1006-8228

Number of references: 5
