Low-resource Tibetan Text Classification Based on Prompt Learning (基于提示学习的低资源藏文文本分类)

Abstract: Text classification is one of the fundamental tasks in natural language processing. A shortage of labeled data has long limited the development of natural language processing technologies for Tibetan and other minority languages, because traditional deep learning models require large amounts of labeled data. To address this problem, this paper applies prompt learning on top of large-scale pre-trained language models to low-resource Tibetan text classification, running classification experiments with different Tibetan pre-trained language models and prompt templates. The experimental results show that, with well-designed prompt templates and related techniques, prompt learning can improve Tibetan text classification when training data is insufficient (48.3%), which preliminarily verifies the value and potential of prompt learning for minority language processing. However, the results also show that the prompt learning model performs poorly on some categories, and that Tibetan pre-trained language models still have room for improvement.

Keywords:

Tibetan text classification; pre-trained language model; prompt learning; few-shot learning

Authors:

An Bo (安波), Zhao Weina (赵维纳), Long Congjun (龙从军)

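Affiliations:

Laboratory of Ethnic Language, Culture and Behavior Experiments, Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing 100081

Research Center for Chinese Minority Languages, Chinese Academy of Social Sciences, Beijing 100081

State Key Laboratory of Tibetan Intelligent Information Processing and Application (jointly built by province and ministry), School of Computer Science, Qinghai Normal University, Xining, Qinghai 810008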

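Funding:

National Social Science Fund of China; Independent Project Fund of the State Key Laboratory of Tibetan Intelligent Information Processing and Application; National Natural Science Foundation of China; National Natural Science Foundation of China; Chinese Academy of Social Sciences special project (中国社会科学院数据练专项)

Project numbers:

22BTQ010; 2022-SKL-012; 62076233; 62266036; 2024SJK017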

Publication year:

2024

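Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences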

CSTPCD; CHSSCD; PKU Core Journals (北大核心)

Impact factor: 0.8

ISSN: 1003-0077

Year, volume (issue): 2024, 38(2)

Number of references: 39