
Construction of Tibetan Instruction Datasets for Large Language Models

Instruction tuning is a key technique for enhancing the capabilities of large language models (LLMs) and has attracted broad attention from both academia and industry. LLMs for resource-rich languages such as English and Chinese have achieved results beyond expectations, in large part because large-scale instruction datasets built on abundant language resources can effectively support instruction tuning for the target tasks. For low-resource languages, however, research on and applications of LLMs are still at an early stage. Taking Tibetan as a representative low-resource language, this paper studies how to construct datasets for LLM instruction tuning. First, raw Tibetan data is assembled by collecting Tibetan text from web pages and social media, and this data is filtered, deduplicated, and otherwise preprocessed to form a reasonably high-quality Tibetan corpus. Then, manual annotation tailored to the characteristics of the different data sources produces a high-quality instruction dataset. In addition, to ensure data diversity, the paper collects several high-quality Chinese instruction datasets and uses a translation-based method to construct Tibetan instruction data as a supplement to the manually annotated data. The final result is 384K Tibetan instruction examples covering 12 subtasks, released as open data for related research. Experiments verify that the released Tibetan instruction dataset substantially improves the text generation and understanding abilities of large language models in Tibetan.
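The abstract does not give implementation details of the preprocessing step. As a rough illustration of the kind of filtering and deduplication it describes, here is a minimal Python sketch based on script-ratio filtering and exact-duplicate removal; the thresholds `min_ratio` and `min_len` and the helper names are assumptions, not the authors' released pipeline.

```python
# A minimal sketch (not the authors' released pipeline) of corpus cleaning:
# keep texts that are mostly Tibetan script, drop short and duplicate items.
import hashlib
import re

# Tibetan Unicode block: U+0F00..U+0FFF
TIBETAN_CHAR = re.compile(r"[\u0F00-\u0FFF]")

def tibetan_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Tibetan script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if TIBETAN_CHAR.match(c)) / len(chars)

def clean_corpus(lines, min_ratio=0.7, min_len=16):
    """Yield deduplicated lines that are long enough and mostly Tibetan."""
    seen = set()  # hashes of texts already emitted
    for line in lines:
        line = line.strip()
        if len(line) < min_len or tibetan_ratio(line) < min_ratio:
            continue  # filter: too short or not predominantly Tibetan
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact-duplicate removal
        seen.add(digest)
        yield line
```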
Tibetan Instruction Datasets for Large Language Models
As a crucial technology to enhance the capabilities of large language models (LLMs), instruction tuning has received widespread attention from academia and industry. Research and applications related to LLMs are still in their infancy for low-resource languages because of the lack of extensive public datasets. This paper builds datasets for LLM instruction tuning using Tibetan as a representative low-resource language. First, the original Tibetan corpus is created by gathering Tibetan data from websites and public accounts, which is preprocessed to form high-quality Tibetan datasets. Then, according to the characteristics of the different datasets, targeted manual annotation is carried out to form high-quality instruction datasets. In addition, the paper also gathers some excellent Chinese instruction datasets and uses a translation-based approach to create Tibetan instruction datasets as a supplement in order to guarantee diversity. Finally, a total of 384K Tibetan instruction examples covering 12 subtasks are formed, which will be released for related research. The final experimental results show that the released Tibetan datasets can significantly enhance LLMs' performance.
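The translation-based construction step can be pictured as mapping each Chinese instruction record to a Tibetan one in the same instruction/input/output triple format commonly used for instruction data. The sketch below is an assumption about that workflow, not the paper's actual code; `translate_zh_to_bo` is a hypothetical placeholder for whatever Chinese-to-Tibetan MT system is used, and the JSONL layout is illustrative.

```python
# A minimal sketch of translation-based instruction construction: each
# Chinese record is machine-translated field by field into Tibetan while
# keeping the instruction/input/output triple structure.
import json

def translate_zh_to_bo(text: str) -> str:
    # Hypothetical placeholder: plug in a Chinese-to-Tibetan MT model here.
    raise NotImplementedError("Chinese-to-Tibetan MT system required")

def build_tibetan_record(zh_record: dict) -> dict:
    """Translate one Chinese instruction triple into Tibetan."""
    return {
        "instruction": translate_zh_to_bo(zh_record["instruction"]),
        "input": translate_zh_to_bo(zh_record["input"]) if zh_record.get("input") else "",
        "output": translate_zh_to_bo(zh_record["output"]),
    }

def convert(src_path: str, dst_path: str) -> None:
    """Convert a JSONL file of Chinese instructions into Tibetan ones."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = build_tibetan_record(json.loads(line))
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```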

large language models; low-resource language; Tibetan data; instruction data

朱孟笑 (Zhu Mengxiao), 沙九 (Sha Jiu), 冯冲 (Feng Chong)


School of Information Science and Technology, North China University of Technology, Beijing 100144

Natural Language Processing Department, Baidu, Beijing 100085

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081


2024

Journal of Chinese Information Processing (中文信息学报)
Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences

Indexed in: CSTPCD, CHSSCD, Peking University Core Journals (北大核心)
Impact factor: 0.8
ISSN: 1003-0077
Year, Volume (Issue): 2024, 38(12)