Tibetan Instruction Datasets for Large Language Models
As a crucial technology for enhancing the capabilities of large language models (LLMs), instruction tuning has received widespread attention from academia and industry. For low-resource languages, however, research and applications related to LLMs are still in their infancy because of the lack of extensive public datasets. This paper builds datasets for LLM instruction tuning using Tibetan as a representative low-resource language. First, an original Tibetan corpus is created by gathering Tibetan data from websites and public accounts, which is then preprocessed to form high-quality Tibetan datasets. Next, according to the characteristics of the different datasets, targeted manual annotation is carried out to produce high-quality instruction datasets. In addition, to guarantee diversity, the paper gathers several high-quality Chinese instruction datasets and uses a translation-based approach to create supplementary Tibetan instruction data. Finally, a total of 384K Tibetan instruction examples covering 12 subtasks are formed, which will be released for related research. The experimental results show that the released Tibetan datasets can significantly enhance LLM performance.
Keywords: large language models; low-resource language; Tibetan data; instruction data
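The translation-based supplementation step described in the abstract can be sketched as follows. This is a minimal illustration only: the record schema (`instruction`/`input`/`output`) and the `translate` stub are assumptions for clarity, and the paper's actual pipeline, machine-translation system, and field names may differ.

```python
# Sketch: converting a Chinese instruction-tuning record into a Tibetan
# one via machine translation.  The translate() function below is a
# hypothetical placeholder, NOT the system used in the paper.

def translate(text, src="zh", tgt="bo"):
    """Placeholder for a Chinese-to-Tibetan MT system (hypothetical)."""
    # A real pipeline would call an MT model here; we tag the text so
    # the transformation stays visible in this self-contained sketch.
    return f"[{src}->{tgt}] {text}"

def to_tibetan_record(zh_record):
    """Translate every field of an instruction-tuning record."""
    return {field: translate(value) for field, value in zh_record.items()}

# Example Chinese record (illustrative content, not from the paper's data).
zh_example = {
    "instruction": "请改写下面的句子。",  # "Rewrite the sentence below."
    "input": "今天天气很好。",            # "The weather is nice today."
    "output": "今日天气晴朗。",           # "The weather is clear today."
}

bo_example = to_tibetan_record(zh_example)
print(bo_example["instruction"])
```

In a real pipeline, the translated records would additionally pass through the preprocessing and manual-annotation stages described above before being added to the 384K-example collection.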