Tibetan Pre-training Language Model Combined with Data Enhancement Method
Recently, in the field of natural language processing, pre-training plus fine-tuning has become a new paradigm. This paper collects and organizes a Tibetan text corpus containing 4.655 billion characters, then pre-trains a Tibetan language model based on the UniLM model, enhanced with a data-enhancement method built on Tibetan text features. Experiments show that this method achieves remarkable results on four downstream tasks, including Tibetan La-case sentence classification and Tibetan text classification.
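The abstract does not spell out the data-enhancement procedure. As a purely illustrative sketch (not the paper's actual method), a common text-augmentation scheme such as EDA-style random deletion and swap can be applied at the Tibetan syllable level, assuming syllables are delimited by the tsheg character (U+0F0B):

```python
import random

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)

def augment(text, p_delete=0.1, n_swaps=1, seed=0):
    """EDA-style augmentation over Tibetan syllables: randomly delete
    syllables, then swap a few adjacent pairs. Illustrative sketch only;
    the paper's enhancement method may differ."""
    rng = random.Random(seed)
    syllables = [s for s in text.split(TSHEG) if s]
    # Random deletion: drop each syllable with probability p_delete,
    # but never return an empty sequence.
    kept = [s for s in syllables if rng.random() > p_delete] or syllables[:1]
    # Random swap: exchange n_swaps adjacent pairs.
    for _ in range(n_swaps):
        if len(kept) > 1:
            i = rng.randrange(len(kept) - 1)
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
    return TSHEG.join(kept)

# Example: augment a Tibetan greeting ("tashi delek").
original = "བཀྲ་ཤིས་བདེ་ལེགས"
print(augment(original))
```

Augmented copies generated this way can be mixed into the pre-training corpus to increase its effective diversity.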
Keywords: Tibetan pre-training language model; text data enhancement method; UniLM model