To address the problem of limited Tibetan corpus resources and the scarcity of available pre-trained models for training, two pre-trained models with strong encoding capabilities are established: T-Transformer-XL and T-XLNet. These models are trained on a self-built large-scale Tibetan dataset, T-News. Considering the unique structure of the Tibetan script, the byte-pair encoding in the SentencePiece tokenization model is used to tokenize the Tibetan data. The tokenization strategy and objective function are adjusted to address Tibetan text generation under different computational budgets and application scenarios. The T-Transformer-XL model incorporates the segment-level recurrence mechanism and relative positional encoding to effectively model the contextual features of long texts, while the T-XLNet model applies permutation language modeling, using a two-stream self-attention mechanism to extract text features. Finally, a self-supervised manifold-based data augmentation method is employed, using a masked language model to generate realistic augmented samples that enrich the output text of the pre-trained models. Experimental results show that T-Transformer-XL and T-XLNet perform excellently in text generation tasks. The appropriate model can be selected according to the specific task requirements, available computational resources, and performance demands to achieve optimal application results.
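As a hedged illustration of the tokenization step described above, the minimal sketch below trains a byte-pair-encoding SentencePiece model on a raw Tibetan text file and applies it to a sample string; the file name t_news_raw.txt, the vocabulary size, and the character coverage are placeholder assumptions for the example, not settings reported in the paper.

import sentencepiece as spm

# Train a BPE SentencePiece model on raw Tibetan text.
# Input file, vocab size, and character_coverage are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="t_news_raw.txt",        # one Tibetan sentence per line (assumed file)
    model_prefix="tibetan_bpe",    # writes tibetan_bpe.model / tibetan_bpe.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,     # retain nearly all Tibetan characters
)

# Load the trained model and split a Tibetan string into subword pieces and ids.
sp = spm.SentencePieceProcessor(model_file="tibetan_bpe.model")
pieces = sp.encode("བོད་ཡིག", out_type=str)
ids = sp.encode("བོད་ཡིག", out_type=int)
print(pieces)
print(ids)

The resulting subword vocabulary can then be shared by both pre-trained models, so the same tokenizer output feeds either T-Transformer-XL or T-XLNet depending on the chosen training objective.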