
Punctuation Restoration Method Based on MEGA Network and Hierarchical Prediction

Punctuation restoration, also known as punctuation prediction, is the task of adding appropriate punctuation marks to unpunctuated text to improve its readability, and is a classic Natural Language Processing (NLP) task. With the development of pretrained models and deepening research on punctuation restoration, performance on this task has continuously improved. However, Transformer-based pretrained models have limitations in extracting local information from long input sequences, which hinders the prediction of the final punctuation marks. In addition, previous studies treat punctuation labels merely as symbols to be predicted, overlooking the contextual attributes of different punctuation marks and the relationships between them. To address these issues, this study introduces a Moving average Equipped Gated Attention (MEGA) network as an auxiliary module to enhance the model's ability to extract local information, and constructs a hierarchical prediction module that fully exploits the contextual attributes of different punctuation marks and the relationships between them for the final classification. Experiments are conducted with various Transformer-based pretrained models on datasets in different languages. The results on the English punctuation dataset IWSLT show that applying the MEGA and hierarchical prediction modules yields performance gains on most pretrained models; with DeBERTaV3 xlarge, the F1 score reaches 85.5% on the IWSLT REF test set, a 1.2 percentage point improvement over the baseline. The method also achieves high accuracy on a Chinese punctuation dataset.
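
To make the role of the MEGA module concrete, the sketch below implements the damped Exponential Moving Average (EMA) that gives MEGA its local inductive bias; the full MEGA layer combines this EMA output with single-head gated attention. This is a minimal PyTorch illustration, not the authors' implementation: the module name SimpleEMA, the per-dimension parameters alpha and delta, and the explicit time-step loop are simplifications for clarity (the original computes a multi-dimensional EMA in parallel).

import torch
import torch.nn as nn

class SimpleEMA(nn.Module):
    """Per-dimension damped EMA over a token sequence (local smoothing)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        # Learnable decay (alpha) and damping (delta), squashed into (0, 1).
        self.alpha = nn.Parameter(torch.rand(hidden_size))
        self.delta = nn.Parameter(torch.rand(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size) token representations from the encoder
        alpha = torch.sigmoid(self.alpha)
        delta = torch.sigmoid(self.delta)
        state = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.size(1)):
            # y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}
            state = alpha * x[:, t] + (1.0 - alpha * delta) * state
            outputs.append(state)
        return torch.stack(outputs, dim=1)

Because each output mixes the current token with an exponentially decaying summary of its left context, nearby tokens dominate the representation, which is exactly the local bias that plain self-attention over long inputs lacks.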
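
The hierarchical prediction module is described only at a high level in the abstract, so the following is one plausible reading rather than the paper's exact design: a coarse head first decides whether any punctuation follows a token, and a fine head then selects the concrete mark, so all marks share the coarse "is punctuated" decision. The head names has_punct and which_mark and the three-mark IWSLT label set (comma, period, question mark) are assumptions for illustration.

import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Two-stage classifier: is the token punctuated, and if so, with which mark."""
    def __init__(self, hidden_size: int, num_marks: int = 3):
        super().__init__()
        self.has_punct = nn.Linear(hidden_size, 2)           # coarse: none vs. some mark
        self.which_mark = nn.Linear(hidden_size, num_marks)  # fine: comma / period / question

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_size) contextual token states
        coarse = torch.softmax(self.has_punct(h), dim=-1)    # [P(none), P(punct)]
        fine = torch.softmax(self.which_mark(h), dim=-1)     # P(mark | punct)
        # Chain the two stages into one distribution over {none} + marks.
        p_none = coarse[..., :1]
        p_marks = coarse[..., 1:] * fine
        return torch.cat([p_none, p_marks], dim=-1)

Chaining the probabilities this way separates the "no punctuation" decision from the choice among marks, which is one natural way to exploit the relationships between punctuation labels that the abstract describes.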

Keywords: punctuation restoration; Natural Language Processing (NLP); pretrained model; Transformer structure; hierarchical prediction

ZHANG Wenbo, HUANG Hao, WU Di, TANG Minjie


School of Computer Science and Technology, Xinjiang University, Urumqi 830046, Xinjiang, China

Xinjiang Key Laboratory of Multilingual Information Technology, Urumqi 830046, Xinjiang, China



Computer Engineering
East China Institute of Computing Technology; Shanghai Computer Society

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.581
ISSN: 1000-3428
Year, Volume (Issue): 2024, 50(12)