Big Data (大数据), 2024, Vol. 10, Issue 3: 119-132. DOI: 10.11959/j.issn.2096-0271.2023039

Multi-teacher distillation BERT model in NLU tasks
(面向自然语言理解的多教师BERT模型蒸馏研究)

石佳来 1, 郭卫斌 1

Author information

  • 1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China

Abstract

Knowledge distillation is a model compression scheme commonly used to address the large size and slow inference of deep pre-trained models such as BERT. "Multi-teacher distillation" can further improve the performance of the student model, but the traditional "one-to-one" strategy, which forcibly assigns each student intermediate layer to a single teacher layer, discards most of the intermediate features. A "single-layer-to-multi-layer" mapping is proposed to solve the problem of aligning intermediate layers during knowledge distillation, helping the student model acquire the syntactic, coreference and other knowledge carried in the teachers' intermediate layers. Experiments on several GLUE datasets show that the student model retains 93.9% of the teachers' average inference accuracy while using only 41.5% of their average parameter size.
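To make the "single-layer-to-multi-layer" idea concrete, the following minimal PyTorch sketch (not the authors' released code; the layer counts, hidden sizes and the uniform weighting are illustrative assumptions) aligns every student intermediate layer with all intermediate layers of every teacher through an MSE term, instead of forcing a one-to-one layer assignment.

    # Illustrative sketch of multi-teacher intermediate-layer distillation with a
    # "single-layer-to-multi-layer" mapping: each student layer is compared against
    # every intermediate layer of every teacher (uniform weighting assumed here).
    import torch
    import torch.nn.functional as F

    def layer_mapping_loss(student_hiddens, teacher_hiddens_per_teacher):
        """student_hiddens: list of [batch, seq, hidden] tensors, one per student layer.
        teacher_hiddens_per_teacher: list with one entry per teacher, each a list of
        [batch, seq, hidden] tensors (same hidden size assumed for simplicity)."""
        loss = 0.0
        for s_h in student_hiddens:                       # each student layer ...
            for t_layers in teacher_hiddens_per_teacher:  # ... sees every teacher ...
                for t_h in t_layers:                      # ... and every teacher layer
                    loss = loss + F.mse_loss(s_h, t_h)
        # normalize so the scale does not depend on the number of layers or teachers
        n_terms = len(student_hiddens) * sum(len(t) for t in teacher_hiddens_per_teacher)
        return loss / max(n_terms, 1)

    if __name__ == "__main__":
        batch, seq, hidden = 2, 8, 768
        student = [torch.randn(batch, seq, hidden) for _ in range(6)]     # 6-layer student
        teachers = [[torch.randn(batch, seq, hidden) for _ in range(12)]  # two 12-layer teachers
                    for _ in range(2)]
        print(layer_mapping_loss(student, teachers))

In a full training setup this term would be combined with the usual soft-label and task losses, and the uniform average over teacher layers could be replaced by learned or attention-based weights.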

Key words

deep pre-trained model / BERT / multi-teacher distillation / natural language understanding


Funding

National Natural Science Foundation of China (62076094)

Year of publication

2024
Big Data (大数据)
Posts & Telecom Press

Big Data (大数据)

CSTPCD
ISSN: 2096-0271