Text Classification by Combining Dynamic Mask Attention and Multi-teacher Multi-feature Knowledge Distillation
Knowledge distillation compresses knowledge from large-scale models into lightweight models, improving the efficiency of text classification. This paper introduces a text classification model that combines a dynamic mask attention mechanism with multi-teacher, multi-feature knowledge distillation. It leverages knowledge from multiple teacher models, including RoBERTa and ELECTRA, while incorporating semantic information from different feature layers. The dynamic mask attention mechanism adapts to varying input lengths, reducing interference from irrelevant padding. Experimental results on four publicly available datasets show that the student model (TinyBERT) distilled with the proposed method outperforms other benchmark distillation strategies. Notably, with only 1/10 of the teacher models' parameters and roughly half the average runtime, it achieves classification results comparable to the two teacher models, with only a modest decrease in accuracy (4.18% and 3.33%) and F1 score (2.30% and 2.38%). Attention heat maps indicate that the dynamic mask attention mechanism strengthens focus on the informative parts of the input.
dynamic mask attention; multi-teacher multi-feature; knowledge distillation; text classification
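The abstract does not give the exact formulation of the dynamic mask attention mechanism; the sketch below only illustrates the general idea it describes: deriving an attention mask from each sequence's true (unpadded) length so that padded positions receive zero attention weight. The function name `dynamic_mask_attention` and its arguments are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dynamic_mask_attention(q, k, v, lengths):
    """Scaled dot-product attention that masks out padded key positions.

    q, k, v: (batch, seq_len, dim) tensors.
    lengths: (batch,) true (unpadded) length of each sequence.
    """
    batch, seq_len, dim = q.shape
    scores = q @ k.transpose(-2, -1) / dim ** 0.5        # (batch, seq_len, seq_len)

    # Build a per-example mask from the true lengths: True marks padding.
    positions = torch.arange(seq_len, device=q.device)     # (seq_len,)
    pad = positions.unsqueeze(0) >= lengths.unsqueeze(1)   # (batch, seq_len)

    # Padded key positions get -inf so softmax assigns them zero weight.
    scores = scores.masked_fill(pad.unsqueeze(1), float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v, attn

# Illustrative usage: two sequences padded to length 6, true lengths 4 and 6.
if __name__ == "__main__":
    x = torch.randn(2, 6, 32)
    out, attn = dynamic_mask_attention(x, x, x, torch.tensor([4, 6]))
    print(attn[0].sum(-1))  # each row still sums to 1; padded keys carry zero weight
```

Because the mask is computed per example from the actual sequence length, the attention distribution concentrates on real tokens regardless of how much padding a batch contains, which is the effect the attention heat maps in the abstract are said to confirm.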