Decoupled Knowledge Distillation Based on Diffusion Model
Knowledge distillation (KD) is a technique that transfers knowledge from a complex model (the teacher) to a simpler model (the student). While many popular distillation methods focus on intermediate feature layers, response-based knowledge distillation has regained its place among state-of-the-art methods since decoupled knowledge distillation (DKD) was introduced. DKD leverages strong consistency constraints to split the classical KD loss into two parts, addressing the problem of high coupling. However, this approach overlooks the significant representation gap caused by the disparity between teacher and student network architectures, so smaller student models cannot effectively learn the teacher's knowledge. To address this problem, this study proposes a diffusion model that narrows the representation gap between teacher and student. Teacher features are used to train a lightweight diffusion model, which then denoises the student's features, thereby reducing the representation gap between the two models. Extensive experiments demonstrate that the proposed method achieves significant improvements over baseline models on the CIFAR-100 and ImageNet datasets, and maintains good performance even when the gap between teacher and student architectures is large.
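The following is a minimal sketch of the idea described above, not the authors' implementation: a lightweight diffusion model is trained to denoise noised teacher features (epsilon prediction with a linear noise schedule), and at distillation time the student's features are treated as a noisy sample and denoised toward the teacher's representation space. The class and function names (LightDenoiser, diffusion_loss, denoise_student), the MLP architecture, the timestep count, and the single-step denoising are all assumptions made for illustration.

```python
# Hedged sketch (assumed design, not the paper's code): train a small denoiser
# on teacher features, then use it to denoise student features during KD.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # standard linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, 0)  # cumulative signal-retention terms


class LightDenoiser(nn.Module):
    """Small MLP that predicts the noise added to a feature vector (assumed architecture)."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x_t, t):
        # Concatenate a normalized timestep so the model knows the noise level.
        t_emb = (t.float() / T).unsqueeze(-1)
        return self.net(torch.cat([x_t, t_emb], dim=-1))


def diffusion_loss(denoiser, teacher_feat):
    """Train the denoiser on noised teacher features (epsilon-prediction objective)."""
    b = teacher_feat.size(0)
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(teacher_feat)
    a_bar = alphas_bar[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * teacher_feat + (1 - a_bar).sqrt() * eps
    return F.mse_loss(denoiser(x_t, t), eps)


def denoise_student(denoiser, student_feat, t_start: int = 250):
    """Treat the student feature as a noisy sample and take one denoising step."""
    b = student_feat.size(0)
    t = torch.full((b,), t_start)
    a_bar = alphas_bar[t].unsqueeze(-1)
    eps_hat = denoiser(student_feat, t)
    # Estimated "clean" feature, pushed toward the teacher's representation space,
    # on which a KD loss against the teacher feature could then be computed.
    return (student_feat - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```

In this reading, diffusion_loss is minimized with teacher features (the denoiser learns the teacher's feature distribution), and denoise_student is applied to the student's projected features before the distillation loss, which is one plausible way to narrow the teacher-student representation gap.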