
Knowledge distillation with multi-level feature fusion and dual-teacher collaboration

Objective: Knowledge distillation aims to transfer the knowledge of a powerful teacher model with a large number of parameters to a lightweight student model without degrading the performance of the original model. In image classification, most previous distillation methods focus on extracting global information while overlooking the importance of local information. Moreover, these methods are mostly built around a single-teacher architecture and ignore the potential for the student to learn from multiple teachers simultaneously. Therefore, a dual-teacher collaborative knowledge distillation framework that fuses global and local features is proposed.

Method: First, a randomly initialized teacher (scratch teacher) is trained synchronously with the student on global information, and its temporary global outputs gradually guide the student toward the teacher's final prediction along an optimal path. Meanwhile, a pre-trained teacher (expert teacher) is introduced to process local information. The expert teacher separates its local feature outputs into source-category knowledge and other-category knowledge and transfers them to the student separately, providing more comprehensive supervision.

Result: Experiments are conducted on the CIFAR-100 (Canadian Institute for Advanced Research) and Tiny-ImageNet datasets and compared with other distillation methods. On CIFAR-100, compared with the recent NKD (normalized knowledge distillation), the average classification accuracy improves by 0.63% and 1.00% under homogeneous and heterogeneous teacher-student architectures, respectively. On Tiny-ImageNet, with the ResNet34 (residual network) and MobileNetV1 teacher-student combination, the classification accuracy is 1.09% higher than SRRL (knowledge distillation via softmax regression representation learning) and 1.06% higher than NKD. Ablation experiments and visualization analyses on CIFAR-100 further verify the effectiveness of the proposed method.

Conclusion: The proposed dual-teacher collaborative knowledge distillation framework fuses global and local features and separates the model's output responses into source-category knowledge and other-category knowledge, which are transferred to the student separately, enabling the student model to achieve higher image classification accuracy.
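The abstract states that the expert teacher's response is separated into source-category (target-class) knowledge and other-category (non-target-class) knowledge before being transferred to the student, but it does not give the exact loss. The following is a minimal PyTorch-style sketch of one common way to realize such a separation, in the spirit of decoupled logit distillation; the function name `separated_response_kd`, the temperature value, and the equal weighting of the two terms are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F


def separated_response_kd(student_logits, teacher_logits, target, temperature=4.0):
    """Split the softened teacher response into source-class (target) knowledge
    and other-class (non-target) knowledge, then distill each part separately.
    Temperature and equal term weights are assumptions; the paper may differ."""
    num_classes = student_logits.size(1)
    target_mask = F.one_hot(target, num_classes).bool()  # marks the source class

    s_prob = F.softmax(student_logits / temperature, dim=1)
    t_prob = F.softmax(teacher_logits / temperature, dim=1)

    # Source-category knowledge: binary distribution (target class vs. all others).
    s_src = torch.stack([(s_prob * target_mask).sum(1), (s_prob * ~target_mask).sum(1)], dim=1)
    t_src = torch.stack([(t_prob * target_mask).sum(1), (t_prob * ~target_mask).sum(1)], dim=1)
    loss_src = F.kl_div(s_src.log(), t_src, reduction="batchmean")

    # Other-category knowledge: distribution over non-target classes only,
    # obtained by suppressing the target logit before the softmax.
    s_other = F.log_softmax(student_logits / temperature - 1000.0 * target_mask, dim=1)
    t_other = F.softmax(teacher_logits / temperature - 1000.0 * target_mask, dim=1)
    loss_other = F.kl_div(s_other, t_other, reduction="batchmean")

    return (loss_src + loss_other) * temperature ** 2
```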
Knowledge distillation of multi-level feature fusion and dual-teacher collaboration
Objective Knowledge distillation aims to transfer the knowledge of a teacher model with powerful performance and a large number of parameters to a lightweight student model and improve its performance without affecting the performance of the original model. Previous research on knowledge distillation has mostly focused on distillation from one teacher to one student and neglected the potential for students to learn from multiple teachers simultaneously. Multi-teacher distillation can help the student model synthesize the knowledge of each teacher model, thereby improving its expressive ability. Few studies have examined distillation from teacher models across these different situations, and learning from multiple teachers at the same time can integrate additional useful knowledge and information and consequently improve student performance. In addition, most existing knowledge distillation methods focus only on the global information of the image and ignore the importance of spatial local information. In image classification, local information refers to the features and details of specific regions in the image, including textures, shapes, and boundaries, which play important roles in distinguishing image categories. The teacher network can distinguish local regions based on these details and make accurate predictions for similar appearances in different categories, but the student network may fail to do so. To address these issues, this article proposes a knowledge distillation method based on global and local dual-teacher collaboration, which integrates global and local information and effectively improves the classification accuracy of the student model.

Method The original input image is first represented as global and local image views. The original image (global image view) is randomly cropped locally, with the ratio of the cropped area to the original image specified within 40%-70%, to obtain the local input information (local image view). Afterward, a teacher (scratch teacher) is randomly initialized and trained synchronously with the student to process global information, and its temporary global outputs are used to gradually help the student approach the teacher's final prediction along the optimal path. Meanwhile, a pre-trained teacher (expert teacher) is introduced to process local information. The proposed method uses a dual-teacher distillation architecture to jointly train the student network while integrating global and local features. On the one hand, the scratch teacher trains together with the student and processes global information from scratch. With the scratch teacher, the supervision is no longer just the final smooth output of the pre-trained model (expert teacher); instead, the scratch teacher uses its temporary output to gradually guide the student model, forcing it to approach the final output logits with higher accuracy through the optimal path. During training, the student model obtains not only the difference between the target and the scratch output but also a possible path to the final goal provided by a complex model with strong learning ability. On the other hand, the expert teacher processes local information and separates the output local features into source-category knowledge and other-category knowledge. Under this collaborative teaching, the student model reaches a local optimum, and its performance becomes close to that of the teacher model.

Result The proposed method is compared with other knowledge distillation methods used in the field of image classification. The experimental datasets include CIFAR-100 and Tiny-ImageNet, and image classification accuracy is used as the evaluation index. On the CIFAR-100 dataset, compared with the optimal feature distillation method SemCKD, the average distillation accuracy of the proposed method increased by 0.62% under the same teacher and student architectures; with heterogeneous teachers and students, the average accuracy increased by 0.89%. Compared with the state-of-the-art response distillation method NKD, the average classification accuracy of the proposed method increased by 0.63% and 1.00% for homogeneous and heterogeneous teacher-student pairs, respectively. On the Tiny-ImageNet dataset, with ResNet34 as the teacher network and ResNet18 as the student network, the final test accuracy of the proposed method reached its optimal level of 68.86%, which was 0.74% higher than that of NKD and other competing models. The method also achieved the highest classification accuracy under different teacher and student architecture combinations. Ablation experiments and visual analysis are also conducted on CIFAR-100 to demonstrate the effectiveness of the proposed method.

Conclusion A dual-teacher collaborative knowledge distillation framework that integrates global and local information is proposed in this paper. The method separates the output features into source-category knowledge and other-category knowledge and transfers them to the student separately. Experimental results show that the proposed method outperforms several state-of-the-art knowledge distillation methods in image classification and can significantly improve the performance of the student model.
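To make the pipeline described in the Method section concrete, the sketch below assembles one training step of the dual-teacher collaboration: a local view is obtained by randomly cropping 40%-70% of the image area, the scratch teacher is optimized together with the student on the global view, and the frozen expert teacher supervises the student on the local view (reusing `separated_response_kd` from the earlier sketch). The function names, the equal loss weighting, and the use of plain KL divergence for the global branch are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Local image view: a random crop covering 40%-70% of the original area,
# resized back to the input resolution (32x32 assumed for CIFAR-100-sized inputs).
local_crop = transforms.RandomResizedCrop(32, scale=(0.4, 0.7))


def train_step(student, scratch_teacher, expert_teacher, optimizer, images, labels,
               temperature=4.0):
    """One illustrative training step of the dual-teacher collaboration.
    The optimizer is assumed to hold both student and scratch-teacher parameters;
    the expert teacher is pre-trained and kept frozen."""
    global_view = images
    local_view = local_crop(images)

    # Scratch teacher and student are trained together on the global view,
    # so the student follows the scratch teacher's evolving (temporary) outputs.
    s_global = student(global_view)
    t_scratch = scratch_teacher(global_view)

    # The frozen expert teacher supervises the student on the local view.
    with torch.no_grad():
        t_expert = expert_teacher(local_view)
    s_local = student(local_view)

    # Hard-label supervision for student and scratch teacher.
    loss_ce = F.cross_entropy(s_global, labels) + F.cross_entropy(t_scratch, labels)

    # Global branch: student mimics the scratch teacher's softened response.
    loss_global = F.kl_div(F.log_softmax(s_global / temperature, dim=1),
                           F.softmax(t_scratch.detach() / temperature, dim=1),
                           reduction="batchmean") * temperature ** 2

    # Local branch: separated source/other-category distillation from the expert teacher.
    loss_local = separated_response_kd(s_local, t_expert, labels, temperature)

    loss = loss_ce + loss_global + loss_local  # equal weights assumed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```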

knowledge distillation (KD); image classification; lightweight model; collaborative distillation; feature fusion

Wang Shuo (王硕), Yu Lu (余璐), Xu Changsheng (徐常胜)


School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300382, China

State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Keywords: knowledge distillation (KD); image classification; lightweight model; collaborative distillation; feature fusion

2024

Journal of Image and Graphics
Sponsors: Institute of Remote Sensing Applications, Chinese Academy of Sciences; China Society of Image and Graphics; Institute of Applied Physics and Computational Mathematics, Beijing

Indexed in: CSTPCD; Peking University Core Journals list
Impact factor: 1.111
ISSN: 1006-8961
Year, volume (issue): 2024, 29(12)