Aligning Sizes of Intermediate Layers by LoRA Adapter for Knowledge Distillation
Published in The Sixth Workshop on Insights from Negative Results in NLP, 2025
Abstract
Intermediate Layer Distillation (ILD) is a variant of Knowledge Distillation (KD), a method for compressing neural networks. ILD requires mapping to align the intermediate layer sizes of the teacher and student models to compute the loss function in training, while this mapping is not used during inference. This inconsistency may reduce the effectiveness of learning in intermediate layers. In this study, we propose LoRAILD, which uses LoRA adapters to eliminate the inconsistency. However, our experimental results show that LoRAILD does not outperform existing methods. Furthermore, contrary to previous studies, we observe that conventional ILD does not outperform vanilla KD. Our analysis of the distilled models’ intermediate layers suggests that ILD does not improve language models’ performance.
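To make the size mismatch concrete, the following is a minimal sketch of a conventional ILD loss, assuming a linear mapping and a mean-squared-error objective. This is a common choice, not necessarily the exact setup used in the paper. The names `project`, `ild_loss`, and `weight`, as well as the toy dimensions, are hypothetical. The mapping `weight` appears only in the training loss and would be discarded at inference, which is the inconsistency the abstract describes. The sketch does not show LoRAILD itself.

```python
import random

def project(hidden, weight):
    """Map hidden states (tokens x d_student) to the teacher size (tokens x d_teacher)."""
    return [[sum(h * w for h, w in zip(row, col)) for col in zip(*weight)]
            for row in hidden]

def ild_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between projected student states and teacher states."""
    projected = project(student_hidden, weight)
    squared = [(p - t) ** 2
               for p_row, t_row in zip(projected, teacher_hidden)
               for p, t in zip(p_row, t_row)]
    return sum(squared) / len(squared)

# Toy sizes (assumptions for illustration only).
random.seed(0)
tokens, d_student, d_teacher = 5, 4, 8
student_hidden = [[random.gauss(0, 1) for _ in range(d_student)] for _ in range(tokens)]
teacher_hidden = [[random.gauss(0, 1) for _ in range(d_teacher)] for _ in range(tokens)]

# Training-only mapping: used to compute the loss, not part of the student at inference.
weight = [[random.gauss(0, 1) for _ in range(d_teacher)] for _ in range(d_student)]

loss = ild_loss(student_hidden, teacher_hidden, weight)
```

Without `weight`, the two hidden-state matrices have different widths (4 versus 8 here) and cannot be compared element by element.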
Recommended citation:
Takeshi Suzuki, Hiroaki Yamada, and Takenobu Tokunaga. 2025. Aligning Sizes of Intermediate Layers by LoRA Adapter for Knowledge Distillation. In The Sixth Workshop on Insights from Negative Results in NLP, pages 100-105.