Preface:

Knowledge distillation is a method of distilling the knowledge of cumbersome models, or ensembles of models, and condensing it into a single smaller model that can be deployed in practical applications. Geoffrey Hinton, often called the godfather of AI, and two of his colleagues at Google, Oriol Vinyals and Jeff Dean, introduced knowledge distillation in 2015.

Knowledge distillation refers to the transfer of the learning behavior of an unwieldy model (the teacher) to a smaller model (the student): the outputs generated by the teacher are used as "soft targets" for training the student. Applying this approach, the authors achieved surprising results on the MNIST dataset and showed that significant improvements can be obtained by distilling the knowledge of an ensemble of models into a single model.

Knowledge distillation for image classification

Hinton and his two co-authors first describe knowledge distillation on the task of image classification in their paper Distilling the Knowledge in a Neural Network.

As described in the paper, the simplest form of knowledge distillation is to train the distilled model on a transfer set using a soft target distribution. So the student model is trained with two objectives: one is the correct label (the hard target) and the other is the soft label (the soft target) generated by the teacher network.

The overall objective function is therefore a weighted average of two objective functions. The first is the cross entropy between the student's prediction and the soft targets, and the second is the cross entropy between the student's output and the correct labels. The authors also note that the best results are usually obtained by using a considerably lower weight on the second objective function.
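To make this weighting concrete, here is a minimal PyTorch sketch of the combined objective; the temperature T and the weight alpha are illustrative values I chose, not settings from the paper:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Weighted average of the soft-target and hard-target objectives."""
    # Soft-target term: match the teacher's temperature-softened
    # distribution (KL divergence, which differs from cross entropy
    # against the teacher only by a constant). The T**2 factor keeps
    # gradient magnitudes comparable across temperatures, as noted
    # in Hinton et al. (2015).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross entropy with the correct labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # A lower weight on the hard-target term, as the authors suggest.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```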

The experimental results are shown below:

[Figure: experimental results from Hinton et al. (2015)]

Knowledge distillation for object detection

Guobin Chen and his co-authors published an object detection study at NeurIPS 2017 that combines knowledge distillation with hint learning to learn efficient object detection models.

In their approach, they also use hints, which are feature maps taken from a middle layer of the teacher and used to guide the student to mimic the teacher's behavior as closely as possible. Furthermore, to achieve optimal knowledge distillation, an adaptation layer is needed, which will be discussed later. Faster R-CNN is the object detection network used in the paper's experiments. Their learning scheme is shown below:

[Figure: the learning scheme of Chen et al. (2017)]

The objective function is as follows:

$$L = \frac{1}{N}\sum_{i} L_{RCN}^{i} + \lambda\,\frac{1}{M}\sum_{j} L_{RPN}^{j} + \gamma\, L_{Hint}(V, Z)$$
RCN and RPN stand for the region classification network and the region proposal network respectively. N and M are the batch sizes of the RCN and the RPN respectively. L_RCN, L_RPN, and L_Hint are the losses of the RCN, the RPN, and the hint respectively; λ (usually set to 1) and γ (usually set to 0.5) are hyperparameters used to balance the final loss.
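Assuming the per-region losses are already computed, the final objective can be assembled as in the following sketch; the dummy tensors are mine for illustration, not the paper's code:

```python
import torch

def detection_distillation_loss(l_rcn, l_rpn, l_hint, lam=1.0, gamma=0.5):
    # l_rcn and l_rpn hold per-region loss values for the region
    # classification network (N entries) and the region proposal
    # network (M entries); taking their means implements the 1/N and
    # 1/M batch averages in the objective above.
    # l_hint is the scalar hint loss between guided and hint features.
    return l_rcn.mean() + lam * l_rpn.mean() + gamma * l_hint

# Illustrative usage with dummy per-region losses (N=16, M=64):
loss = detection_distillation_loss(
    l_rcn=torch.rand(16), l_rpn=torch.rand(64), l_hint=torch.tensor(0.3)
)
```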

Hint learning

In FitNets: Hints for Thin Deep Nets, Adriana Romero and her co-authors demonstrate that the performance of a student network can be improved by using intermediate representations of the teacher network as hints to help the student during training. Concretely, the loss between the hint feature Z (a feature map taken from a middle layer of the teacher) and the guided feature V (a feature map from a middle layer of the student) is computed using the L1 or L2 distance.
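A minimal sketch of that distance in PyTorch, assuming the two feature maps already have the same shape (the adaptation layer discussed below handles the case where they do not):

```python
import torch

def hint_loss(guided_v, hint_z, p=2):
    """L1 (p=1) or L2 (p=2) distance between the student's guided
    feature map V and the teacher's hint feature map Z."""
    if p == 1:
        return (guided_v - hint_z).abs().mean()
    return ((guided_v - hint_z) ** 2).mean()

# Example with identically shaped feature maps:
v = torch.randn(1, 256, 50, 50)
z = torch.randn(1, 256, 50, 50)
loss = hint_loss(v, z, p=2)
```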

The figure below shows feature maps extracted from a pre-trained YOLOv4 model trained on the Waymo dataset, from one of my projects on knowledge distillation for object detection. In these examples, the input image is resized to 800×800.

[Figure: feature maps from the pre-trained YOLOv4 teacher]
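In PyTorch, such intermediate feature maps are typically captured with forward hooks. The sketch below uses a torchvision resnet18 as a stand-in teacher so it stays runnable; in my project the hook would be registered on a layer of the pre-trained YOLOv4 instead:

```python
import torch
from torchvision.models import resnet18

teacher = resnet18(weights=None).eval()  # stand-in for the real teacher
features = {}

def save_output(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

# Register a hook on a middle layer to capture its feature map as the hint.
handle = teacher.layer3.register_forward_hook(save_output("hint"))

with torch.no_grad():
    teacher(torch.randn(1, 3, 800, 800))  # input resized to 800x800

hint_feature = features["hint"]  # (1, 256, 50, 50) for this network
handle.remove()
```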

Knowledge distillation and hint learning

Hint learning requires the hint feature and the guided feature to have the same shape (height × width × channels). Moreover, even when the shapes match, the hint features and the guided features rarely lie in similar feature spaces, so an adaptation layer (usually a 1×1 convolution layer) is used to help improve the transfer of knowledge from the teacher to the student.
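A minimal sketch of such an adaptation layer, with hypothetical channel counts (128 for the student, 256 for the teacher):

```python
import torch
import torch.nn as nn

# 1x1 convolution mapping the student's channels to the teacher's.
adapter = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=1)

guided_v = torch.randn(1, 128, 50, 50)  # student's guided feature V
hint_z = torch.randn(1, 256, 50, 50)    # teacher's hint feature Z

# After adaptation, the two maps have the same shape and the hint loss
# can be computed element-wise.
loss = ((adapter(guided_v) - hint_z) ** 2).mean()
```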

The figure below depicts the learning scheme I am using in my object detection project, where a small network with three detection levels distills knowledge from a pre-trained YOLOv4.

[Figure: my learning scheme for distilling YOLOv4 into a small three-level detector]

Guobin Chen and his co-authors report excellent results on object detection by combining knowledge distillation with hint learning.

Conclusion

In this article, I briefly introduced knowledge distillation and hint learning. Knowledge distillation is regarded as an effective method for compressing the knowledge of a complex ensemble of models into a smaller distilled model. Hint learning combined with knowledge distillation is a very powerful scheme for improving the performance of neural networks.

Original link:

Leantran.medium.com/a-gentle-in…
