Author: Chen Wenru
1. Abstract
When people learn new knowledge, they can quickly pick up similar knowledge by building on prior knowledge, without forgetting what they already know. Machines, or more precisely neural networks, instead suffer from catastrophic forgetting as they learn new tasks. The study of how to solve this problem is what we call continual learning. This paper reviews some classical continual-learning methods from recent years, aiming to understand the problem better, to work toward solving it in depth, and to provide a convenient reference for future work.

Key words: catastrophic forgetting, continual learning
2. Introduction
When people learn new knowledge, they can quickly learn similar knowledge by building on prior knowledge, without forgetting what they already know. Machines, or more precisely neural networks, have a problem as they learn new tasks: catastrophic forgetting, meaning that after the model learns a new task B, it can no longer make correct predictions on the old task A. Catastrophic forgetting can be very serious in practice. Take anomaly detection of aircraft parts: if the detector learns to inspect a newly added part but forgets the earlier detection method, then the first time a defect slips through, the result could be an unpredictable disaster. That is why the phenomenon is called catastrophic. The study of how to overcome it is what we call continual learning (also written continuous learning or lifelong learning). Continual learning asks the model to do what humans do naturally: solve the current task quickly and accurately on the basis of prior knowledge. This innate human ability is, for a model, like finding a needle in a haystack. Because it requires the ability to keep building on earlier learning, it is also aptly called lifelong learning. Continual learning differs from meta learning and transfer learning; they are similar but not the same. The latter two aim at fast learning based on experience: for example, if you know that 2×10 = 20, then you can quickly learn that 2×20 = 40. The focus of continual learning, in contrast, is forgetting. After reading the literature on continual learning, this paper sketches the general landscape of the field, which should provide better help when continual learning is put to practical use. The main idea of continual learning is to constrain the direction of the gradient; the methods introduced in this paper are all based on gradient constraints, and they are classical methods with good results.
3. Elastic Weight Consolidation
The inspiration for Elastic Weight Consolidation (EWC) comes from mammalian memory: it has been found that mammalian brains may protect previously acquired knowledge through cortical circuitry to avoid catastrophic forgetting. In experiments, when a mouse needs to remember a new skill, some synapses in its brain are strengthened (the number of dendritic spines on individual neurons increases). Even after the subsequent learning of other tasks, these added dendritic spines are maintained, so that the related ability is still retained months later. But when these dendritic spines are selectively erased, the related skill is forgotten. This suggests that protecting these strengthened synapses is essential for preserving task ability. The main idea of the EWC algorithm is based on the above findings. The specific approach can be summarized as follows: not every parameter in the neural network has a great influence on the results, so when learning a new task, the effect of continual learning can be achieved by restricting the updates of the parameters that have the greatest influence on the old tasks.
3.1 The specific method
Suppose we have two learning tasks A and B, and let θ*_A denote the parameters the model has learned for task A. Task A is learned first and reaches a stable result. In order for the model not to forget task A, we need to keep the parameters within a low-error region for task A. EWC therefore adds a quadratic penalty anchored at θ*_A while learning the new task, as shown in Figure 1 of the paper. The process is analogous to springs: for parameters important to task A, the spring should be stiff, so that only a large gain on the new task can pull them away from θ*_A, and the memory of task A is better retained; for the remaining parameters the spring can be soft, so that task B can still be learned well. In this way, the memory of both tasks is retained. Every parameter gets its own stiffness, and the parameters with the greatest impact on task A should get the greatest stiffness. So how do we choose this strength for each parameter?
3.2 Calculating the strength
The authors compute this strength probabilistically. Given a data set D, the conditional probability of the parameters θ given D can be derived from the prior probability of θ via Bayes' rule:

log p(θ|D) = log p(D|θ) + log p(θ) − log p(D)

The log-likelihood term log p(D|θ) in the formula above is, quite simply, the negative of the loss for the problem at hand: −L(θ). The above is a derivation for a single task; now suppose there are two tasks A and B. The formula can be re-derived as:

log p(θ|D) = log p(D_B|θ) + log p(θ|D_A) − log p(D_B)

The left side is still the posterior probability of the parameters given all the data, while the right side depends only on task B's data; everything about task A has been absorbed into the posterior p(θ|D_A). Since this posterior probability is intractable, the authors approximate it, following the Laplace approximation, by a Gaussian distribution whose mean is given by the task-A parameters θ*_A and whose diagonal precision is given by the diagonal of the Fisher Information Matrix F. F is shown to have three important properties: (a) it is equivalent to the second derivative of the loss function near a minimum; (b) it can be obtained from first derivatives of the loss alone, so it is easy to compute; (c) it is guaranteed to be positive semi-definite. Fisher information is a measure of the expected amount of information that an observation provides about the unknown parameter θ; here it serves as the measure of the spring stiffness. The loss minimized while training task B is therefore:

L(θ) = L_B(θ) + Σ_i (λ/2) · F_i · (θ_i − θ*_{A,i})²

When the training of task B is completed and a task C arrives, A and B together can be treated as the old task, and so on.
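To make the penalty concrete, here is a minimal PyTorch-style sketch under stated assumptions: the helper names diagonal_fisher and ewc_penalty are my own, not from the paper, and the Fisher diagonal is estimated empirically from the observed labels, a common simplification.

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader, device="cpu"):
    """Estimate diag(F) as the mean squared gradient of the log-likelihood.

    Rough empirical estimate: squares the per-batch mean gradient rather
    than averaging per-sample squared gradients.
    """
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in data_loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        log_probs = F.log_softmax(model(x), dim=1)
        # Negative log-likelihood of the observed labels.
        F.nll_loss(log_probs, y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(data_loader) for n, f in fisher.items()}

def ewc_penalty(model, fisher, params_a, lam=1000.0):
    """(lambda/2) * sum_i F_i * (theta_i - theta*_A,i)^2"""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - params_a[n]) ** 2).sum()
    return (lam / 2.0) * loss
```

During training on task B, the total loss is then simply the task-B loss plus ewc_penalty(model, fisher_A, params_A).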
3.3 Supervised training
The authors set up a multi-layer fully connected neural network for training multiple supervised tasks. The data are shuffled and trained in mini-batches, and each task gets a fixed number of training passes that is never increased. In panel A of the paper's figure, it can be seen that EWC performs very well and remembers the previous tasks, while SGD shows signs of forgetting the previous task with each new task, and L2 regularization shows catastrophic forgetting (while training task C it forgets task B). The authors also examined SGD on its own: as the number of tasks grows, the fraction of tasks remembered plummets, as shown in panel B. Panel C shows the impact of task similarity on the overlap of the Fisher matrices. A toy version of this sequential-training setup is sketched below.
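As an illustration only, the following hypothetical miniature reuses the diagonal_fisher and ewc_penalty helpers sketched above to train one small fully connected network on a sequence of synthetic "permuted" tasks with a fixed epoch budget per task; the data, sizes, and hyperparameters are invented, not the paper's.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(784, 400), nn.ReLU(), nn.Linear(400, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
old_tasks = []  # one (fisher, anchor-parameters) pair per finished task

for task_id in range(3):
    # Stand-in data: each "task" is a random permutation of random inputs.
    perm = torch.randperm(784)
    x = torch.randn(1024, 784)[:, perm]
    y = torch.randint(0, 10, (1024,))
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

    for epoch in range(5):  # fixed number of passes, never increased
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(xb), yb)
            for fisher, anchor in old_tasks:  # quadratic penalty per old task
                loss = loss + ewc_penalty(model, fisher, anchor)
            loss.backward()
            optimizer.step()

    # Anchor the parameters and importance estimates for this task.
    anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
    old_tasks.append((diagonal_fisher(model, loader), anchor))
```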
4. Other related methods
4.1 LWF
LWF stands for Learning without Forgetting. Its main idea is to handle catastrophic forgetting through knowledge distillation. In the normal training setup, θs represents the shared parameters used for feature extraction, θo represents the parameters of the layers used to classify the old tasks, and θn represents the newly added parameters for the new task. The paper first lists the existing schemes alongside LWF: fine-tuning, feature extraction, and joint training. Fine-tuning and feature extraction are really only applicable to similar tasks, that is, when the data sets are essentially alike. Joint training, in turn, requires the old data set, which is not allowed under some conditions: the data may need privacy protection, or the data may be too large to all be kept. So what can be done? The first step is a warm-up that trains only the newly added θn until it converges; then all parameters are learned jointly using knowledge distillation. The loss for the new task is the normal cross-entropy loss. Knowledge distillation is then done by adding a loss against the outputs of the original model, i.e.:

L_old(y_o, ŷ_o) = −Σ_i y'_o^(i) · log ŷ'_o^(i)

where y_o refers to the labels generated for the new data by the original model, and ŷ_o to the labels generated by the current model's old-task head. This is a modified cross-entropy loss, in which:

y'_o^(i) = (y_o^(i))^(1/T) / Σ_j (y_o^(j))^(1/T),   ŷ'_o^(i) = (ŷ_o^(i))^(1/T) / Σ_j (ŷ_o^(j))^(1/T)

The purpose of the temperature T (usually 2) is to increase the weight of the small probability values. The whole procedure is represented in the paper by a pseudo-code algorithm whose last line, the joint objective, is the crux of the article:

θs*, θo*, θn* = argmin over (θs, θo, θn) of ( λ_o · L_old(Y_o, Ŷ_o) + L_new(Y_n, Ŷ_n) + R(θs, θo, θn) )

R refers to some regularization, and the two loss functions are explained above. Finally, there is the λ_o parameter, which determines the relative importance of the old task versus the new one during training; it is usually 1, so that both are taken care of. A code sketch of this procedure follows below.
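A minimal PyTorch-style sketch of the joint step, under stated assumptions: the function names are illustrative rather than from the paper's code, the model is assumed to return both the new-task and old-task logits, and weight decay inside the optimizer would play the role of R.

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_old_logits, recorded_old_logits, T=2.0):
    """Modified cross entropy between temperature-softened distributions.

    softmax(logits / T) equals the paper's normalized (p)^(1/T) soft targets.
    """
    log_p = F.log_softmax(new_old_logits / T, dim=1)   # current model, old head
    q = F.softmax(recorded_old_logits / T, dim=1)      # original model's labels
    return -(q * log_p).sum(dim=1).mean()

def lwf_step(model, old_model, x, y_new, optimizer, lam_o=1.0):
    """One joint-training step: loss = lam_o * L_old + L_new (+ R via optimizer)."""
    with torch.no_grad():
        old_logits = old_model(x)            # labels generated by original model
    new_logits, old_head_logits = model(x)   # assumed: model returns both heads
    loss = F.cross_entropy(new_logits, y_new) \
        + lam_o * distillation_loss(old_head_logits, old_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```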
4.2 MAS
Memory Aware Synapses: Learning What (not) to Forget. Unlike the two methods above, this paper computes and keeps updating a strength for each parameter. The paper first presents a comparison with the methods above, and the authors argue that their own method is better on every axis: the cost is small, the applicable field is wide, it also works for unsupervised learning, and it reserves capacity for later tasks. The comparison criteria are:

- Constant Memory: whether the memory used by the model stays constant, since only constant memory prevents an explosion as tasks accumulate.
- Problem Agnostic: whether the model solves only one kind of problem; ideally the method performs well and is applicable to all areas.
- On Pretrained: whether, given a pre-trained model, the method can build on top of it and add new tasks.
- Unlabelled Data: whether the model can learn without labels. This is a crucial question that determines many of the directions in which the model can learn.
- Adaptive: whether the model can leave enough capacity for each future task.

The main idea of the paper is to compute a strength ω for each parameter and limit that parameter's updates accordingly. Each time a new task comes in and is trained, parameters with large ω should change as little as possible during gradient descent, because they were important for a previous task and need to be retained to avoid catastrophic forgetting. Parameters with small ω can be updated with larger steps to obtain better performance or accuracy on the new task. During training, the strength ω is added to the loss function in the form of a regularization term.
4.2.1 Calculating the strength
The idea is that if changing a parameter has a great influence on the model's output, then the strength of that parameter should be great. The authors take the degree of change in the model's output as the strength of the parameter. First, let F be the function the network computes in its forward pass, approximating the true function, and let δ be a small perturbation of the parameters; then:

F(x_k; θ + δ) − F(x_k; θ) ≈ Σ_ij g_ij(x_k) · δ_ij

On the left is the change in the output caused by the parameter change that we want to measure, and on the right is the concrete computation. It is in fact quite natural that, to measure sensitivity to change, the gradient must be the first choice, and the gradient needs only a first-order derivative:

g_ij(x_k) = ∂F(x_k; θ) / ∂θ_ij

and the strength ω:

ω_ij = (1/N) Σ_{k=1..N} ||g_ij(x_k)||

But in the multidimensional case, computing this separately for every output dimension is not in keeping with our profession's labor-saving style, so the authors use the squared ℓ2 norm of the output in place of F in the g function, g_ij(x_k) = ∂[ ℓ2²(F(x_k; θ)) ] / ∂θ_ij, which unifies all the output dimensions into one, so that a single backward pass yields everything. So how should the loss be computed for the whole model? Back to the familiar loss:

L(θ) = L_n(θ) + λ Σ_ij ω_ij · (θ_ij − θ*_ij)²

It constrains the direction of the gradient according to the strength ω.
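A minimal sketch of this computation in PyTorch, under stated assumptions: the helper names mas_importance and mas_penalty are mine, and the accumulated importance is averaged over samples as in the formula above. Note that no labels are used, which is what makes MAS applicable to unlabelled data.

```python
import torch

def mas_importance(model, data_loader, device="cpu"):
    """omega_ij = (1/N) * sum_k || d l2^2(F(x_k)) / d theta_ij ||"""
    omega = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_samples = 0
    model.eval()
    for x, *_ in data_loader:          # labels are not needed: unsupervised
        x = x.to(device)
        model.zero_grad()
        out = model(x)
        # Squared L2 norm of the output unifies all output dimensions.
        out.pow(2).sum().backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                omega[n] += p.grad.detach().abs()
        n_samples += x.size(0)
    return {n: w / n_samples for n, w in omega.items()}

def mas_penalty(model, omega, params_star, lam=1.0):
    """lambda * sum_ij omega_ij * (theta_ij - theta*_ij)^2"""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (omega[n] * (p - params_star[n]) ** 2).sum()
    return lam * loss
```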
5. Summary
In general, this has been an overview of the main approaches to continual learning. The continual-learning approaches can be summed up as: replaying previous training data, consolidating short-term memory into long-term memory, and concentrating learned knowledge on as few neurons as possible. In practice, specific methods must be chosen according to the actual application to solve specific problems; each method has its own applicability and scope. Methods like EWC are very general, but the disadvantage is obvious: the overall strength of the constraint is applied uniformly, so without finer differentiation the model may not be optimal. LWF, which resembles model aggregation, uses knowledge distillation to better preserve previous tasks, and the same idea also applies to model aggregation itself. MAS is more fine-grained and does not depend on labeled data, making it the more general algorithm. The future direction of AI will also depend on continual learning rather than on algorithms trained purely offline. Humans learn this way, and AI systems will increasingly be able to do the same. Imagine walking into an office for the first time and tripping over an obstacle; the next time you pass that spot, perhaps just a few minutes later, you will probably know to watch out for it. In short, this is a broad field, and the problems it addresses are complex and diverse: there are settings where user privacy must be protected and data cannot be reused, settings with heterogeneous models, and so on. We need to read more and understand more in order to keep improving, and concrete solutions should be chosen according to the actual problem.
6. References
[1] Kirkpatrick J, Pascanu R, Rabinowitz N, et al. Overcoming catastrophic forgetting in neural networks[J]. Proceedings of the National Academy of Sciences, 2017, 114(13): 3521-3526.
[2] Li Z, Hoiem D. Learning without forgetting[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(12): 2935-2947.
[3] Aljundi R, Babiloni F, Elhoseiny M, et al. Memory aware synapses: Learning what (not) to forget[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 139-154.