Q: What can I learn from reading this blog? A: 1) How to use gradient information accumulated over past training to alleviate the catastrophic forgetting problem of artificial neural networks (deep learning); 2) how to write a custom optimizer for a newly proposed loss function in the Keras framework.
In my last blog post on catastrophic forgetting in neural networks, I tried to summarize the research on catastrophic forgetting in deep learning: how many kinds of methods there are, and how to organize them into a complete picture. The unifying idea behind those methods is to use historical information to protect (by suppressing updates to) the parameters that are important for previously learned knowledge. This post introduces a paper not covered in that earlier review [1]. Its authors argue that the gradients and parameter adjustments produced during past training already encode how important each weight is to the historical knowledge; in other words, historical gradient and update information can be used to measure a weight's importance, and on that basis changes to the important weights can be suppressed, so that old knowledge is not overwritten and catastrophic forgetting is alleviated. Two versions are available online: "Improved Multitask Learning through Synaptic Intelligence" and "Continual Learning through Synaptic Intelligence"; after some deliberation the authors apparently chose the latter as the final title for ICML 2017. Ha ha! Even these authors show a touch of the usual scientific vanity: "multitask learning" fits the content better, but "continual learning" is more eye-catching, so it was pushed into the title. That is only a small quibble; the paper passed ICML review, and its substance is strong enough. Below I analyze it from four angles: idea, method, algorithm (formulas), and code. Here we go!
@[TOC]
1. How to Learn from others
1.1 Definition of intelligence
Since the authors elevate their approach to the level of "intelligence", we first need to know what intelligence is in order to evaluate their work.
Intelligence: The ability to learn or understand things or to deal with new or difficult situations.
Ancient Chinese thinkers regarded wisdom and ability as two relatively independent concepts, with intelligence as the general term covering both: the former is the foundation, namely knowledge, while the latter refers to the capacity to acquire and apply knowledge to solve problems.
Still very empty….
To make the concept of intelligence more concrete, we need to clarify the relationship between intelligence, data, information and knowledge.
- Data = records of facts. For example: the machine's current speed, its heading angle, the distance and speed of obstacles, a 3D point cloud of the environment.
- Information = data + meaning. For example: an obstacle is approaching or moving away, the ground ahead is uneven.
- Knowledge = solutions and strategies for problems. For example: if there is an obstacle ahead, turn aside or stop; if the ground is uneven, slow down.
Intelligence is the ability to move from data to discovering, summarizing, and applying knowledge.
In this paper, the authors discover and summarize knowledge (the importance of each weight) from data (the gradients and parameter updates produced during past training) and apply it (to mitigate catastrophic forgetting): a weight-update strategy is derived from ordinary training data to alleviate catastrophic forgetting in deep learning. In that sense, the paper's method can reasonably be called a form of intelligence, and arguably it corresponds to an advanced human ability: learning to learn.
1.2 Synaptic intelligence derived from biology
The main difference between artificial neural networks (ANNs) and biological neural networks lies in the complexity of synaptic connections. In an ANN, a synapse (weight) is described by a single scalar, while a biological synapse is a complex dynamical system. Chemical synapses rely on elaborate molecular machinery, so each synapse effectively has a very high-dimensional state space, which makes biological synapses highly plastic at both spatial and temporal scales. This complexity of biological synapses is thought to help consolidate memories. In the ANN field, there have been attempts to explain how increased synaptic complexity can benefit neural network models in supervised learning across multiple tasks.
Simple, static one-dimensional synapses suffer from catastrophic forgetting: when the network learns a new task, it forgets a previously learned one. To address this, the state space of each artificial synapse is extended to three dimensions: the current parameter value, the parameter value at the end of the previous task, and the importance of the synapse for the previous tasks. This importance measure can be computed locally at each synapse, and it indicates how much a change in that synapse affects the change of the global loss.
The core idea of the paper is to measure the importance of each synapse while a task is being trained, and then, when training new tasks, to use that importance to penalize changes to the important weights, so that historical memories are not overwritten.
2. How do you measure the importance of synapses
The training process of a neural network (in particular, iterative backpropagation) can be described as a trajectory $\theta(t)$ in parameter space. If the endpoint of this trajectory lands near a minimum of the loss function $L$, the training has succeeded.
A network consists of many neurons and the parameters connecting them. Suppose that at time $t$ the parameters undergo a small change $\delta(t)$ (which models one training update). We can then compare the loss before and after the change, and thus obtain the influence of a small adjustment of each parameter on the loss. The change in the loss can be approximated with the gradient $g=\frac{\partial L}{\partial \theta}$:
$$L(\theta(t)+\delta(t))-L(\theta(t)) \approx \sum_{k} g_k(t)\,\delta_k(t) \tag{1}$$
In other words, the change $\delta_k(t)=\theta'_k(t)$ of the $k$-th parameter contributes $g_k(t)\,\delta_k(t)$ to the change of the loss.
To compute the total effect on the loss of all the small parameter changes along a path in parameter space, we take the path integral of the gradient along the parameter trajectory:
$$\int_C g(\theta(t))\,d\theta=\int_{t_0}^{t_1}g(\theta(t))\,\theta'(t)\,dt=L(\theta(t_1))-L(\theta(t_0)) \tag{2}$$
To isolate the influence of a single weight on the loss, Equation (2) can be decomposed per parameter over the training interval of task $\mu$:
$$\int_{t^{\mu-1}}^{t^{\mu}}g(\theta(t))\,\theta'(t)\,dt=\sum_{k}\int_{t^{\mu-1}}^{t^{\mu}}g_k(\theta(t))\,\theta'_k(t)\,dt\equiv-\sum_{k}\omega_k^{\mu} \tag{3}$$
Here $-\omega_k^{\mu}$ is the contribution of the adjustments of the $k$-th weight to the change of the loss function.
In practice, $\omega_k^{\mu}$ can be approximated online by summing, over the training steps, the product of the gradient and the parameter update.
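To make this running-sum approximation concrete, here is a minimal NumPy sketch. It is my own illustration, not the paper's code: the toy quadratic loss, the learning rate, and all variable names are assumptions.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * ||theta - target||^2, purely for illustration.
target = np.array([1.0, -2.0, 0.5])
def loss_grad(theta):
    return theta - target

theta = np.zeros_like(target)   # current parameters
omega = np.zeros_like(theta)    # per-parameter contribution to the loss drop, eq. (3)
lr = 0.1

for step in range(100):
    g = loss_grad(theta)
    new_theta = theta - lr * g              # one SGD update
    omega += -g * (new_theta - theta)       # running sum of -g_k * delta_k approximates omega_k
    theta = new_theta

# Parameters that moved the most (and reduced the loss the most) end up with the largest omega.
print(omega)
```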
So far the argument appeals both to intuition (words) and to reason (formulas): small parameter adjustments change the loss value, and we have a formula for computing that change.
But how do we lift this information to the level of knowledge, that is, how do we use it to improve the multi-task learning ability of the neural network (I still insist that multi-task learning is not the same as continual learning)? In other words, how can $\omega_k^{\mu}$ be exploited to improve continual learning?
Suppose the path integral associated with weight $k$ while training task $\mu$ is $\omega_k^{\mu}$. The larger $\omega_k^{\mu}$ is, the more that weight contributed during the training of task $\mu$, which indirectly indicates that the weight is important for task $\mu$.
By analogy with people: we make all kinds of changes in order to adapt to a new environment. Assuming we eventually do adapt, whatever changed the most is what mattered most for the adaptation.
That transition was a bit abrupt, hahaha…
So the path integral accumulated during training can serve as the basis for measuring a weight's importance relative to a task: the larger the integral, the more important the weight is for that task.
In the paper, the importance of a weight is measured by the following formula:
$$\Omega_k^{\mu}=\sum_{\nu<\mu}\frac{\omega_k^{\nu}}{(\Delta_k^{\nu})^2+\xi} \tag{4}$$
Here $\Delta_k^{\nu}=\theta_k(t^{\nu})-\theta_k(t^{\nu-1})$ is the total change of weight $k$ during task $\nu$; dividing by $(\Delta_k^{\nu})^2$ ensures the regularization term has the same units as the loss function. The parameter $\xi$ prevents numerical problems when $\Delta_k^{\nu}$ is close to zero. While a task is being trained, $\omega_k^{\mu}$ is updated continuously, whereas $\Omega_k^{\mu}$ and $\hat{\theta}$ are updated only at the end of the previous task, before a new task starts; after $\Omega_k^{\mu}$ is updated, $\omega_k^{\mu}$ is reset to 0.
With $\Omega_k^{\mu}$ in hand, we add the weight importance to the cost function:
$$\hat{L}_{\mu}=L_{\mu}+c\sum_k \Omega_k^{\mu}(\hat{\theta}_k-\theta_k)^2 \tag{5}$$
Here $\hat{\theta}_k=\theta_k^{\mu-1}$, i.e., the value of the weight at the end of training on the previous task.
The newly constructed cost function restrains the adjustment of weights according to their importance for the historical tasks: the more important a weight, the smaller the change it is allowed to make (this is exactly what the regularization term does).
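Continuing the same toy sketch (again my own illustration, not the authors' implementation; `xi`, `c`, and the function names are assumptions), the per-task consolidation of Equation (4) and the surrogate loss of Equation (5) would look like this:

```python
import numpy as np

xi, c = 0.1, 0.1    # damping term and regularization strength (illustrative values)

def consolidate(omega, Omega, theta, theta_prev):
    """End-of-task bookkeeping: fold omega into Omega (eq. 4), then reset the accumulator."""
    delta = theta - theta_prev                    # total movement of each weight during the task
    Omega = Omega + omega / (delta ** 2 + xi)     # eq. (4)
    omega = np.zeros_like(omega)                  # omega is reset to 0 after the update
    theta_prev = theta.copy()                     # new anchor point (theta hat in eq. 5)
    return omega, Omega, theta_prev

def surrogate_loss(task_loss, theta, theta_prev, Omega):
    """Eq. (5): the task loss plus a quadratic penalty on moving important weights."""
    return task_loss + c * np.sum(Omega * (theta_prev - theta) ** 2)
```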
The paper also gives a theoretical analysis of the weight path integral (which I did not feel like reading, headache!) and compares it with the Fisher information. Interested readers can go take a look.
3. How do I read code — line by line
The authors open-sourced the paper's code on GitHub: ganguli-lab/pathint (github.com/ganguli-lab/pathint).
3.1 Keras custom optimizer
Let's first get familiar with the Keras backend module and see how to implement a dedicated optimizer for our own loss function. First of all, thanks to the blog post "Trading time for effect: a Keras gradient-accumulation optimizer".
3.1.1 Superclass Optimizer
Keras defines a number of optimizers, such as SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, Nadam, and TFOptimizer. They all share one parent, the Optimizer class, which provides the skeleton of an optimizer; each concrete optimizer just fleshes it out with a different body: tall, short, fat, thin. The two main methods are `__init__` and `get_updates`. Different optimizers pass different parameters to the initialization function, and we can add extra input parameters as needed; the update logic likewise varies from optimizer to optimizer. Subclassing mainly means overriding these two methods, plus whatever additional helper methods are needed. In the parent Optimizer class, `get_updates` is left unimplemented, reserving the slot for subclasses. All of the concrete optimizers discussed below override just these two methods, along with a few important helpers. (A minimal subclass skeleton is sketched right after the base class code below.)
```python
class Optimizer(object):
    """Abstract optimizer base class.

    Note: this is the parent class of all optimizers, not an actual optimizer
    that can be used for training models.

    All Keras optimizers support the following keyword arguments:

        clipnorm: float >= 0. Gradients will be clipped
            when their L2 norm exceeds this value.
        clipvalue: float >= 0. Gradients will be clipped
            when their absolute value exceeds this value.
    """

    def __init__(self, **kwargs):
        allowed_kwargs = {'clipnorm', 'clipvalue'}
        for k in kwargs:
            if k not in allowed_kwargs:
                raise TypeError('Unexpected keyword argument '
                                'passed to optimizer: ' + str(k))
        self.__dict__.update(kwargs)
        self.updates = []
        self.weights = []

    @interfaces.legacy_get_updates_support
    @K.symbolic
    def get_updates(self, loss, params):
        raise NotImplementedError

    def get_gradients(self, loss, params):
        grads = K.gradients(loss, params)
        if any(x is None for x in grads):
            raise ValueError('An operation has `None` for gradient. '
                             'Please make sure that all of your ops have a '
                             'gradient defined (i.e. are differentiable). '
                             'Common ops without gradient: '
                             'K.argmax, K.round, K.eval.')
        if hasattr(self, 'clipnorm') and self.clipnorm > 0:
            norm = K.sqrt(sum([K.sum(K.square(g)) for g in grads]))
            grads = [clip_norm(g, self.clipnorm, norm) for g in grads]
        if hasattr(self, 'clipvalue') and self.clipvalue > 0:
            grads = [K.clip(g, -self.clipvalue, self.clipvalue) for g in grads]
        return grads

    def set_weights(self, weights):
        """Sets the weights of the optimizer, from Numpy arrays.

        Should only be called after computing the gradients
        (otherwise the optimizer has no weights).

        # Arguments
            weights: a list of Numpy arrays. The number of arrays and their
                shape must match the number of the dimensions of the weights
                of the optimizer (i.e. it should match the output of
                `get_weights`).

        # Raises
            ValueError: in case of incompatible weight shapes.
        """
        params = self.weights
        if len(params) != len(weights):
            raise ValueError('Length of the specified weight list (' +
                             str(len(weights)) +
                             ') does not match the number of weights '
                             'of the optimizer (' + str(len(params)) + ')')
        weight_value_tuples = []
        param_values = K.batch_get_value(params)
        for pv, p, w in zip(param_values, params, weights):
            if pv.shape != w.shape:
                raise ValueError('Optimizer weight shape ' +
                                 str(pv.shape) +
                                 ' not compatible with '
                                 'provided weight shape ' + str(w.shape))
            weight_value_tuples.append((p, w))
        K.batch_set_value(weight_value_tuples)

    def get_weights(self):
        """Returns the current value of the weights of the optimizer.

        # Returns
            A list of numpy arrays.
        """
        return K.batch_get_value(self.weights)

    def get_config(self):
        config = {}
        if hasattr(self, 'clipnorm'):
            config['clipnorm'] = self.clipnorm
        if hasattr(self, 'clipvalue'):
            config['clipvalue'] = self.clipvalue
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)

    @property
    def lr(self):
        # Legacy support.
        return self.learning_rate
```
```
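Before moving on, here is a minimal sketch of what a subclass actually has to provide. This is my own illustration, not code from Keras or from the paper; the class name `PlainSGD` is made up. Only `__init__` and `get_updates` are overridden; everything else (gradient clipping, weight getters, config) is inherited.

```python
import keras.backend as K
from keras.optimizers import Optimizer

class PlainSGD(Optimizer):
    """Minimal custom optimizer: vanilla gradient descent, no momentum or decay."""
    def __init__(self, learning_rate=0.01, **kwargs):
        super(PlainSGD, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.learning_rate = K.variable(learning_rate, name='learning_rate')
            self.iterations = K.variable(0, dtype='int64', name='iterations')

    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)  # inherited: also handles clipnorm / clipvalue
        self.updates = [K.update_add(self.iterations, 1)]
        for p, g in zip(params, grads):
            self.updates.append(K.update(p, p - self.learning_rate * g))
        return self.updates

    def get_config(self):
        config = {'learning_rate': float(K.get_value(self.learning_rate))}
        base_config = super(PlainSGD, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
```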
3.1.2 A simple optimizer example: Adagrad
Adagrad is one of the simplest optimizers to read. The input parameters of its initialization function are the learning rate, the learning-rate decay, and a small constant (epsilon) that guards against numerical problems such as division by zero.
- Initialization function
```python
def __init__(self, learning_rate=0.01, **kwargs):
    self.initial_decay = kwargs.pop('decay', 0.0)
    self.epsilon = kwargs.pop('epsilon', K.epsilon())
    learning_rate = kwargs.pop('lr', learning_rate)
    super(Adagrad, self).__init__(**kwargs)
    with K.name_scope(self.__class__.__name__):
        self.learning_rate = K.variable(learning_rate, name='learning_rate')
        self.decay = K.variable(self.initial_decay, name='decay')
        self.iterations = K.variable(0, dtype='int64', name='iterations')
```
```
In the update function, the list `self.updates` stores all the quantities that need updating, and the backend then applies them all together. For example, `self.updates = [K.update_add(self.iterations, 1)]` and `self.updates.append(K.update(p, new_p))` update the iteration counter and the weights, respectively. We do not care how the update is carried out under the hood; we just put every update op into the list.
- Get_updates function
```python
def get_updates(self, loss, params):
    grads = self.get_gradients(loss, params)
    shapes = [K.int_shape(p) for p in params]
    accumulators = [K.zeros(shape, name='accumulator_' + str(i))
                    for (i, shape) in enumerate(shapes)]
    self.weights = [self.iterations] + accumulators
    self.updates = [K.update_add(self.iterations, 1)]

    lr = self.learning_rate
    if self.initial_decay > 0:
        lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                  K.dtype(self.decay))))

    for p, g, a in zip(params, grads, accumulators):
        new_a = a + K.square(g)  # update accumulator
        self.updates.append(K.update(a, new_a))
        new_p = p - lr * g / (K.sqrt(new_a) + self.epsilon)

        # Apply constraints.
        if getattr(p, 'constraint', None) is not None:
            new_p = p.constraint(new_p)

        self.updates.append(K.update(p, new_p))
    return self.updates
```
```
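In formula form, the update implemented by the code above is the standard Adagrad rule:

$$a_k \leftarrow a_k + g_k^2, \qquad \theta_k \leftarrow \theta_k - \frac{\eta}{\sqrt{a_k}+\epsilon}\, g_k$$

where $\eta$ is the (possibly decayed) learning rate, $a_k$ is the per-parameter accumulator of squared gradients, and $\epsilon$ is the small constant `self.epsilon`.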
3.1.3 Define your own optimizer
The author of "Trading time for effect: a Keras gradient-accumulation optimizer" is excellent and has pushed the code to GitHub at Bojone/Accum_Optimizer_for_keras. The program is short and easy to read; how quickly you understand it varies from person to person, but if you understand TensorFlow computation graphs it is very easy to follow. The blogger also gives a sample program and suggests running it to deepen understanding. I ran it for half an hour and had only gotten through about a tenth of it, so I decisively stopped: it was enough to verify that the algorithm works and the program is correct. Running an image dataset on my own machine is a bit of a stretch!
The goal is a "soft batch": each step uses batch_size=1, i.e., a single example is fed into the network to obtain a gradient. To retain the convergence stability of batch training, each step only computes and accumulates the gradient; once the accumulated count reaches the preset soft batch_size, the accumulated gradient is used to perform one weight update. As the title of that blog post says: trading time for effect (and memory), which reduces the dependence on the GPU. This article does not judge whether the approach is reasonable; the point is simply to get familiar with how to define your own optimizer in Keras. If we go deep into machine learning, sooner or later we will want to design our own optimizer, and not knowing how to implement it leaves us passive. Write a neural network from scratch every time? Give yourself a break!
```python
#! -*- coding: utf-8 -*-
from keras.optimizers import Optimizer
import keras.backend as K


class AccumOptimizer(Optimizer):
    """Inherits the Optimizer class, wraps the original optimizer,
    and implements gradient accumulation.
    # Arguments
        optimizer: an instance of a keras optimizer
            (supports all keras optimizers currently available);
        steps_per_update: the number of steps over which gradients are accumulated.
    # Returns
        a new keras optimizer.
    """
    def __init__(self, optimizer, steps_per_update=1, **kwargs):
        super(AccumOptimizer, self).__init__(**kwargs)
        self.optimizer = optimizer
        with K.name_scope(self.__class__.__name__):
            self.steps_per_update = steps_per_update
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.cond = K.equal(self.iterations % self.steps_per_update, 0)
            self.lr = self.optimizer.lr
            self.optimizer.lr = K.switch(self.cond, self.optimizer.lr, 0.)
            for attr in ['momentum', 'rho', 'beta_1', 'beta_2']:
                if hasattr(self.optimizer, attr):
                    value = getattr(self.optimizer, attr)
                    setattr(self, attr, value)
                    setattr(self.optimizer, attr, K.switch(self.cond, value, 1 - 1e-7))
            for attr in self.optimizer.get_config():
                if not hasattr(self, attr):
                    value = getattr(self.optimizer, attr)
                    setattr(self, attr, value)
        # Overwrite the original gradient method so it points to the accumulated gradients
        # (cover the original get_gradients method with accumulative gradients).
        def get_gradients(loss, params):
            return [ag / self.steps_per_update for ag in self.accum_grads]
        self.optimizer.get_gradients = get_gradients

    def get_updates(self, loss, params):
        self.updates = [
            K.update_add(self.iterations, 1),
            K.update_add(self.optimizer.iterations, K.cast(self.cond, 'int64')),
        ]
        # Gradient accumulation
        self.accum_grads = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        grads = self.get_gradients(loss, params)
        for g, ag in zip(grads, self.accum_grads):
            self.updates.append(K.update(ag, K.switch(self.cond, ag * 0, ag + g)))
        # Inherit the updates of the original optimizer
        self.updates.extend(self.optimizer.get_updates(loss, params)[1:])
        self.weights.extend(self.optimizer.weights)
        return self.updates

    def get_config(self):
        iterations = K.eval(self.iterations)
        K.set_value(self.iterations, 0)
        config = self.optimizer.get_config()
        K.set_value(self.iterations, iterations)
        return config
```
```
To realize soft batches by accumulating single-sample gradients, the initialization function takes the wrapped optimizer `optimizer` (the blogger uses Adam) and defines the condition `cond`. `cond` indicates whether the current step has reached the preset accumulation count (the soft batch_size): true if it has, false otherwise. The initialization function also defines a new gradient-getting function (since gradients are now taken from the accumulators) and assigns it to `self.optimizer.get_gradients`, i.e., it overrides the `get_gradients` method of the Adam instance implemented by Keras (shown below).
```python
def get_gradients(loss, params):
    return [ag / self.steps_per_update for ag in self.accum_grads]
self.optimizer.get_gradients = get_gradients
```
```
In the update function there is also a call that obtains gradients. `self.optimizer.get_gradients` is different from `self.get_gradients`: the former is the (overridden) gradient function of the wrapped Adam optimizer, while the latter is the gradient function of the newly defined wrapper optimizer itself.
[Very important] Note: interestingly, the newly defined optimizer takes an optimizer as one of its parameters. It not only inherits from the Optimizer class, it also receives an Adam instance (whose parent is likewise Optimizer) as input, so there are two complete optimizers inside the new one: `self.optimizer.get_gradients` versus `self.get_gradients`, and `self.optimizer.get_updates` versus `self.get_updates`. When reading the program, just pay attention to whether a method is called on the wrapped optimizer (`self.optimizer`) or on the newly defined optimizer itself, and you will not get confused.
In the initialization function, the variables `self.cond` and `self.optimizer.lr` are written into the TensorFlow graph, so their values are recomputed automatically whenever the variables they depend on change. In the `get_updates` function, `self.iterations` is explicitly updated, so these dependent values refresh at every step. Note that `self.cond` indicates whether the accumulation has reached the soft batch_size; if it has, the learning rate is `self.optimizer.lr`, otherwise it is 0. This indirectly makes the wrapped optimizer do no training during the accumulation phase (learning rate 0), train once when the accumulation is full, and then start accumulating again from zero.
```python
self.cond = K.equal(self.iterations % self.steps_per_update, 0)
self.lr = self.optimizer.lr
self.optimizer.lr = K.switch(self.cond, self.optimizer.lr, 0.)
```
```
In the update function of the new optimizer, all the variables that need updating are put into the list `self.updates`: the iteration counters, the accumulated gradients, and the parameter-update ops of the wrapped Adam. Note that the `get_gradients` method and the learning rate `lr` of the wrapped Adam were already redefined in the initialization function: the learning rate is sometimes zero and sometimes the normal, meaningful value, and the gradient function now returns the averaged accumulated gradient defined there.
```python
self.updates = [
    K.update_add(self.iterations, 1),
    K.update_add(self.optimizer.iterations, K.cast(self.cond, 'int64')),
]
for g, ag in zip(grads, self.accum_grads):
    self.updates.append(K.update(ag, K.switch(self.cond, ag * 0, ag + g)))
self.updates.extend(self.optimizer.get_updates(loss, params)[1:])
```
These programs are much easier to read if you understand TensorFlow's computation graphs. If you are not familiar with the concept, you can think of the whole program as a static, top-to-bottom directed graph, where each node's output is the input of the nodes at the next level. Once the value of a node is updated, every node that depends on it is updated as well. The `get_updates` function returns the list `updates` of nodes that need updating, and the nodes connected to them are then refreshed automatically. That is the TF graph.
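For completeness, here is how such a wrapper would typically be used when compiling a model. This is a hedged sketch based on the class above: the toy model, the value of `steps_per_update`, and the training call are my own illustrative choices, not the blogger's example.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Toy model; any Keras model is wrapped the same way.
model = Sequential([Dense(10, activation='softmax', input_shape=(784,))])

# Wrap Adam so that weights are updated only once every 16 accumulated single-sample gradients.
opt = AccumOptimizer(Adam(), steps_per_update=16)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

# Train with batch_size=1; the effective batch size is steps_per_update.
# model.fit(x_train, y_train, batch_size=1, epochs=1)
```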
3.2 Pathint | dead simple
Everything above was just a warm-up, to build a little confidence before facing the monster below. How difficult it was! This program made the world look gray to me: I read it for three days, five or six times from start to finish. I knew every letter, yet I could not understand how this class implements the method described in the paper. The paper ultimately boils down to two formulas: 1) the new loss function; 2) the weight importance $\omega$. Still, I could not map them onto the program. Looking back now, it is actually easy to read once a few key points are clear. What blocked me was: 1) an unclear picture of the TensorFlow computation graph; 2) carelessness born of confusion. The former has been dealt with by the example analysis above; below I list the places where I was careless, for your reference, so that you do not have to take the same eighteen detours.
```python
# Copyright (c) 2017 Ben Poole & Friedemann Zenke
# MIT License -- see LICENSE for details
#
# This file is part of the code to reproduce the core results of:
# Zenke, F., Poole, B., and Ganguli, S. (2017). Continual Learning Through
# Synaptic Intelligence. In Proceedings of the 34th International Conference on
# Machine Learning, D. Precup, and Y.W. Teh, eds. (International Convention
# Centre, Sydney, Australia: PMLR), pp. 3987-3995.
# http://proceedings.mlr.press/v70/zenke17a.html
#
"""Optimization algorithms."""
import tensorflow as tf
import numpy as np
import keras
from keras import backend as K
from keras.optimizers import Optimizer
from keras.callbacks import Callback
from pathint.utils import extract_weight_changes, compute_updates
from pathint.regularizers import quadratic_regularizer
from collections import OrderedDict
class KOOptimizer(Optimizer) :
"""An optimizer whose loss depends on its own updates."""
def _allocate_var(self, name=None) :
return {w: K.zeros(w.get_shape(), name=name) for w in self.weights}
def _allocate_vars(self, names) :
#TODO: add names, better shape/init checking
self.vars = {name: self._allocate_var(name=name) for name in names}
def __init__(self, opt, step_updates=[], task_updates=[], init_updates=[], task_metrics = {}, regularizer_fn=quadratic_regularizer,
lam=1.0, model=None, compute_average_loss=False, compute_average_weights=False, **kwargs) :
"""Instantiate an optimzier that depends on its own updates.
Args:
opt: Keras optimizer
step_updates: OrderedDict or List of tuples
Contains variable names and updates to be run at each step:
(name, lambda vars, weight, prev_val: new_val). See below for details.
task_updates: same as step_updates but run after each task
init_updates: updates to be run before using the optimizer
task_metrics: list of names of metrics to compute on full data/unionset after a task
regularizer_fn (optional): function, takes in weights and variables returns scalar
defaults to EWC regularizer
lam: scalar penalty that multiplies the regularization term
model: Keras model to be optimized. Needed to compute Fisher information
compute_average_loss: compute EMA of the loss, default: False
compute_average_weights: compute EMA of the weights, default: False
Variables are created for each name in the task and step updates. Note that you cannot
use the name 'grads', 'unreg_grads' or 'deltas' as those are reserved to contain the gradients
of the full loss, loss without regularization, and the weight updates at each step.
You can access them in the vars dict, e.g.: oopt.vars['grads']
The step and task update functions have the signature:
def update_fn(vars, weight, prev_val):
'''Compute the new value for a variable.
Args:
vars: optimization variables (OuroborosOptimzier.vars)
weight: weight Variable in model that this variable is associated with.
prev_val: previous value of this varaible
Returns:
Tensor representing the new value'''
You can run both task and step updates on the same variable, allowing you to reset
step variables after each task.
"""
super(KOOptimizer, self).__init__(**kwargs)
if not isinstance(opt, keras.optimizers.Optimizer):
raise ValueError("opt must be an instance of keras.optimizers.Optimizer but got %s"%type(opt))
if not isinstance(step_updates, OrderedDict):
step_updates = OrderedDict(step_updates)
if not isinstance(task_updates, OrderedDict): task_updates = OrderedDict(task_updates)
if not isinstance(init_updates, OrderedDict): init_updates = OrderedDict(init_updates)
# task_metrics
self.names = set().union(step_updates.keys(), task_updates.keys(), task_metrics.keys())
if 'grads' in self.names or 'deltas' in self.names:
raise ValueError("Optimization variables cannot be named 'grads' or 'deltas'")
self.step_updates = step_updates
self.task_updates = task_updates
self.init_updates = init_updates
self.compute_average_loss = compute_average_loss
self.regularizer_fn = regularizer_fn
# Compute loss and gradients
self.lam = K.variable(value=lam, dtype=tf.float32, name="lam")
self.nb_data = K.variable(value=1.0, dtype=tf.float32, name="nb_data")
self.opt = opt
#self.compute_fisher = compute_fisher
#if compute_fisher and model is None:
# raise ValueError("To compute Fisher information, you need to pass in a Keras model object ")
self.model = model
self.task_metrics = task_metrics
self.compute_average_weights = compute_average_weights
def set_strength(self, val) :
K.set_value(self.lam, val)
def set_nb_data(self, nb) :
K.set_value(self.nb_data, nb)
def get_updates(self, params,loss,model=None) :
self.weights = params
# Allocate variables
with tf.variable_scope("KOOptimizer"):
self._allocate_vars(self.names)
#grads = self.get_gradients(loss, params)
# Compute loss and gradients
self.regularizer = 0.0 if self.regularizer_fn is None else self.regularizer_fn(params, self.vars)
self.initial_loss = loss
self.loss = loss + self.lam * self.regularizer
with tf.variable_scope("wrapped_optimizer"):
    self._weight_update_op, self._grads, self._deltas = compute_updates(self.opt, self.loss, params)
wrapped_opt_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, "wrapped_optimizer")
self.init_opt_vars = tf.variables_initializer(wrapped_opt_vars)
self.vars['unreg_grads'] = dict(zip(params, tf.gradients(self.initial_loss, params)))
# Compute updates
self.vars['grads'] = dict(zip(params, self._grads))
self.vars['deltas'] = dict(zip(params, self._deltas))
# Keep a pointer to self in vars so we can use it in the updates
self.vars['oopt'] = self
# Keep number of data samples handy for normalization purposes
self.vars['nb_data'] = self.nb_data
if self.compute_average_weights:
with tf.variable_scope("weight_emga") as scope:
weight_ema = tf.train.ExponentialMovingAverage(decay=0.99, zero_debias=True)
self.maintain_weight_averages_op = weight_ema.apply(self.weights)
self.vars['average_weights'] = {w: weight_ema.average(w) for w in self.weights}
self.weight_ema_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=scope.name)
self.init_weight_ema_vars = tf.variables_initializer(self.weight_ema_vars)
print("> > > > >")
K.get_session().run(self.init_weight_ema_vars)
if self.compute_average_loss:
with tf.variable_scope("ema") as scope:
ema = tf.train.ExponentialMovingAverage(decay=0.99, zero_debias=True)
self.maintain_averages_op = ema.apply([self.initial_loss])
self.ema_loss = ema.average(self.initial_loss)
self.prev_loss = tf.Variable(0.0, trainable=False, name="prev_loss")
self.delta_loss = tf.Variable(0.0, trainable=False, name="delta_loss")
self.ema_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=scope.name)
self.init_ema_vars = tf.variables_initializer(self.ema_vars)
# if self.compute_fisher:
# self._fishers, _, _, _ = compute_fishers(self.model)
# #fishers = compute_fisher_information(model)
# self.vars['fishers'] = dict(zip(weights, self._fishers))
# #fishers, avg_fishers, update_fishers, zero_fishers = compute_fisher_information(model)
def _var_update(vars, update_fn) :
updates = []
for w in params:
updates.append(tf.assign(vars[w], update_fn(self.vars, w, vars[w])))
return tf.group(*updates)
def _compute_vars_update_op(updates) :
# Force task updates to happen sequentially
update_op = tf.no_op()
for name, update_fn in updates.items():
with tf.control_dependencies([update_op]):
update_op = _var_update(self.vars[name], update_fn)
return update_op
self._vars_step_update_op = _compute_vars_update_op(self.step_updates)
self._vars_task_update_op = _compute_vars_update_op(self.task_updates)
self._vars_init_update_op = _compute_vars_update_op(self.init_updates)
# Create task-relevant update ops
reset_ops = []
update_ops = []
for name, metric_fn in self.task_metrics.items():
metric = metric_fn(self)
for w in params:
reset_ops.append(tf.assign(self.vars[name][w], 0*self.vars[name][w]))
update_ops.append(tf.assign_add(self.vars[name][w], metric[w]))
self._reset_task_metrics_op = tf.group(*reset_ops)
self._update_task_metrics_op = tf.group(*update_ops)
# Each step we update the weights using the optimizer as well as the step-specific variables
self.step_op = tf.group(self._weight_update_op, self._vars_step_update_op)
self.updates.append(self.step_op)
# After each task, run task-specific variable updates
self.task_op = self._vars_task_update_op
self.init_op = self._vars_init_update_op
if self.compute_average_weights:
self.updates.append(self.maintain_weight_averages_op)
if self.compute_average_loss:
self.update_loss_op = tf.assign(self.prev_loss, self.ema_loss)
bupdates = self.updates
with tf.control_dependencies(bupdates + [self.update_loss_op]):
self.updates = [tf.group(*[self.maintain_averages_op])]
self.delta_loss = self.prev_loss - self.ema_loss
return self.updates#[self._base_updates
def init_task_vars(self) :
K.get_session().run([self.init_op])
def init_acc_vars(self) :
K.get_session().run(self.init_ema_vars)
def init_loss(self, X, y, batch_size) :
pass
#sess = K.get_session()
#xi, yi, sample_weights = self.model.model._standardize_user_data(X[:batch_size], y[:batch_size], batch_size=batch_size)
#sess.run(tf.assign(self.prev_loss, self.initial_loss), {self.model.input:xi[0], self.model.model.targets[0]:yi[0], self.model.model.sample_weights[0]:sample_weights[0], K.learning_phase():1})
def update_task_vars(self) :
K.get_session().run(self.task_op)
def update_task_metrics(self, X, y, batch_size) :
# Reset metric accumulators
n_batch = len(X) // batch_size
sess = K.get_session()
sess.run(self._reset_task_metrics_op)
for i in range(n_batch):
xi, yi, sample_weights = self.model._standardize_user_data(X[i * batch_size:(i+1) * batch_size], y[i*batch_size:(i+1)*batch_size], batch_size=batch_size)
sess.run(self._update_task_metrics_op, {self.model.input:xi[0], self.model.targets[0]:yi[0], self.model.sample_weights[0]:sample_weights[0]})
def reset_optimizer(self) :
"""Reset the optimizer variables"""
K.get_session().run(self.init_opt_vars)
def get_config(self) :
raise ValueError("Write the get_config bro")
def get_numvals_list(self, key='omega') :
""" Returns list of numerical values such as for instance omegas in reproducible order """
variables = self.vars[key]
numvals = []
for p in self.weights:
numval = K.get_value(tf.reshape(variables[p],(-1,)))
numvals.append(numval)
return numvals
def get_numvals(self, key='omega') :
""" Returns concatenated list of numerical values such as for instance omegas in reproducible order """
conc = np.concatenate(self.get_numvals_list(key))
return conc
def get_state(self) :
state = []
vs = self.vars
for key in vs.keys():
if key=='oopt': continue
v = vs[key]
for p in v.values():
state.append(K.get_value(p)) # FIXME WhyTF does this not work?
return state
def set_state(self, state) :
c = 0
vs = self.vars
for key in vs.keys():
if key=='oopt': continue
v = vs[key]
for p in v.values():
K.set_value(p,state[c])
c += 1
```
When I first read the initialization function, I did not look carefully at the incoming parameters. I assumed that step_updates, task_updates, and so on were empty lists waiting to be initialized, or to be appended to later. After searching several times, I found that they are never filled in afterwards: if nothing is passed in from the start, how does the program manage to train successfully (I had run it before reading the code, and the results were fine)? I wondered.
```python
def __init__(self, opt, step_updates=[], task_updates=[], init_updates=[], task_metrics={},
             regularizer_fn=quadratic_regularizer, lam=1.0, model=None,
             compute_average_loss=False, compute_average_weights=False, **kwargs):
```
```
Only yesterday did I notice that when the sample program instantiates the newly defined optimizer, the parameters are passed in through `**protocol`:
```python
protocol_name, protocol = protocols.PATH_INT_PROTOCOL(omega_decay='sum', xi=xi)
oopt = KOOptimizer(opt, model=model, **protocol)
```
```
I had simply been overlooking where it comes from; once you pay attention, it is easy to find. It is defined as follows (and inside the protocol we can spot the weight-importance variable $\omega$, named `omega`):
```python
PATH_INT_PROTOCOL = lambda omega_decay, xi: (
    'path_int[omega_decay=%s,xi=%s]' % (omega_decay, xi),
    {
        'init_updates': [
            ('cweights', lambda vars, w, prev_val: w.value()),
        ],
        'step_updates': [
            ('grads2', lambda vars, w, prev_val: prev_val - vars['unreg_grads'][w] * vars['deltas'][w]),
        ],
        'task_updates': [
            ('omega', lambda vars, w, prev_val: tf.nn.relu(ema(omega_decay, prev_val, vars['grads2'][w] / ((vars['cweights'][w] - w.value())**2 + xi)))),
            #('cached_grads2', lambda vars, w, prev_val: vars['grads2'][w]),
            #('cached_cweights', lambda vars, w, prev_val: vars['cweights'][w]),
            ('cweights', lambda opt, w, prev_val: w.value()),
            ('grads2', lambda vars, w, prev_val: prev_val * 0.0),
        ],
        'regularizer_fn': quadratic_regularizer,
    })
```
```
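Reading this protocol against the formulas in Section 2 (my own annotation, based purely on the variable names in the code above):

- `cweights` stores $\hat{\theta}_k$, the weight values at the end of the previous task (Equation (5));
- `grads2` is the running sum $-\sum g_k\,\Delta\theta_k$ accumulated at every step, i.e. the online approximation of $\omega_k^{\mu}$ from Equation (3), and it is reset to 0 after each task;
- `omega` implements Equation (4): at the end of each task, $\omega_k^{\mu}/\big((\theta_k-\hat{\theta}_k)^2+\xi\big)$ is folded into $\Omega_k$ (with `omega_decay='sum'` this is a plain sum over tasks, passed through a ReLU to keep it non-negative);
- `quadratic_regularizer` supplies the penalty $\sum_k \Omega_k(\hat{\theta}_k-\theta_k)^2$ of Equation (5), which `get_updates` scales by `lam` (the constant $c$).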
With that, everything connects back into the TensorFlow graph and it all works. This is not a difficult program, provided you understand the concept of the TensorFlow computation graph and read carefully. But the computation graph had already used up most of my patience, and by the time I understood it I had become careless. Alas! So hard!
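Pieced together from the class methods shown above, the overall training loop looks roughly like this. This is a hedged sketch, not the repo's actual training script: `model`, `opt` (for example Adam), `protocols`, and the list of `tasks` are assumed to be defined as in the repo's example notebooks, and the hyperparameter values are illustrative.

```python
# Hedged sketch of driving KOOptimizer across several tasks.
protocol_name, protocol = protocols.PATH_INT_PROTOCOL(omega_decay='sum', xi=0.1)
oopt = KOOptimizer(opt, model=model, **protocol)
model.compile(loss='categorical_crossentropy', optimizer=oopt, metrics=['accuracy'])

oopt.set_strength(1.0)       # the constant c (lam) multiplying the quadratic penalty
oopt.init_task_vars()        # run init_updates: snapshot cweights before the first task
for x_train, y_train in tasks:
    model.fit(x_train, y_train, batch_size=256, epochs=20, verbose=0)
    oopt.update_task_vars()  # run task_updates: fold grads2 into omega, re-snapshot cweights, reset grads2
```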
Here is the end result of running the authors' sample program on the MNIST dataset.
4. To summarize
This post analyzed paper [1] on four levels: idea, method, algorithm (formulas), and code. At the beginning of the article I complained a little about the authors' choice of title: `Multitask Learning` matches the actual content of the paper better than `Continual Learning`, but the authors chose the more advanced-sounding `Continual Learning`. First of all, let me acknowledge the quality of the paper's method: how bad can an ICML paper be, and would I have spent so much time reading it closely, let alone writing this blog post, if I thought it was bad? I raised the title as a specific issue, but one worth attention: it gives a less-than-accurate label to a paper I like. Why do humans spend so much energy on definitions and classification? To minimize the entropy of information transfer: to deliver enough information, accurately enough, in the simplest possible terms. The better the paper, the more attention should be paid to this, because the audience is large. Strictly speaking, I personally believe this kind of multi-task learning cannot be called continual learning, especially given the training regime required by the method in [1]. Continual learning requires the ability to learn from new data; the method in this paper only provides the ability to learn new tasks, not to keep learning from new data. For example, suppose task 1 has already been learned; the method in [1] lets the network go on to learn task 2 without too much catastrophic forgetting, largely retaining the knowledge of task 1. But now suppose new data relevant to task 1 arrives, a new case, a scenario not considered when task 1 was first trained. Can we simply feed it into the network for training? That is difficult, because the method does not consider this situation. Existing approaches to mitigating catastrophic forgetting in deep learning all share this shortcoming and sidestep the problem. Since the problem is avoided, why not use multi-task learning as the title, instead of continual learning, lifelong learning, or incremental learning, which do not match the method? The screenshots below are the notes I took when I first started reading paper [1]: at first I tried to take good notes, but as I wrote, the style drifted and I started venting; looking back, it is funny. I have to say the method is good and worth learning from, but the title easily makes readers think that this kind of learning is continual learning, is lifelong learning, forming a preconception taken as a theorem or a truth, which I think is not good. The above is only my personal view; comments, rebuttals, and exchanges are welcome!
[1] F. Zenke, B. Poole, and S. Ganguli, "Continual Learning Through Synaptic Intelligence," arXiv:1703.04200 [cs, q-bio, stat], Mar. 2017.