Everyone in deep learning is familiar with the parameter update methods, SGD, Adam, and so on, and there is no shortage of discussion about which is better. But have you paid particular attention to the learning rate decay strategy?

When training a neural network, the learning rate controls how fast the parameters are updated. If the learning rate is too small, parameter updates become very slow; if it is too large, the search process oscillates and the parameters keep hovering around the optimal value. Learning rate decay is therefore introduced so that the learning rate gradually decreases as training progresses.

The learning rate decay strategies live in tensorflow/tensorflow/python/training/learning_rate_decay.py; a call to a function such as tf.train.exponential_decay is all it takes.
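Before looking at each schedule, here is a minimal sketch of how such a decay function is wired into an optimizer (the toy variable, loss, and values below are purely illustrative, not from any particular model): the schedule reads global_step, and passing global_step to minimize() makes the optimizer increment it on every update.

import tensorflow as tf

# A toy variable and loss, just to illustrate the wiring.
w = tf.Variable(5.0)
loss = tf.square(w)

global_step = tf.Variable(0, name='global_step', trainable=False)
# Decay the lr by a factor of 0.9 every 10 steps.
learning_rate = tf.train.exponential_decay(
    learning_rate=0.5, global_step=global_step,
    decay_steps=10, decay_rate=0.9, staircase=True)

# Passing global_step to minimize() increments it on every update,
# which in turn drives the decay schedule.
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(30):
        _, lr = sess.run([train_op, learning_rate])
    print('learning rate after 30 steps:', lr)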

Contents

1. Exponential-based decay
  1.1 exponential_decay
  1.2 piecewise_constant
  1.3 polynomial_decay
  1.4 natural_exp_decay
  1.5 inverse_time_decay
2. Cosine-based decay
  2.1 cosine_decay
  2.2 cosine_decay_restarts
  2.3 linear_cosine_decay
  2.4 noisy_linear_cosine_decay
3. Custom
  3.1 auto_learning_rate_decay
4. Summary

Below, I'll walk through each lr decay function one by one in IPython.

1. Exponential-based decay

The following implementations are all based on exponential decay. My personal take is that their weakness is that the lr drops quickly at the start, which on complex problems can lead to rapid convergence to a local minimum before the parameter space has been explored properly.

1.1 Exponential decay (exponential_decay)

exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                  staircase=False, name=None)

Exponential lr decay is the most common decay method and is used in a large number of models. Parameters:

  • learning_rate: the initial learning rate.
  • global_step: non-negative global step used in the decay computation; it drives the decay exponent step by step.
  • decay_steps: number of decay steps; must be positive. Determines the decay period.
  • decay_rate: the decay rate.
  • staircase: if True, the learning rate decays at discrete intervals (it stays constant within each decay_steps interval); if False, it is standard continuous exponential decay.
  • name: the name of the operation, defaults to ExponentialDecay. (Optional)

The learning rate of exponential decay can be calculated as follows:

decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)  
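For example, with learning_rate=0.5, decay_rate=0.9 and decay_steps=10, the continuous schedule gives 0.5 * 0.9^(100/10) ≈ 0.174 at step 100; with staircase=True the exponent is floor(global_step / decay_steps), so the value is the same at step 100 but then stays constant for the next 10 steps.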

Advantages: simple and direct, fast convergence.

Example comparing staircase and continuous exponential decay:

# coding: utf-8
# exponential_decay
import matplotlib.pyplot as plt
import tensorflow as tf

#global_step = tf.Variable(0, name='global_step', trainable=False)
y = []
z = []
N = 200

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(N):
        # staircase (step-wise) decay
        learning_rate1 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10,
            decay_rate=0.9, staircase=True)
        # standard (continuous) exponential decay
        learning_rate2 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10,
            decay_rate=0.9, staircase=False)
        lr1 = sess.run([learning_rate1])
        lr2 = sess.run([learning_rate2])
        y.append(lr1[0])
        z.append(lr2[0])

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'g-', linewidth=2)
plt.title('exponential_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.show()

                                                              

Figure 1. exponential_decay example, where the red line is staircase=True (step-wise decay) and the green line is staircase=False (continuous decay)


1.2 Piecewise constant decay (piecewise_constant)

Piecewise constant decay sets a different constant value on each predefined interval: the first value is the initial learning rate and the later values are the rates used after each decay.

The function prototype

piecewise_constant(x, boundaries, values, name=None)

Parameters:

  • x: a 0-D scalar Tensor (typically the global step).
  • boundaries: a list or tensor of step boundaries, with strictly increasing entries.
  • values: the learning rate values for the intervals defined by boundaries.
  • name: operation name, defaults to PiecewiseConstant.

The piecewise constant decay method is similar to the staircase decay in exponential_decay, except that the value for each stage is specified by the user.

boundaries = [step_1, step_2, ..., step_n] defines the steps at which the lr decays, and values = [val_0, val_1, val_2, ..., val_n] defines the initial lr and the values used after each decay. Note that values must be one element longer than boundaries.

Features: this method lets users fine-tune the schedule for different tasks, dropping the learning rate to any value after any number of steps.

Code examples:

# piecewise_constant
import matplotlib.pyplot as plt
import tensorflow as tf

#global_step = tf.Variable(0, name='global_step', trainable=False)
boundaries = [10, 20, 30]
# values must have one more entry than boundaries (the last value here is illustrative)
learning_rates = [0.1, 0.07, 0.025, 0.0125]
y = []
N = 40

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(N):
        learning_rate = tf.train.piecewise_constant(
            global_step, boundaries=boundaries, values=learning_rates)
        lr = sess.run([learning_rate])
        y.append(lr[0])

x = range(N)
plt.plot(x, y, 'r-', linewidth=2)
plt.title('piecewise_constant')
plt.show()

 

                                                                  

Figure 2. piecewise_constant example


1.3 Polynomial decay (polynomial_decay)

polynomial_decay(learning_rate, global_step, decay_steps, end_learning_rate=0.0001,
                 power=1.0, cycle=False, name=None)

Parameters:

  • learning_rate: the initial learning rate.
  • global_step: non-negative global step used in the decay computation.
  • decay_steps: number of decay steps; must be positive.
  • end_learning_rate: the final (lowest) learning rate.
  • power: the power of the polynomial, defaults to 1.0 (linear).
  • cycle: whether the learning rate rises again after reaching the minimum.
  • name: the name of the operation, defaults to PolynomialDecay.

The function uses polynomial decay to take the initial learning rate (learning_rate) down to the final learning rate (end_learning_rate) over the given decay_steps.

The polynomially decayed learning rate is computed as:

global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) + end_learning_rate

# If cycle=True
decay_steps = decay_steps * ceil(global_step / decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                        (1 - global_step / decay_steps) ^ (power) + end_learning_rate

Code examples:

import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
N = 200
#global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(N):
        # cycle=False
        learning_rate1 = tf.train.polynomial_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            end_learning_rate=0.01, power=0.5, cycle=False)
        # cycle=True
        learning_rate2 = tf.train.polynomial_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            end_learning_rate=0.01, power=0.5, cycle=True)
        lr1 = sess.run([learning_rate1])
        lr2 = sess.run([learning_rate2])
        y.append(lr1[0])
        z.append(lr2[0])

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'g-', linewidth=2)
plt.plot(x, y, 'r--', linewidth=2)
plt.title('polynomial_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.show()

                                                                     

Figure 3. polynomial_decay example, where the green line is cycle=True and the red dashed line is cycle=False

As the figure shows, the learning rate reaches its minimum after decay_steps=50 iterations. With cycle=False, it then stays at the preset minimum; with cycle=True, the learning rate jumps back up and decays again.

The point of letting the learning rate rise and fall repeatedly in polynomial decay is this: in the later stage of training, a learning rate that is too small may leave the network parameters stuck in a local optimum; raising the learning rate may let them jump out of it.

1.4 Natural exponential decay (natural_exp_decay)

natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate,
                  staircase=False, name=None)

Parameters:

  • learning_rate: the initial learning rate.
  • global_step: non-negative global step used in the decay computation.
  • decay_steps: number of decay steps.
  • decay_rate: the decay rate.
  • staircase: if True, discrete staircase decay (the learning rate stays constant within each decay_steps interval); if False, standard continuous decay.
  • name: the name of the operation, defaults to ExponentialTimeDecay.

natural_exp_decay has the same form as exponential_decay, except that the decay base is the natural constant e.

exponential_decay:  decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
natural_exp_decay:  decayed_learning_rate = learning_rate * exp(-decay_rate * global_step / decay_steps)
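As a quick comparison with the same settings (learning_rate=0.5, decay_rate=0.9, decay_steps=10), at step 100 exponential_decay gives 0.5 * 0.9^10 ≈ 0.174, while natural_exp_decay gives 0.5 * exp(-0.9 * 10) ≈ 6.2e-5, several orders of magnitude smaller.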

If staircase=True, the learning rate takes discrete values, updated once every decay_steps iterations.

Code examples:

import matplotlib.pyplot as plt
import tensorflow as tf

#global_step = tf.Variable(0, name='global_step', trainable=False)
y = []
z = []
w = []
N = 200

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(N):
        # staircase (step-wise) decay
        learning_rate1 = tf.train.natural_exp_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10,
            decay_rate=0.9, staircase=True)
        # standard natural exponential decay
        learning_rate2 = tf.train.natural_exp_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10,
            decay_rate=0.9, staircase=False)
        # exponential_decay for comparison
        learning_rate3 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10,
            decay_rate=0.9, staircase=False)
        lr1 = sess.run([learning_rate1])
        lr2 = sess.run([learning_rate2])
        lr3 = sess.run([learning_rate3])
        y.append(lr1[0])
        z.append(lr2[0])
        w.append(lr3[0])

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'g-', linewidth=2)
plt.plot(x, w, 'b-', linewidth=2)
plt.title('natural_exp_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.show()

                                                                          

Figure 4. Comparison of natural_exp_decay and exponential_decay, where the red line is the staircase curve of natural_exp_decay, the green line is the continuous natural_exp_decay, and the blue line is exponential_decay

As the figure shows, the natural exponential schedule decays much faster than exponential_decay; it suits networks that converge quickly or are easy to train.

1.5 Inverse time decay (inverse_time_decay)

inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate,
                   staircase=False, name=None)

Parameters:

  • learning_rate: the initial learning rate.
  • global_step: global step used in the decay computation.
  • decay_steps: number of decay steps.
  • decay_rate: the decay rate.
  • staircase: whether to apply discrete staircase decay (otherwise continuous).
  • name: the name of the operation, defaults to InverseTimeDecay.

inverse_time_decay applies inverse time (reciprocal) decay, computed as follows:

decayed_learning_rate = learning_rate / (1 + decay_rate * global_step / decay_steps)
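For example, with learning_rate=0.1, decay_rate=0.2 and decay_steps=20, the continuous schedule at step 100 gives 0.1 / (1 + 0.2 * 100 / 20) = 0.05.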

Code sample

import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
N = 200
#global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(N):
        # staircase (step-wise) decay
        learning_rate1 = tf.train.inverse_time_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=20,
            decay_rate=0.2, staircase=True)
        # continuous decay
        learning_rate2 = tf.train.inverse_time_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=20,
            decay_rate=0.2, staircase=False)
        lr1 = sess.run([learning_rate1])
        lr2 = sess.run([learning_rate2])
        y.append(lr1[0])
        z.append(lr2[0])

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'r-', linewidth=2)
plt.plot(x, y, 'g-', linewidth=2)
plt.title('inverse_time_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.show()

                                                                                      

Figure 5. inverse_time_decay example

The decay methods above are all similar, mostly exponential in nature. My personal take is that their weakness is that the lr drops quickly at the start, which on complex problems can lead to rapid convergence to a local minimum before the parameter space has been explored properly.

 

2. Cosine-based decay

The following implementations are based on the cosine function.

2.1 Cosine decay (cosine_decay)

cosine_decay(learning_rate, global_step, decay_steps, alpha=0.0, name=None)

Parameters:

  • learning_rate: the initial learning rate.
  • global_step: global step used in the decay computation.
  • decay_steps: number of decay steps.
  • alpha: minimum learning rate value, as a fraction of learning_rate.
  • name: the name of the operation, defaults to CosineDecay.

cosine_decay is an lr decay strategy proposed only about a year ago; its basic shape is the cosine function. It is implemented following the paper SGDR: Stochastic Gradient Descent with Warm Restarts.

The calculation steps are as follows:

global_step = min(global_step, decay_steps)
cosine_decay = 0.5 * (1 + cos(pi * global_step / decay_steps))
decayed = (1 - alpha) * cosine_decay + alpha
decayed_learning_rate = learning_rate * decayed

The sample code

import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
N = 200
#global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(N):
        # alpha = 0.0
        learning_rate1 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=150, alpha=0.0)
        # alpha = 0.3
        learning_rate2 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=150, alpha=0.3)
        lr1 = sess.run([learning_rate1])
        lr2 = sess.run([learning_rate2])
        y.append(lr1[0])
        z.append(lr2[0])

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'r-', linewidth=2)
plt.plot(x, y, 'g-', linewidth=2)
plt.title('cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.show()

alpha acts as a baseline, ensuring that the lr never falls below a certain value. The effect of different alpha values is shown below:

                                                                          

Figure 6. cosine_decay example, where the red line is alpha=0.3 and the green line is alpha=0.0

2.2 Cosine decay with restarts (cosine_decay_restarts)

cosine_decay_restarts(learning_rate, global_step, first_decay_steps, t_mul=2.0,
                      m_mul=1.0, alpha=0.0, name=None)

Parameters:

  • learning_rate: a scalar float32 or float64 Tensor or Python number; the initial learning rate.
  • global_step: a scalar int32 or int64 Tensor or Python number; the global step used in the decay computation.
  • first_decay_steps: a scalar int32 or int64 Tensor or Python number; the number of decay steps in the first cycle.
  • t_mul: a scalar float32 or float64 Tensor or Python number; used to derive the number of iterations in the i-th cycle.
  • m_mul: a scalar float32 or float64 Tensor or Python number; used to derive the initial learning rate of the i-th cycle.
  • alpha: a scalar float32 or float64 Tensor or Python number; minimum learning rate value, as a fraction of learning_rate.
  • name: string; optional name of the operation, defaults to 'SGDRDecay'.

cosine_decay_restarts is the cyclical version of cosine_decay. first_decay_steps is the number of steps in the first complete decay; t_mul means that the number of steps in each subsequent cycle is t_mul times that of the previous one, and m_mul means that the initial lr at each restart is m_mul times the initial value of the previous cycle.

Code sample

# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
EPOCH = 100
global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(EPOCH):
        # first cycle lasts 40 steps
        learning_rate1 = tf.train.cosine_decay_restarts(
            learning_rate=0.1, global_step=global_step, first_decay_steps=40)
        # first cycle lasts 60 steps
        learning_rate2 = tf.train.cosine_decay_restarts(
            learning_rate=0.1, global_step=global_step, first_decay_steps=60)
        lr1 = sess.run([learning_rate1])
        lr2 = sess.run([learning_rate2])
        y.append(lr1[0])
        z.append(lr2[0])

x = range(EPOCH)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'b-', linewidth=2)
plt.title('cosine_decay_restarts')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(labels=['decay_steps=40', 'decay_steps=60'], loc='upper right')
plt.show()

                                          

Figure 7. cosine_decay_restarts example, where the red line is first_decay_steps=40 and the blue line is first_decay_steps=60 (both with the default t_mul=2.0, m_mul=1.0)

The cosine schedule mimics the process of first searching for a promising region with a large lr and then converging quickly with a small lr; in addition, the cyclical effect brought by restarts may add one or two points of accuracy.
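The example above keeps the default restart parameters; below is a minimal sketch of how t_mul and m_mul would be passed (the specific values are illustrative, not a recommendation):

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# Each cycle is twice as long as the previous one (t_mul=2.0) and restarts
# at half of the previous cycle's initial lr (m_mul=0.5); values are illustrative.
learning_rate = tf.train.cosine_decay_restarts(
    learning_rate=0.1,
    global_step=global_step,
    first_decay_steps=40,
    t_mul=2.0,
    m_mul=0.5,
    alpha=0.0)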

2.3 Linear cosine decay (linear_cosine_decay)

linear_cosine_decay(learning_rate, global_step, decay_steps, num_periods=0.5,
                    alpha=0.0, beta=0.001, name=None)

Parameters:

  • learning_rate: the initial learning rate.
  • global_step: global step used in the decay computation.
  • decay_steps: number of decay steps.
  • num_periods: number of periods in the cosine part of the decay.
  • alpha: see the calculation.
  • beta: see the calculation.
  • name: the name of the operation, defaults to LinearCosineDecay.

linear_cosine_decay comes from the paper Neural Optimizer Search with RL and is mainly used in reinforcement learning; I have not tried it myself. As can be seen, it is also a cosine-based decay strategy.

 

Figure 9. Linear_cosine_decay example

2.4 Noisy linear cosine decay (noisy_linear_cosine_decay)

noisy_linear_cosine_decay(learning_rate, global_step, decay_steps, initial_variance=1.0,
                          variance_decay=0.55, num_periods=0.5, alpha=0.0, beta=0.001,
                          name=None)

Noisy linear cosine decay applies noise on top of the linear cosine schedule; the calculation is otherwise the same as linear_cosine_decay.

Parameters:

  • learning_rate: the initial learning rate.
  • global_step: global step used in the decay computation.
  • decay_steps: number of decay steps.
  • initial_variance: initial variance of the noise.
  • variance_decay: how fast the noise variance decays.
  • num_periods: number of periods in the cosine part of the decay.
  • alpha: see the calculation.
  • beta: see the calculation.
  • name: the name of the operation, defaults to NoisyLinearCosineDecay.
#!/usr/bin/python
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf

y = []
z = []
w = []
N = 200
global_step = tf.Variable(0, name='global_step', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for global_step in range(N):
        # cosine decay
        learning_rate1 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50, alpha=0.5)
        # linear cosine decay
        learning_rate2 = tf.train.linear_cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            num_periods=0.2, alpha=0.5, beta=0.2)
        # noisy linear cosine decay
        learning_rate3 = tf.train.noisy_linear_cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            initial_variance=0.01, variance_decay=0.1,
            num_periods=0.2, alpha=0.5, beta=0.2)
        lr1 = sess.run([learning_rate1])
        lr2 = sess.run([learning_rate2])
        lr3 = sess.run([learning_rate3])
        y.append(lr1[0])
        z.append(lr2[0])
        w.append(lr3[0])

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'b-', linewidth=2)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, w, 'g-', linewidth=2)
plt.title('cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.show()

                           

 

3. Custom decay

3.1 auto_learning_rate_decay

Of course, you can also define your own learning rate decay strategy, for example by monitoring the validation loss or accuracy: as long as the loss keeps decreasing (or the accuracy keeps increasing) effectively over some window, keep the lr; otherwise reduce it, and the more the loss rises (or the accuracy drops), the faster the lr is reduced, and so on. A rough sketch of this idea is shown below.
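A rough sketch of such a plateau-style scheme (this is not a TensorFlow API; the class name, patience, factor and other values are all made up for illustration):

class AutoLRDecay(object):
    """Hypothetical helper: reduce the lr when the monitored validation loss stops improving.

    All names and default values here are illustrative, not part of any library.
    """

    def __init__(self, initial_lr, factor=0.5, patience=5, min_lr=1e-6):
        self.lr = initial_lr
        self.factor = factor        # multiply lr by this when there is no improvement
        self.patience = patience    # how many checks to wait before decaying
        self.min_lr = min_lr
        self.best_loss = float('inf')
        self.num_bad_checks = 0

    def update(self, val_loss):
        """Call once per evaluation; returns the lr to use next."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.num_bad_checks = 0
        else:
            self.num_bad_checks += 1
            if self.num_bad_checks >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.num_bad_checks = 0
        return self.lr


# Usage sketch: feed the returned lr into the optimizer, e.g. via a placeholder.
# scheduler = AutoLRDecay(initial_lr=0.1)
# lr_to_feed = scheduler.update(current_val_loss)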

 

4. Summary

In my own practice, exponential_decay is the one I use most often, but do give cosine_decay_restarts a try; it may well bring you a pleasant surprise.

References

The mystery of Learning Rate Decay in Tensorflow

TensorFlow learning — Learning rate decay /learning rate decay

How to set the learning rate in TensorFlow

Learning rate of deep model training

Image Classification Training Skills Collection (Paper Notes)

Github.com/zsweet/blog…