Gradient clipping

A common technique to reduce the problem of exploding gradients is to simply clip the gradients during backpropagation so that they never exceed a certain threshold (this is especially useful for recurrent neural networks; see Chapter 14). This is known as gradient clipping. In general, people now prefer Batch Normalization, but it is still useful to know about gradient clipping and how to implement it.

In TensorFlow, the optimizer's minimize() function takes care of both computing the gradients and applying them, so instead you must first call the optimizer's compute_gradients() method, then use the clip_by_value() function to create an operation that clips the gradients, and finally create an operation to apply the clipped gradients using the optimizer's apply_gradients() method:



threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

As usual, you will run this training_op at every training step. It will compute the gradients, clip them between -1.0 and 1.0, and apply them. The threshold is a hyperparameter you can tune.

Reusing pretrained layers

It’s usually not a good idea to train a very large DNN from scratch. Instead, you should always try to find an existing neural network to do a task similar to the one you’re trying to solve, and then reuse the lower layers of that network: this is called transfer learning. Not only will this speed up training dramatically, but it will require less training data.

For example, suppose you have access to a DNN trained to classify images into 100 different categories, including animals, plants, vehicles, and everyday objects. You now want to train a DNN to classify specific types of vehicles. The tasks are very similar, so you should try to reuse parts of the first network (see Figure 11-4).



If the input images for the new task are not the same size as those used in the original task, a preprocessing step must be added to resize them to the size expected by the original model. More generally, transfer learning will work well only if the inputs have similar low-level features.

Reusing a TensorFlow model

If the original model was trained with TensorFlow, you can simply restore it and train it on the new task:

[...]  # construct the original model


with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    # continue training the model...

Complete code:



n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_hidden3 = 50
n_hidden4 = 50
n_hidden5 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3")
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4")
    hidden5 = tf.layers.dense(hidden4, n_hidden5, activation=tf.nn.relu, name="hidden5")
    logits = tf.layers.dense(hidden5, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

learning_rate = 0.01
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

init = tf.global_variables_initializer()
saver = tf.train.Saver()



with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")

    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_new_model_final.ckpt")



In general, however, you will want to reuse only parts of the original model (as we will discuss in a moment). A simple solution is to configure the Saver to restore only a subset of the variables from the original model. For example, the following code restores only hidden layers 1, 2, and 3:



n_inputs = 28 * 28  # MNIST
n_hidden1 = 300  # reused
n_hidden2 = 50   # reused
n_hidden3 = 50   # reused
n_hidden4 = 20   # new!
n_outputs = 10   # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")  # reused
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")  # reused
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3")  # reused
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4")  # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")  # new!

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

[...]  # build new model with the same definition as before for hidden layers 1-3

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]")  # regular expression
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict)  # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")

    for epoch in range(n_epochs):                                        # not shown in the book
        for iteration in range(mnist.train.num_examples // batch_size):  # not shown
            X_batch, y_batch = mnist.train.next_batch(batch_size)        # not shown
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})    # not shown
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,    # not shown
                                                y: mnist.test.labels})   # not shown
        print(epoch, "Test accuracy:", accuracy_val)                     # not shown

    save_path = saver.save(sess, "./my_new_model_final.ckpt")






First we build the new model, making sure to copy hidden layers 1 through 3 of the original model. We also create a node to initialize all variables. Then we get the list of all the variables we just created with trainable=True (which is the default), and we keep only those whose scope matches the regular expression hidden[123] (that is, we get all trainable variables in hidden layers 1 through 3). Next, we create a dictionary mapping the name of each variable in the original model to its name in the new model (generally you want to keep the exact same names). Then we create a Saver that will restore only these variables, and another Saver to save the entire new model, not just layers 1 through 3. We then start a session, initialize all variables in the model, and restore the variable values from layers 1 through 3 of the original model. Finally, we train the model on the new task and save it.

The more similar the tasks are, the more layers you can reuse (starting with the lower layers). For very similar tasks, you can try keeping all the hidden layers and replacing only the output layer.

Reusing models from other frameworks

If the model was trained using another framework, you will need to load the weights manually (for example, using Theano code if it was trained with Theano) and then assign them to the appropriate variables. This can be quite tedious. For example, the following shows how you could copy the weights and biases of the first hidden layer of a model trained with another framework.
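A minimal sketch of one way to do this is shown below; here original_w and original_b are hypothetical NumPy arrays holding the weights and biases exported from the other framework, and the first hidden layer is built with tf.layers.dense() as in the earlier code (so its variables are named hidden1/kernel and hidden1/bias):

original_w = [...]  # load the hidden-layer weights from the other framework
original_b = [...]  # load the hidden-layer biases from the other framework

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
# [...] build the rest of the model

# Get a handle on the variables created by tf.layers.dense() for layer hidden1
with tf.variable_scope("", default_name="", reuse=True):  # root scope
    hidden1_weights = tf.get_variable("hidden1/kernel")
    hidden1_biases = tf.get_variable("hidden1/bias")

# Placeholders and assignment ops dedicated to loading the external values
original_weights = tf.placeholder(tf.float32, shape=(n_inputs, n_hidden1))
original_biases = tf.placeholder(tf.float32, shape=(n_hidden1,))
assign_hidden1_weights = tf.assign(hidden1_weights, original_weights)
assign_hidden1_biases = tf.assign(hidden1_biases, original_biases)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    sess.run(assign_hidden1_weights, feed_dict={original_weights: original_w})
    sess.run(assign_hidden1_biases, feed_dict={original_biases: original_b})
    # [...] train the model on the new task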



Freezing the lower layers

The lower layers of the first DNN have probably learned to detect low-level features in images that will be useful in both image classification tasks, so you can reuse these layers as they are. When training the new DNN, it is generally a good idea to "freeze" their weights: if the lower-layer weights are fixed, the higher-layer weights will be easier to train (because they do not have to learn a moving target). The simplest way to freeze the lower layers during training is to give the optimizer the list of variables to train, excluding the variables from the lower layers, as in the sketch below.
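For example, a sketch that reuses the scope names from the earlier code (hidden layers 3 and 4 plus the output layer are the ones left trainable):

train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                               scope="hidden[34]|outputs")   # layers 3, 4 and output
training_op = optimizer.minimize(loss, var_list=train_vars)  # layers 1 and 2 stay frozen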



The first line gets the list of all trainable variables in hidden layers 3 and 4 and in the output layer. This leaves out the variables in hidden layers 1 and 2. Next, we provide this restricted list of trainable variables to the optimizer's minimize() function. And that's it! Layers 1 and 2 are now frozen: they will not change during training (these are often called frozen layers).

Caching the frozen layers

Since the frozen layers will not change, the output of the topmost frozen layer can be cached for each training instance. Since training runs through the entire dataset many times, this will give you a huge speed boost, because each training instance only needs to go through the frozen layers once (instead of once per epoch). For example, you could first run the whole training set through the lower layers (assuming you have enough memory):

hidden2_outputs = sess.run(hidden2, feed_dict={X: X_train})

Then during training, instead of building batches of training instances, you build batches from the cached outputs of hidden layer 2 and feed them to the training operation.
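A sketch of that training loop (assuming hidden2_outputs was computed as above, y_train holds the training labels, and n_epochs and n_batches are hypothetical values):

import numpy as np

n_epochs = 100
n_batches = 500

for epoch in range(n_epochs):
    # shuffle the cached layer-2 outputs and the labels the same way
    shuffled_idx = np.random.permutation(len(hidden2_outputs))
    hidden2_batches = np.array_split(hidden2_outputs[shuffled_idx], n_batches)
    y_batches = np.array_split(y_train[shuffled_idx], n_batches)
    for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
        # feed the cached activations directly to hidden2, bypassing layers 1 and 2
        sess.run(training_op, feed_dict={hidden2: hidden2_batch, y: y_batch})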



The last line runs the training operation defined earlier (which freezes layers 1 and 2) and feeds it a batch of outputs from the second hidden layer (as well as the targets for that batch). Because we give TensorFlow the output of hidden layer 2, it does not try to evaluate it (or any node it depends on).

Tweaking, dropping, or replacing the upper layers

The output layer of the original model should usually be replaced, because it is most likely useless for the new task, and it may not even have the right number of outputs for the new task.

Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, because the high-level features that are most useful for the new task may differ significantly from those that were most useful for the original task. You need to find the right number of layers to reuse.

Try freezing all the copied layers first, then train the model and see how it performs. Then try unfreezing one or two of the upper hidden layers and let backpropagation tweak them, and see if performance improves. The more training data you have, the more layers you can unfreeze.

If you still cannot get good performance and you have little training data, try dropping the top hidden layer and freezing all the remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you can try replacing the top hidden layers rather than dropping them, and even adding more hidden layers.

Model Zoos

Where can you find a neural network trained for a task similar to the one you want to solve? The first place to look is obviously your own catalog of models. This is a good reason to save all your models and organize them so that you can easily retrieve them later. Another option is to search in a model zoo. Many people train machine learning models for a variety of tasks and kindly release their pretrained models to the public.

TensorFlow has its own model zoo at https://github.com/tensorflow/models. In particular, it includes most of the state-of-the-art image classification networks such as VGG, Inception, and ResNet (see Chapter 13, and check the models/slim directory), including the code, pretrained models, and tools to download popular image datasets.

Another popular model zoo is Caffe's Model Zoo. It also contains many computer vision models (e.g., LeNet, AlexNet, ZFNet, GoogLeNet, VGGNet, Inception) trained on various datasets (e.g., ImageNet, Places database, CIFAR10). Saumitro Dasgupta wrote a converter, available at https://github.com/ethereon/caffe-tensorflow.

Unsupervised pre-training



Suppose you want to solve a complex task and you don't have much labeled training data, and unfortunately you cannot find a model trained on a similar task. Don't lose all hope! First, you should of course try to collect more labeled training data, but if that is too difficult or too expensive, you may still be able to perform unsupervised pretraining (see Figure 11-5). That is, if you have plenty of unlabeled training data, you can try to train the layers one by one, starting at the lowest layer and moving up, using an unsupervised feature-detection algorithm such as restricted Boltzmann machines (RBMs; see Appendix E) or autoencoders (see Chapter 15). Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, the network can be fine-tuned using supervised learning (i.e., with backpropagation).

It is a fairly long and tedious process, but it usually works well. In fact, it is the technique Geoffrey Hinton and his team used in 2006 that led to the revival of neural networks and the success of deep learning. Until 2010, unsupervised pretraining (typically using RBMs) was the norm for deep networks, and purely supervised training of DNNs became more common only after the vanishing gradients problem was alleviated. However, unsupervised pretraining (today typically using autoencoders rather than RBMs) is still a good option when you have a complex task to solve, no similar model to reuse, and little labeled training data but plenty of unlabeled training data. Another option is to come up with a supervised task for which you can easily collect large amounts of labeled training data, and then use transfer learning, as described earlier. For example, if you want to train a model to recognize your friends in pictures, you could download millions of faces from the internet and train a classifier to detect whether two faces are the same, then use that classifier to compare a new image with each photo of your friends.

Pretraining on an auxiliary task

The final option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, and then reuse the lower layers of that network for your actual task. The first neural network's lower layers will learn feature detectors that will likely be reusable by the second neural network.

For example, if you want to build a system to recognize faces, you might only have a few photos of each person — clearly not enough to train a good classifier. Collecting hundreds of photos of each person would be impractical. But you could collect lots of photos of random people on the internet and train a first neural network to detect whether two different photos belong to the same person. Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier with very little training data.

It is usually quite cheap to collect unlabeled training samples, but quite expensive to label them. A common technique in this case is to label all training samples as "good," then generate many new instances by corrupting the good ones and label the corrupted instances as "bad." You can then train a first neural network to classify instances as good or bad. For example, you could download millions of sentences, label them as "good," then randomly change a word in each sentence and label the resulting sentences as "bad." If a neural network can tell that "The dog sleeps" is a good sentence but "The dog they" is bad, it probably knows quite a lot about language. Reusing its lower layers can help with many language-processing tasks.

Another approach is to train a first network to output a score for each training instance and use a loss function that ensures the score of a good instance is greater than the score of a bad instance by at least a certain margin. This is called max margin learning.

Faster optimizers

Training a very large deep neural network can be very slow. So far, we have seen four ways to speed up training (and reach a better solution): applying a good initialization strategy for the connection weights, using a good activation function, using Batch Normalization, and reusing parts of a pretrained network. Another huge speed boost comes from using an optimizer that is faster than the regular Gradient Descent optimizer. In this section, we will introduce the most popular ones: Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.

Spoiler: the conclusion of this section is that you should almost always use Adam optimization, so if you don't care how it works, simply replace your GradientDescentOptimizer with an AdamOptimizer and skip to the next section! With such a small change, training is usually several times faster. However, Adam optimization does have three hyperparameters (plus the learning rate) that can be tuned. The defaults usually work fine, but if you need to adjust them, it may be helpful to know what they do. Adam optimization combines several ideas from other optimization algorithms, so it is useful to look at these algorithms first.

Momentum optimization

Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start slowly, but it will quickly pick up speed until it reaches its terminal velocity (if there is some friction or air resistance). This is the very simple idea behind Momentum optimization, proposed by Boris Polyak in 1964. In contrast, regular Gradient Descent simply takes small, regular steps down the slope, so it takes much more time to reach the bottom.

Recall that Gradient Descent updates the weights θ by directly subtracting the gradient of the loss function J(θ) with respect to the weights, multiplied by the learning rate η; that is, θ ← θ − η∇θJ(θ). It does not care about what the earlier gradients were. If the local gradient is tiny, it goes very slowly.



Momentum optimization cares about the previous gradients: at each iteration, it adds the local gradient (multiplied by the learning rate η) to the momentum vector m, and updates the weights by simply subtracting this momentum vector (see Equation 11-4). In other words, the gradient is used as an acceleration, not a velocity. To simulate a friction mechanism and avoid the momentum growing too large, the algorithm introduces a new hyperparameter β, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.
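Concretely, the update just described can be written as:

$$
\mathbf{m} \leftarrow \beta \mathbf{m} + \eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \qquad
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \mathbf{m}
$$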

You can easily verify that if the gradient remains constant, the terminal velocity (that is, the maximum size of the weight updates) is equal to that gradient times the learning rate η times 1/(1−β). For example, if β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so Momentum optimization ends up going 10 times faster than Gradient Descent! This makes Momentum optimization much faster than Gradient Descent. In particular, we saw in Chapter 4 that when the inputs have very different scales, the loss function looks like an elongated bowl (see Figure 4-7). Gradient Descent goes down the steep slope quite fast, but it then takes a very long time to go down the valley. In contrast, Momentum optimization rolls down the valley faster and faster until it reaches the bottom (the optimum).



In deep neural networks that do not use Batch Normalization, the upper layers often end up having inputs with very different scales, so using Momentum optimization can be of great help. It can also help roll past local optima.

Because of the momentum, the optimizer may overshoot a little, then come back, overshoot again, and oscillate like this many times before settling at the minimum. This is one of the reasons why it is good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.

Implementing Momentum optimization in TensorFlow is a simple matter: just replace the GradientDescentOptimizer with a MomentumOptimizer, then lie back and profit!
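For example, something like this (assuming learning_rate is defined as in the earlier code; 0.9 is the typical momentum value mentioned above):

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)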



One disadvantage of momentum optimization is that it adds another hyperparameter to adjust. However, a momentum value of 0.9 usually works well in practice and is almost always faster than gradient descent.

Nesterov accelerated gradient

A small variant of Momentum optimization, proposed by Yurii Nesterov in 1983, is almost always faster than vanilla Momentum optimization. The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the loss function not at the local position but slightly ahead in the direction of the momentum (see Equation 11-5). The only difference from vanilla Momentum optimization is that the gradient is measured at θ + βm rather than at θ.
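In the same notation as the Momentum update above, this reads:

$$
\mathbf{m} \leftarrow \beta \mathbf{m} + \eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta} + \beta \mathbf{m}), \qquad
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \mathbf{m}
$$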



This small tweak works because, in general, the momentum vector will be pointing in the right direction (that is, toward the optimum), so it is slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position, as shown in Figure 11-6 (where ∇1 represents the gradient of the loss function measured at the starting point θ, and ∇2 represents the gradient at the point θ + βm).



As you can see, the Nesterov update ends up slightly closer to the optimum. Over time, these small improvements add up, and NAG ends up being significantly faster than regular Momentum optimization. Also, note that when the momentum pushes the weights across a valley, ∇1 continues to push farther across the valley, while ∇2 pushes back toward the bottom of the valley. This helps reduce oscillations, which leads to faster convergence.

NAG almost always speeds up training compared to regular Momentum optimization. To use it, simply set use_nesterov=True when creating the MomentumOptimizer.
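For example, a sketch under the same assumptions as the Momentum example above:

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9, use_nesterov=True)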



AdaGrad

Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the steepest slope, then slowly goes down the bottom of the valley. It would be nice if the algorithm could detect this early on and correct its direction to point a bit more toward the global optimum.

The AdaGrad algorithm accomplishes this by scaling down the gradient vector along the steepest dimensions (see Equation 11-6).
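In standard notation (⊗ denotes element-wise multiplication, ⊘ element-wise division, and ε is a tiny smoothing term to avoid division by zero), the two steps are:

$$
\mathbf{s} \leftarrow \mathbf{s} + \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \otimes \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \qquad
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \oslash \sqrt{\mathbf{s} + \varepsilon}
$$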



In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum (see Figure 11-7). Another advantage is that it requires much less tuning of the learning rate hyperparameter η.



AdaGrad often performs well for simple quadratic problems, but unfortunately it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though TensorFlow has an AdagradOptimizer, you should not use it to train deep neural networks (although it may be effective for simpler tasks such as Linear Regression).

RMSProp

Although AdaGrad slows down a bit too fast and ends up never converging to the global optimum, the RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (rather than all gradients since the beginning of training). It does so by using exponential decay in the first step (see Equation 11-7).
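In the same notation as above, the RMSProp update is:

$$
\mathbf{s} \leftarrow \beta \mathbf{s} + (1 - \beta) \, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \otimes \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \qquad
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \oslash \sqrt{\mathbf{s} + \varepsilon}
$$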



The decay rate β is typically set to 0.9. Yes, it is yet another new hyperparameter, but this default value usually works well, so you may not need to tune it at all.

As you might expect, TensorFlow has an RMSPropOptimizer class:
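# A sketch with typical values: decay corresponds to the decay rate β discussed
# above, and learning_rate is assumed to be defined as in the earlier code.
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                      momentum=0.9, decay=0.9, epsilon=1e-10)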



Except for very simple problems, this optimizer almost always performs better than AdaGrad. It also generally performs better than momentum optimization and Nesterov acceleration gradients. In fact, this was the preferred optimization algorithm for many researchers until Adam optimization came along.

Adam optimization

Adam, which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization, it keeps track of an exponentially decaying average of past gradients, and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients (see Equation 11-8).
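In the same notation as above, the five steps of Adam are:

$$
\begin{aligned}
1. \quad & \mathbf{m} \leftarrow \beta_1 \mathbf{m} + (1 - \beta_1) \, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \\
2. \quad & \mathbf{s} \leftarrow \beta_2 \mathbf{s} + (1 - \beta_2) \, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \otimes \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \\
3. \quad & \hat{\mathbf{m}} \leftarrow \mathbf{m} / (1 - \beta_1^{\,t}) \\
4. \quad & \hat{\mathbf{s}} \leftarrow \mathbf{s} / (1 - \beta_2^{\,t}) \\
5. \quad & \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \, \hat{\mathbf{m}} \oslash \sqrt{\hat{\mathbf{s}} + \varepsilon}
\end{aligned}
$$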



t represents the iteration number (starting at 1).

If you just look at steps 1, 2, and 5, you will notice Adam's close similarity to both Momentum optimization and RMSProp. The only difference is that step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor (the decaying average is just 1 − β1 times the decaying sum). Steps 3 and 4 are something of a technical detail: since m and s are initialized to 0, they will be biased toward 0 at the beginning of training, so these two steps help boost m and s at the beginning of training.

The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is usually initialized to 0.999. As mentioned earlier, the smoothing term ε is usually initialized to a tiny number such as 10^-8. These are the default values of TensorFlow's AdamOptimizer class, so you can simply use the following.
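A one-line example (assuming learning_rate is defined as in the earlier code):

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)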



In fact, since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.

All optimization techniques discussed so far rely only on the first-order partial derivatives (Jacobians). The optimization literature contains amazing algorithms based on second-order partial derivatives (Hessians). Unfortunately, these algorithms are very hard to apply to deep neural networks because there are n² Hessian values per output (where n is the number of parameters), rather than only n Jacobian values per output. Since DNNs typically have tens of thousands of parameters, second-order optimization algorithms often do not even fit in memory, and even when they do, computing the Hessians is just too slow.


The original article was published on June 24, 2018

Author: ApacheCN (translation)
