
Preface

Many people say that training a neural network is like alchemy: it is hard to see inside the black box. In practice, though, a collection of training tricks can go a long way toward getting the most out of a model and reaching good accuracy on the task. These tricks are an indispensable part of training neural networks.

This article tries to cover the various small tricks needed during training. It will not be perfect, and space keeps each point brief, but these are concepts worth keeping in mind. With them at hand, training a neural network becomes much more manageable, and alchemy is no longer aimless ~

Avoid overfitting

Overfitting is when the training-set loss is far smaller than the validation-set loss; underfitting is when the training-set loss is larger than the validation-set loss.

To be clear about what "far smaller" means: if the training-set loss is only a little lower than the validation-set loss and both are of the same order of magnitude (e.g. 0.8 vs. 0.9), that is not overfitting. The overfitting we usually worry about looks more like 0.8 (training loss) versus 2.0 (validation loss), which are not on the same scale.

Dropout and overfitting

Dropout works much like a bagging ensemble: it reduces variance by, in effect, averaging over many sub-networks that "vote" on the result. We typically apply dropout to fully connected layers rather than convolutional layers. That said, dropout does not suit every situation, so do not apply it blindly.

Dropout is generally reserved for fully connected layers; convolutional layers have relatively few parameters, so dropout is not really needed there, and the model's generalization ability is not noticeably affected without it. See the diagram below:

Fully connected layers usually sit at the ends of the network, while the hidden layers in between are convolutional. So in general we use a high dropout probability on the fully connected layers and a low (or zero) dropout probability on the convolutional layers.
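As an illustration, here is a minimal PyTorch sketch, assuming a small image classifier with made-up layer sizes (none of this is from the original post), that applies dropout only to the fully connected head and leaves the convolutional layers alone:

```python
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self, num_classes=2, p_drop=0.5):
        super().__init__()
        # convolutional feature extractor: no dropout (assumes 3x224x224 inputs)
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # fully connected head: this is where dropout goes
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 56 * 56, 256), nn.ReLU(),
            nn.Dropout(p=p_drop),              # high dropout probability on the FC layer
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```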

See the related discussions for further details:

  • www.reddit.com/r/MachineLe…
  • Blog.csdn.net/stdcoutzyx/…

Is the dataset corrupted?

The quality of the dataset is an important prerequisite for how well an algorithm generalizes. When we build a deep learning task, the dataset we end up with is rarely perfect: whether we collect it ourselves or use someone else's data, once the dataset is large enough, some images are bound to be damaged or simply wrong.

If a corrupted image slips in unnoticed, the code that reads it will usually raise an error, for example: IOError("image file is truncated").

I ran into exactly this situation in a Kaggle competition: the training data contained more than 100,000 images, and a dozen or so of them were incomplete (noticeably smaller than normal images) or outright corrupted and unreadable. In such cases we need to write code by hand to pick out the broken images so they do not take part in training.

So how do we filter these images out correctly?

  • Find the signature of the corruption, e.g. the image cannot be decoded at all, and write a small script to remove the unreadable images
  • Find images with missing content: their file size is usually much smaller than that of normal images, so filter out files below a size threshold (see the sketch after this list)
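A minimal sketch of such a filter, assuming PIL is available and using a made-up directory name and size threshold:

```python
import os
from PIL import Image

MIN_BYTES = 10 * 1024  # assumed threshold; tune for your dataset

def is_usable(path, min_bytes=MIN_BYTES):
    if os.path.getsize(path) < min_bytes:       # likely truncated / missing content
        return False
    try:
        with Image.open(path) as img:
            img.verify()                        # raises if the file is corrupted
        return True
    except (IOError, SyntaxError):
        return False

image_dir = "train_images"                      # hypothetical directory
bad = [f for f in os.listdir(image_dir)
       if not is_usable(os.path.join(image_dir, f))]
print(f"found {len(bad)} unusable images")
```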

Of course, datasets can be broken in many other ways, and we have to discover them case by case. It is best not to jump straight into training when you first get a dataset; check it carefully for problems first, which can save a lot of trouble later in training.

Hard negative mining

In any deep learning task we run into some "hard" samples that are more difficult to recognize than ordinary data. Samples that the model is especially prone to misclassify are called hard negatives.

Here’s an example.

For example, in a Kaggle competition task of identifying ships in remote-sensing images, the image set was cropped from large remote-sensing scenes, each image 768*768 pixels. In the simple classification setting (whether or not an image contains a ship), validation showed that the images most prone to misclassification were the following:

After looking at what these hardest images have in common, we can take targeted measures. The usual procedure is: first train a classifier on an initial set of positive and negative samples (typically all the positives plus a subset of the negatives of comparable size); then run the trained classifier over the remaining samples, add the negatives it misclassifies (the hard negatives) to the negative set, retrain the classifier, and repeat until some stopping condition is reached (for example, the classifier stops improving). The idea is to make the classifier learn the features that are hard to learn; put simply, practice makes perfect.
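A rough sketch of this loop, using a simple scikit-learn classifier and placeholder feature arrays rather than the competition's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def hard_negative_mining(X_pos, X_neg, n_rounds=5, seed=0):
    """Iteratively grow the negative set with misclassified negatives."""
    rng = np.random.default_rng(seed)
    # start with all positives and a random subset of negatives of similar size
    neg_idx = rng.choice(len(X_neg), size=min(len(X_pos), len(X_neg)), replace=False)
    active_neg = set(neg_idx.tolist())

    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        X_train = np.vstack([X_pos, X_neg[list(active_neg)]])
        y_train = np.concatenate([np.ones(len(X_pos)), np.zeros(len(active_neg))])
        clf.fit(X_train, y_train)

        # negatives the current model calls "positive" are the hard negatives
        preds = clf.predict(X_neg)
        hard = set(np.where(preds == 1)[0].tolist())
        if hard <= active_neg:          # no new hard negatives: stop
            break
        active_neg |= hard
    return clf
```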

This method is also used in Fast R-CNN:

In Fast R-CNN, region proposals whose IoU with the ground truth falls in [0.1, 0.5) are labeled as negatives, while those in [0, 0.1) are reserved for hard negative mining. Training typically feeds N = 2 images per batch and samples 128 RoIs in total, i.e. 64 RoIs per image, with positives and negatives in a 1:3 ratio, so 48 of each image's RoIs are drawn from the negatives.

See these two related papers:

[17] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
[37] K. Sung and T. Poggio. Example-based learning for view-based human face detection. Technical Report A.I. Memo No. 1521, Massachusetts Institute of Technology, 1994.

About learning rate

Finding the right learning rate

The learning rate is a very important hyperparameter. Its best value is uncertain in the face of different model sizes, different batch sizes, different optimizers and different datasets; we cannot pin down an accurate lr from experience alone. The only thing we can do is keep searching, during training, for the learning rate that best suits the current state.

For example, the figure below uses the lr_find() function in fastai to search for an appropriate learning rate. According to this learning-rate vs. loss curve, a suitable learning rate here is 1e-2.
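For readers outside fastai, here is a rough sketch of the idea behind such a learning rate finder (model, data loader and bounds are placeholders): increase the learning rate exponentially over one pass, record the loss at each step, and later pick a value somewhat before the loss starts to blow up.

```python
import torch

def lr_find(model, loader, loss_fn, lr_min=1e-7, lr_max=10.0, device="cpu"):
    """Sweep the learning rate exponentially over one pass and record (lr, loss)."""
    model = model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    n_steps = max(len(loader) - 1, 1)
    gamma = (lr_max / lr_min) ** (1.0 / n_steps)   # multiplicative lr growth per step
    history, lr = [], lr_min

    for x, y in loader:
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        if loss.item() > 4 * min(l for _, l in history):   # loss exploding: stop early
            break
        lr *= gamma
    return history   # plot loss vs. lr (log scale) and pick a value before the minimum
```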

If you want to learn more, see the blog post by fastai core developer Sylvain Gugger, How Do You Find A Good Learning Rate, and the paper Cyclical Learning Rates for Training Neural Networks.

Relationship between learning rate and batch size

In general, the larger the batch-size, the greater the learning rate used.

The principle is simple: the larger the batch size, the more confident we are in the gradient direction at each step, so we can move more decisively. A smaller batch size gives noisier, more erratic gradients, because a small batch cannot represent as many cases, so a smaller learning rate is needed to avoid taking a wrong step.
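A common rule of thumb from the large-batch training literature, the linear scaling rule, captures this: scale the learning rate in proportion to the batch size. A tiny sketch, with made-up baseline values:

```python
# Linear scaling rule (a heuristic, not a guarantee): if you multiply the
# batch size by k, multiply the learning rate by k as well.
base_batch_size = 256          # assumed reference configuration
base_lr = 0.1

def scaled_lr(batch_size, base_bs=base_batch_size, base=base_lr):
    return base * batch_size / base_bs

print(scaled_lr(512))   # 0.2
print(scaled_lr(64))    # 0.025
```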

The relationship between loss and learning rate lr can be seen below:



The figure above is from this article: Visualizing Learning Rate vs Batch Size

We can also see from the figure that as the batch size grows, the range of learning rates that give good test results shrinks. In other words, with a small batch size it is relatively easy to find a learning rate that works well, while with a large batch size you may have to search carefully for a suitable learning rate, which also makes large-batch distributed training harder in practice.

Therefore, when GPU memory allows, it is usually better to train with a larger batch size; once a suitable learning rate is found, convergence is faster. A larger batch size also avoids some of the issues batch normalization has with very small batches: github.com/pytorch/pyt…

More discussion of similar questions can be found in the related Zhihu answer: www.zhihu.com/question/64…

Differential learning rate and transfer learning

First, transfer learning. Transfer learning is a very common technique in deep learning: we take classical pre-trained models and train them directly on our own task. Even though the domains differ, the two tasks are connected through the weights that have already been learned.

As the figure above shows, we use the weights trained for model A to initialize and train our own model B, where model A might be ImageNet pre-trained weights and model B is the model we want to train to recognize cats and dogs.

So what does a differential learning rate have to do with transfer learning? When we take weights trained on another task and continue to optimize them, choosing the right learning rate becomes a very important issue.

Generally, the networks we design (as shown below) consist of an input layer, hidden layers and an output layer, and as depth increases the features the network learns become more abstract. Accordingly, the learning rates of the convolutional layers and the fully connected layers in the figure below should be set differently: in general, the convolutional layers get a lower learning rate, while the fully connected layers can use a somewhat higher one.

This is what "differential learning rate" means: setting different learning rates for different layers can improve the training of the network. For a more detailed introduction, please refer to the link below.

The example figures above are from: towardsdatascience.com/transfer-le…
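A minimal PyTorch sketch of differential learning rates via optimizer parameter groups; the backbone, head and learning rate values are placeholders:

```python
import torch
import torchvision

# Assumed setup: an ImageNet pre-trained backbone plus a new 2-class head.
model = torchvision.models.resnet18(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 2)

# Pre-trained convolutional layers get a small lr; the new head gets a larger one.
backbone_params = [p for name, p in model.named_parameters()
                   if not name.startswith("fc.")]
optimizer = torch.optim.SGD(
    [
        {"params": backbone_params, "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-2},
    ],
    lr=1e-4,          # default lr (each group above overrides it)
    momentum=0.9,
)
```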

Stochastic gradient descent with cosine annealing and warm restarts

"Cosine" refers to a curve shaped like the cosine function, and "annealing" means gradually decreasing; cosine annealing simply means the learning rate decays slowly along a cosine-shaped curve.

A warm restart means that after the learning rate has slowly decayed, it suddenly jumps back up (restarts) and then slowly decays again.

Combining the two gives the learning rate schedule shown below:

For a more detailed introduction, see the Zhihu discussion on how to tune machine learning algorithms, the paper SGDR: Stochastic Gradient Descent with Warm Restarts, and related guides on setting neural network learning rates.
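A minimal PyTorch sketch of this schedule using torch.optim.lr_scheduler.CosineAnnealingWarmRestarts (the model and hyperparameters are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# T_0: epochs until the first restart; T_mult: each cycle is T_mult times longer.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-5)

for epoch in range(70):
    # ... run one training epoch here ...
    scheduler.step()                                # lr decays along a cosine, then restarts
    print(epoch, optimizer.param_groups[0]["lr"])
```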

Weight initialization

Compared with the other tricks here, explicit weight initialization is one we use less often.

Why is that?

The reason is simple: most of the time we use pre-trained models whose weights were already trained on large datasets, so there is no need to initialize them ourselves. Initialization only comes up in domains with no pre-trained models available, or when we re-initialize the last few fully connected layers of a network ourselves.

So what’s your favorite initial weight algorithm?

Of course, kaiming_normal or xavier_normal.

See the corresponding papers:

  • Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
  • Understanding the Difficulty of Training Deep Feedforward Neural Networks
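A minimal PyTorch sketch of applying these initializers to layers we set up ourselves (the toy model is only for illustration):

```python
import torch.nn as nn

def init_weights(module):
    # Kaiming (He) init suits ReLU layers; Xavier suits tanh/linear outputs.
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Linear):
        nn.init.xavier_normal_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(                       # toy model, assumes 3x32x32 inputs
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)
model.apply(init_weights)                    # applies init_weights to every submodule
```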

Multi-scale training

Multi-scale training is a direct and effective method. By feeding the network images at different scales, and thanks to the way convolution and pooling handle varying input sizes, the network can learn features of images at different resolutions, which improves performance.
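A rough sketch of multi-scale training in PyTorch; the scale list and switching interval are made up in the spirit of YOLOv2, not its exact settings, and for detection the targets/boxes would need rescaling as well:

```python
import random
import torch
import torch.nn.functional as F

scales = [320, 416, 512, 608]        # assumed set of training resolutions

def train_one_epoch(model, loader, optimizer, loss_fn, resize_every=10):
    size = random.choice(scales)
    for step, (images, targets) in enumerate(loader):
        if step % resize_every == 0:             # switch resolution periodically
            size = random.choice(scales)
        images = F.interpolate(images, size=(size, size),
                               mode="bilinear", align_corners=False)
        optimizer.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        optimizer.step()
```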

It can also help against overfitting. When the image dataset is not particularly large, we can first train on small image sizes, then increase the size and train the same model again. This idea is also mentioned in the YOLOv2 paper:

Note that multi-scale training is not suitable for every deep learning application. It can be viewed as a special kind of data augmentation, one that changes the image size. If possible, visualize the multi-scaled images and check whether rescaling harms the overall information in the image; if it does, training on them directly will mislead the network and lead to poor results.

About optimization algorithms

In principle different optimizers suit different tasks, but in practice the ones we reach for most are Adam and SGD with momentum.

Why do these two work?

Experience, mostly ~
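For reference, a minimal sketch of the two usual choices in PyTorch; the hyperparameters are common starting points, not recommendations:

```python
import torch

model = torch.nn.Linear(10, 2)   # placeholder model

# SGD with momentum: often generalizes well but needs more learning rate tuning.
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam: adaptive per-parameter step sizes, usually a robust starting point.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```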

Try to overfit a small dataset

This is a classic sanity check that many people skip; give it a try.

Turn off regularization, dropout and data augmentation, take a small slice of the training set, and let the network train on it for a few epochs. Make sure the loss can be driven to (near) zero; if it cannot, something is probably wrong.
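A minimal sketch of this sanity check; the subset size, epoch count and optimizer are placeholders:

```python
import torch
from torch.utils.data import DataLoader, Subset

def sanity_overfit(model, dataset, loss_fn, n_samples=32, epochs=200, lr=1e-3):
    """Train on a tiny fixed subset; the loss should collapse toward zero."""
    tiny = Subset(dataset, range(n_samples))
    loader = DataLoader(tiny, batch_size=n_samples, shuffle=False)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    print(f"final loss on {n_samples} samples: {loss.item():.4f}")
    return loss.item()    # expect something close to zero
```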

Cross Validation

In Li Hang’s statistical method, cross validation is often used for insufficient data in practical applications, and the basic purpose is to reuse data. In ordinary times, we divide all the data into training set and verification set, which is a simple cross validation, which can be called 1-fold cross validation. Note that the cross-validation is unrelated to the test set, which is used to measure the criteria of our algorithm and does not participate in the cross-validation.

Cross validation applies only to training and validation sets.

Cross validation is a technique strongly recommended in Kaggle competitions. We often use 5-fold cross validation: split the training set into 5 parts, take one part as the validation set and the rest as the training set, and cycle through 5 times. There is also leave-one-out cross validation, which is N-fold cross validation where N is the size of the dataset; it is only practical for small datasets and is rarely used otherwise because of the computational cost.
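A minimal sketch of 5-fold cross validation using scikit-learn's KFold; the data and the scoring step are placeholders:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(100, 16)            # placeholder features
y = np.random.randint(0, 2, size=100)  # placeholder labels

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    # ... train a fresh model on (X_train, y_train), evaluate on (X_val, y_val) ...
    score = 0.0                         # replace with the fold's validation metric
    scores.append(score)
print("mean CV score:", np.mean(scores))
```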

Andrew Ng also touches on this in his talk The Nuts and Bolts of Building Applications using Deep Learning.

About the dataset

If the dataset is heavily imbalanced, say our task is ship detection and only 30,000 of the 100,000 images contain ships while the rest contain none, training on it directly will give poor results. One remedy is to first train on a subset consisting only of images with ships. Then run the trained model on the untrained portion (the images without ships), pick out the false positives (images without ships in which the model nonetheless "detects" ships), add them to the previous training set, and train again.

Data augmentation

Data augmentation is a well-worn topic. Why do we need so much data, and why do we need augmentation? See this article: Why does deep learning need so much data?

Here is just a brief rundown of the augmentation transforms we commonly use. Most of the image augmentation we apply consists of random rotation, horizontal flipping, Gaussian blur, scale changes, stretching and other such operations. These transforms suit most tasks, but for specific tasks some specialized augmentation techniques can do better than the generic ones.
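A minimal torchvision sketch of the common transforms just mentioned; the parameter values are illustrative only:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),                 # random rotation
    transforms.RandomHorizontalFlip(p=0.5),                # horizontal flip
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # scale change / stretching
    transforms.ToTensor(),
    transforms.GaussianBlur(kernel_size=3),                # Gaussian blur on the tensor
])
# pass train_transform to the Dataset so each image is augmented on the fly
```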

As an example of a task-specific case, consider night scenes or unusual lighting:

Note, however, that not every augmentation improves the model's ability to generalize.

In addition, some augmentations are destructive to the original image and can lead the network to learn the wrong information. This is a problem we tend to overlook, but it matters just as much. On this point, see the discussion of why fastai's image augmentation is comparatively well designed.

TTA (Test Time Augmentation)

I first came across this idea in the fastai course. It is not used during training but during validation and testing: apply several random augmentations to each image to be processed, run the model on each augmented version, and average the predictions.

The principle is similar to model averaging, in which inference accuracy is improved at the expense of inference speed.
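A rough sketch of TTA for a single image; the augmentation pipeline and the averaging are placeholders, not fastai's actual implementation:

```python
import torch

def predict_tta(model, image, augmentations, n_rounds=5):
    """Average predictions over several randomly augmented copies of one image."""
    model.eval()
    preds = []
    with torch.no_grad():
        preds.append(torch.softmax(model(image.unsqueeze(0)), dim=1))   # original
        for _ in range(n_rounds):
            augmented = augmentations(image)          # tensor-compatible random transform
            preds.append(torch.softmax(model(augmented.unsqueeze(0)), dim=1))
    return torch.stack(preds).mean(dim=0)             # averaged class probabilities
```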

Of course, the technique has its ups and downs; on the satellite imagery dataset I ran myself, TTA actually lowered accuracy by 0.03 percentage points.

Conclusion

To return to the opening point: training neural networks really is a bit like alchemy. All the tricks above only raise the odds of a successful "brew"; whether the alchemy ultimately succeeds still depends on whether the algorithm we designed is sound.

Reference articles:

  • www.cnblogs.com/nowgood/p/H…
  • towardsdatascience.com/kaggle-plan…

This article is from the Oldpan blog; you are welcome to visit.
