
Deep neural networks can solve complex tasks such as natural language processing, computer vision and speech synthesis. Improving the performance of deep neural networks is as important as understanding how they work. This article explains the main terms and methods used to improve neural networks.

Bias and variance

Bias and variance are two basic terms that reflect a network's performance on the training set and the test set. They can be explained easily and intuitively with the following classification examples, where the blue line represents the decision boundary computed by the neural network.

1. The picture on the far left shows a neural network with high bias. The network has learned an overly simple hypothesis, so it cannot fit the training data correctly: examples from different classes cannot be distinguished, and performance is poor on both the training and test sets. We say this network is underfitting.

2. The picture on the right shows a neural network with high variance. The network has learned a very complex hypothesis, so it fails to generalize: it performs well on the training data but poorly on the test data. We say this network is overfitting.

3. The middle picture shows a "just right" neural network. It has learned an ideal hypothesis that ignores outliers and generalizes well. This is the kind of network we should aim for.
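The three cases can be mimicked with a small sketch (not from the article; the polynomial degrees and thresholds are illustrative): fitting polynomials of increasing degree to noisy quadratic data produces an underfit, a "just right" fit, and an overfit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = x**2 + 0.1 * rng.standard_normal(30)      # true relation is quadratic
x_test = np.linspace(-1, 1, 101)
y_test = x_test**2

errs = {}
for degree in (1, 2, 12):                      # underfit, just right, overfit
    coeffs = np.polyfit(x, y, degree)
    errs[degree] = (
        np.mean((np.polyval(coeffs, x) - y) ** 2),           # train error
        np.mean((np.polyval(coeffs, x_test) - y_test) ** 2),  # test error
    )
    print(degree, errs[degree])
```

Degree 1 has high error on both sets (high bias); degree 12 drives the training error down while the test error stays worse than the quadratic's (high variance).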


Training methods

Now that you know what an ideal neural network looks like, how do you get there? Deal with bias first, then variance.

The first question to ask is: "Is the bias high?" If the answer is yes, try the following steps:

· Train a larger network: increase the number of hidden layers and the number of neurons per hidden layer.

· Train the network for longer. Training may be incomplete and need more iterations.

· Try different optimization algorithms, including Adam, Momentum, AdaDelta, etc.

· Repeat the steps above until the bias problem is resolved, then move on to the second question.

If the answer is no, the bias problem is solved and you should focus on variance. The second question is: "Is the variance high?" If the answer is yes, try the following steps:

· Collect more training data. More data increases the variety the network sees, which makes it harder to fit an overly complex hypothesis.

· Try regularization. More on this in the next section.

· Repeat the steps above until the variance problem is solved.

If the answer is no, that means the variance problem has been solved and the neural network is now “just right”.
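The decision procedure above can be sketched as a small function driven by training and dev-set error. This is a minimal illustration; the function name and the error thresholds are assumptions, not part of the article.

```python
def diagnose(train_error, dev_error, target=0.05, gap=0.05):
    """Heuristic bias/variance diagnosis (thresholds are illustrative)."""
    if train_error > target:
        return "high bias"       # bigger network, longer training, other optimizers
    if dev_error - train_error > gap:
        return "high variance"   # more data, regularization
    return "just right"

print(diagnose(0.20, 0.22))  # high bias
print(diagnose(0.02, 0.15))  # high variance
print(diagnose(0.02, 0.03))  # just right
```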

Regularization

Regularization is a technique that helps reduce overfitting in neural networks. Adding regularization to a network means adding a new regularization term to the loss function. The modified cost function J can be expressed mathematically as:

J = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) + (λ/2m) Σₗ ‖W^[l]‖²_F

The second term, with λ, is the regularization term; ‖W‖²_F is the squared Frobenius norm (the sum of the squares of the matrix elements). With the introduction of regularization, λ becomes a new hyperparameter that can be tuned to improve the performance of the neural network. The regularization described above is also called L2 regularization.
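As a sketch, the regularized cost can be computed in NumPy like this (the function name and arguments are illustrative, not from the article):

```python
import numpy as np

def l2_cost(unregularized_cost, weights, lam, m):
    # Add (lambda / 2m) * sum_l ||W^[l]||_F^2 to the plain cost.
    penalty = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return unregularized_cost + penalty

W1 = np.ones((2, 2))   # squared Frobenius norm = 4
W2 = np.ones((1, 2))   # squared Frobenius norm = 2
print(l2_cost(0.5, [W1, W2], lam=0.1, m=10))  # 0.5 + (0.1/20)*6 = 0.53
```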

Without regularization, the weights were updated with the rule:

W := W − α·dW

Since the modified cost function J contains a new regularization term, the gradient picks up an extra (λ/m)·W and the weights are updated as:

W := W − α·(dW + (λ/m)·W) = (1 − αλ/m)·W − α·dW

This shows that each weight is multiplied by a factor slightly less than one, which is why this form of regularization is also called weight decay. The amount of decay depends on the learning rate α and the regularization parameter λ.
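The decay effect is easy to see in code. A minimal sketch (names are illustrative): with a zero gradient, the update is pure decay and every weight shrinks by the factor (1 − αλ/m).

```python
import numpy as np

def update_weights(W, dW, alpha, lam, m):
    # W := (1 - alpha * lam / m) * W - alpha * dW
    return (1 - alpha * lam / m) * W - alpha * dW

W = np.array([[1.0, -2.0]])
dW = np.zeros_like(W)
# Here the decay factor is 1 - 0.1 * 1.0 / 10 = 0.99.
print(update_weights(W, dW, alpha=0.1, lam=1.0, m=10))  # [[ 0.99 -1.98]]
```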

Why does regularization work?

The ultimate goal of training a neural network is to minimize the cost function J, which now includes the regularization term. Now that you understand what regularization is, let's look at why it works.

First, if you increase λ, minimizing J forces the Frobenius norm to shrink and the weights to approach zero. This effectively silences many neurons, producing a simpler network: a deep network that learns complex hypotheses starts to behave like a shallow network that learns simple ones. Simpler hypotheses avoid overly complex features, reduce overfitting, and lead to a "just right" neural network.

The effect can also be explained by how neurons activate when regularization is applied. To see this, you need to understand the tanh(x) activation.

If we increase λ, the Frobenius norm becomes smaller, that is, the weights W become smaller. The pre-activation output of each layer is therefore smaller and falls in the region of the activation function near zero. In that region tanh is almost linear, so the network behaves like a shallow network: it cannot learn complex hypotheses (sharp decision curves are avoided), overfitting is reduced, and we end up with a "just right" neural network.
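A quick numerical check (illustrative, not from the article) of the near-linear region: for small inputs, tanh(x) is almost exactly x, while for larger inputs it saturates.

```python
import numpy as np

# Near zero, tanh(x) ≈ x, so a layer with small pre-activations
# behaves like a linear map; far from zero it saturates.
for x in (0.01, 0.1, 1.0, 2.0):
    print(x, np.tanh(x), abs(np.tanh(x) - x))
```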

Conversely, if λ is too small, the Frobenius norm stays large, neurons are not suppressed, layer outputs leave the linear region, and the network still overfits. If λ is too large, the network underfits. Finding the optimal value of λ is therefore key to improving the performance of the neural network.

Dropout regularization

Dropout is another regularization technique. It temporarily discards certain neurons and their connections during training. The hyperparameter keep_prob determines the probability that each neuron is kept. After neurons are dropped, the network trains on the remaining neurons. It is important to note that at test/inference time, all neurons are used to compute the output. The following example helps illustrate the concept:

```python
import numpy as np

# Define the probability that a neuron stays.
keep_prob = 0.5

# Create a dropout mask for a layer, e.g. layer 2. The mask has the
# same dimensions as the activation matrix so that the connections
# can be removed.
d2 = np.random.rand(a2.shape[0], a2.shape[1]) < keep_prob

# Zero out the activations of the dropped neurons.
a2 = np.multiply(a2, d2)

# Since some neurons are removed, scale up the remaining activations
# to avoid a weight imbalance at test time.
a2 = a2 / keep_prob
```

This variant is called inverted dropout: neurons are first dropped with probability 1 − keep_prob, and the remaining activations are then scaled up by 1/keep_prob.
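A quick check (illustrative) that the 1/keep_prob scaling preserves the expected activation, which is why no extra scaling is needed at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.ones((1000, 100))          # pretend activations, all equal to 1
keep_prob = 0.5
mask = rng.random(a.shape) < keep_prob
a_dropped = a * mask / keep_prob  # inverted dropout

# The mean activation is approximately unchanged.
print(a.mean(), a_dropped.mean())
```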


Dropout prevents neurons from relying on only a few features, so the weights get spread out. Without it, a neuron may become dependent on certain input features to determine its output. Under dropout, for different training samples, a particular neuron receives only a subset of its features as input each time. Ultimately the weights are distributed across all inputs, and the network uses all input features rather than depending on any single one, which makes it more robust. This is also called an adaptive form of L2 regularization.

keep_prob can also be set separately for each layer. Since the number of dropped neurons is inversely proportional to keep_prob, the general guideline is to give layers with many parameters (large, dense weight matrices) a relatively low keep_prob so that more neurons are dropped, and vice versa.

With dropout, the deep network mimics a shallower network during the training phase. This in turn reduces overfitting and leads to a "just right" neural network.

Early stopping

Early stopping is a training technique: stop training the neural network at an early stage to prevent it from overfitting, tracking both train_loss and dev_loss to decide when to stop.


As soon as dev_loss starts to rise, stop training. This method is called early stopping. However, it is generally not recommended, for two reasons:

1. The cost J is not minimized when training stops.

2. We are trying to reduce overfitting in a network that is not yet fully trained.

Early stopping couples these two concerns, adding complexity and potentially preventing a "just right" neural network.
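For completeness, the mechanism described above can be sketched as a patience-based loop. This is a minimal illustration; the function and parameter names are assumptions, not from the article.

```python
def train_with_early_stopping(run_epoch, dev_loss_fn, max_epochs=100, patience=5):
    """Stop when dev loss has not improved for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        run_epoch()                      # one epoch of training
        loss = dev_loss_fn()             # current dev-set loss
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                        # dev loss stopped improving
    return best

# Simulated dev losses: decrease, then rise as overfitting sets in.
losses = iter([1.0, 0.8, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0, 1.1])
print(train_with_early_stopping(lambda: None, lambda: next(losses)))  # 0.7
```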
