Alitao Department Business Team – Innovation

This article is a translation of, and supplement to, the technical article “From Perceptron to Deep Neural Nets” by Adi Chris.

As a machine learning engineer, I’ve been working on deep learning for some time. Now, having completed all of Andrew Ng’s latest Coursera deep learning courses, I’ve decided to write down some of what I know about this area in a blog post. I find that writing things down is an effective way to really learn a topic. I also hope this article will be useful to anyone who wants to get started with deep learning.

All right, let’s talk about deep learning. Oh wait, before I get directly into what deep learning or deep neural networks (DNNs) are, I want to start this article with a simple problem that I hope will give us better intuition for why we need (deep) neural networks in the first place. By the way, along with this article I’ll also be publishing code on GitHub that lets you train a deep neural network model to solve the XOR problem below.

The XOR problem

The XOR problem is this: given two binary inputs, we must predict the output of an XOR logic gate. As a reminder, the XOR function returns 1 if the two inputs are not equal and 0 otherwise. Table 1 below lists all possible inputs and outputs of the XOR function:
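Table 1. The XOR function:

  x1   x2   XOR(x1, x2)
   0    0        0
   0    1        1
   1    0        1
   1    1        0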


Now, let’s plot the data set to see what the properties of the data are.

import numpy as np
import matplotlib.pyplot as plt

def plot_data(data, labels):
    """
    Arguments:
    data   -- np.array containing the input values, shape (n_samples, 2)
    labels -- 1d np.array containing the expected labels
    """
    positives = data[labels == 1, :]
    negatives = data[labels == 0, :]
    plt.scatter(positives[:, 0], positives[:, 1],
                color='red', marker='+', s=200)
    plt.scatter(negatives[:, 0], negatives[:, 1],
                color='blue', marker='_', s=200)

positives = np.array([[1., 0.], [0., 1.]])
negatives = np.array([[0., 0.], [1., 1.]])

data = np.concatenate([positives, negatives])
labels = np.array([1., 1., 0., 0.])

plot_data(data, labels)
plt.show()


Perhaps after seeing the figure above, we might want to reconsider whether this XOR problem is really a simple one. As you can see, our data is not linearly separable, so well-known linear models such as logistic regression may not be able to classify it. To give you a clearer picture, I have plotted some decision boundaries produced by a very simple linear model:


Looking at the diagram above, it is clear that we need a classifier that works well on non-linearly separable data. An SVM with the kernel trick might be a good choice. In this article, however, we’ll build a neural network instead and look at how it can help us solve the XOR problem.

What is a neural network?

A neural network, or artificial neural network (ANN), is a very good function approximator, based loosely on the way the brain is thought to work. The diagram below shows the analogy between a biological neuron and an artificial neural network.


Without going into too much detail about biological neurons, I’ll give a high-level picture of how they process information. A neuron receives signals through its dendrites. These signals are then transmitted to the soma, or cell body, where all the incoming information is aggregated. When the sum reaches a threshold, the neuron fires, and the information is sent down the axon and then through its synapses to other connected neurons. The amount of signal transmitted between neurons depends on the strength of the connection.

This whole process is mimicked by the artificial neural network. You can think of the dendrites as the weighted inputs, with weights based on the strength of the synaptic connections. The weighted inputs are then aggregated in the artificial “cell body”. If the resulting value is greater than some threshold, the neuron “fires” and its output is passed on to other neurons. Thus, you can see that an ANN is modeled on the workings of basic biological neurons.

So, how does this neural network work?

To understand how a neural network works, let’s first look at a very simple artificial neural network called the perceptron. To me, the perceptron is one of the most elegant algorithms in machine learning. Created in the 1950s, this simple algorithm is arguably the starting point for developments as important as logistic regression, support vector machines, and even deep neural networks. So how does the perceptron work? We will use the picture below as a starting point for our discussion.


The figure above shows a perceptron with three inputs x1, x2, and x3, and a neuron unit that generates an output value. To produce the output, Rosenblatt introduced a simple rule based on the concept of weights. Weights are real numbers representing the importance of each input to the output [1]. The neuron described above generates one of two possible values, 0 or 1, determined by whether the weighted sum of the inputs is less than or greater than some threshold. Therefore, the main idea of the perceptron algorithm is to learn the values of the weights w, multiply them by the input features, and then determine whether the neuron fires. We can write this as the following mathematical expression:
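output = 0  if Σj wj·xj ≤ threshold
output = 1  if Σj wj·xj > threshold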


We can now modify the formula above in two ways. First, we can write the weighted sum as the dot product of two vectors w (weights) and x (inputs), where w · x ≡ Σj wj·xj. Second, we can move the threshold to the other side of the inequality and replace it with a new variable called the bias b, where b ≡ −threshold. With these modifications, our perceptron rule can be rewritten as:
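output = 0  if w · x + b ≤ 0
output = 1  if w · x + b > 0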


Now, when we put everything back into the perceptron architecture, we have the complete architecture of a single perceptron, as follows:


A typical single-layer perceptron uses the Heaviside step function as the activation function to convert the resulting value to 0 or 1, thereby classifying the input values. As shown in the figure below, the Heaviside step function outputs zero for a negative argument and one for a positive argument.


The Heaviside step function is particularly useful in classification tasks where the input data is linearly separable. But recall that our goal is to find a classifier that works well on non-linearly separable data. Therefore, neither the single-layer perceptron nor the Heaviside step function will do here. As you will see in the next section, we will need layers of multiple perceptrons together with a nonlinear activation function.

Specifically, there are two main reasons why we can’t use the Heaviside step function:

  1. Currently, one of the most effective ways to train a multilayer neural network is to use gradient descent in combination with backpropagation (we will discuss both shortly). The backpropagation algorithm requires a differentiable activation function, but the Heaviside step function is not differentiable at x = 0, and its derivative is 0 everywhere else. This means that gradient descent cannot make progress in updating the weights.
  2. Recall that the main goal of a neural network is to learn values of the weights and biases such that the model produces predictions as close to the true values as possible. For this, as with many optimization problems, we want a small change in the weights or biases to cause only a small corresponding change in the network output. A function that can only output 0 or 1 (yes or no) will not help us achieve this.

Activation Functions

The activation function is one of the most important components of a neural network. In particular, nonlinear activation functions are essential for at least three reasons:

  1. They help neurons learn and represent really complex things.
  2. They introduce nonlinearity into our network.
  3. We want small changes in the weights to cause only correspondingly small changes in the network output.

We have already seen that the Heaviside step function is one example of an activation function, but in this section we will explore several nonlinear activation functions commonly used in the deep learning community. By the way, activation functions, including the strengths and weaknesses of each, are explained in more depth in two excellent articles by Avinash Sharma and Karpathy.

Sigmoid Function

The sigmoid function, also known as the logistic function, produces an output in the range (0, 1) for any given input. The sigmoid can be written as:
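σ(z) = 1 / (1 + e^(−z))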



The figure above plots the shape of the sigmoid function. As you can see, it is like a smooth version of the Heaviside step function. However, the sigmoid is preferred over the step function for a number of reasons:

  1. It is nonlinear in nature.
  2. Instead of outputting only 0 or 1, it can output values like 0.67, which, as you might guess, can be interpreted as probabilities.
  3. Related to point (2), the output of the sigmoid is bounded, which means the activations cannot blow up.

However, the sigmoid activation function has a major disadvantage: vanishing gradients. As you can see from the figure above, when the input z is very small (moving toward −∞), the output of the sigmoid approaches zero; conversely, when z is very large (moving toward +∞), the output approaches 1. What does this mean? In these regions the gradient is very small, or even vanishes. Vanishing gradients are a big problem, especially in deep learning, where we stack many layers of such nonlinearities together: even a large change in the parameters of the first layer may barely change the output. In other words, the network refuses to learn, and training generally gets slower and slower, especially with gradient descent algorithms. Another disadvantage of the sigmoid is that computing the exponential can be expensive in practice. Arguably, activation functions are only a small part of the computation in a deep network compared to the matrix multiplications and/or convolutions, so this may not be a big problem, but I think it’s worth mentioning.

Tanh Function

Tanh, or the hyperbolic tangent, is another activation function commonly used in deep neural networks. It is very similar in nature to the sigmoid, in that it squashes its input into a nicely bounded range. Specifically, given any value, tanh produces an output between -1 and 1:
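tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))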



As mentioned earlier, the tanh activation function has properties similar to the sigmoid’s. It is nonlinear and bounded to a range, in this case (-1, 1). Not surprisingly, tanh also shares the sigmoid’s disadvantages: it suffers from vanishing gradients, and, as you can see from the mathematical formula, it requires computing an exponential, which is computationally inefficient.

ReLU (Rectified Linear Unit)

This is ReLU, an activation function that you might not expect to perform any better than sigmoid and tanh, but in practice it does! In fact, deep learning lectures now commonly default to using ReLU nonlinearities. ReLU has very nice mathematical properties and is very computationally efficient. Given an input value, ReLU outputs 0 if the input is less than 0; otherwise the output equals the input. Mathematically, the ReLU function has the form:
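f(z) = max(0, z)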



Now you might ask: “Isn’t this a linear function? Why do we call ReLU a nonlinear function?” Let’s first understand what a linear function is. Wikipedia says:

In linear algebra, a linear function is a map f between two vector spaces that preserves vector addition and scalar multiplication:

f(x + y) = f(x) + f(y)
f(ax) = a·f(x)

Given the definition above, we can see that max(0, x) is piecewise linear: it is linear if we restrict its domain to (−∞, 0] or to [0, +∞), but it is not linear over its entire domain. For example:

f(−1) + f(1) = 0 + 1 = 1, but f(−1 + 1) = f(0) = 0, so f(−1) + f(1) ≠ f(0).

Therefore, we now know that ReLU is a nonlinear activation function with nice mathematical properties that is more computationally efficient than sigmoid or tanh. In addition, ReLU is known to alleviate the vanishing gradient problem. ReLU, however, has one major drawback, known as “dying ReLU”: a phenomenon in which a neuron in the network dies permanently because it stops activating in the forward pass.

More precisely, this problem occurs when a neuron produces an activation of zero in the forward pass, which makes the gradient of its weights zero. As a result, when we backpropagate, the weights of that neuron are never updated, and that particular neuron is never activated again. I strongly encourage you to watch this lecture video, which goes into more depth on this particular issue and how to avoid dying ReLU. Please go and check it out!

Oh, one other thing I should mention about ReLU. You may have noticed that, unlike sigmoid and tanh, ReLU does not bound its output value. While this is generally not a big problem, it can become troublesome in other variants of deep learning models such as recurrent neural networks (RNNs). Specifically, the unbounded values generated by ReLU can make computations inside an RNN blow up to infinity if the weights are not kept reasonable. As a result, learning can be very unstable, because a slight shift of the weights in the wrong direction during backpropagation can blow up the activations during the forward pass.

How do neural networks predict and learn?


The architecture described in the figure above is called a multi-layer perceptron (MLP). As the name implies, in an MLP we simply stack multiple perceptrons into several layers. The network above has three layers: an input layer, a hidden layer, and an output layer. However, in the deep learning and neural network community, people do not refer to this network as a three-layer neural network; usually we count only the hidden layers, or the hidden layers plus the output layer, so this is a two-layer neural network. A hidden layer is simply any layer that is neither the input layer nor the output layer. Now, as you might guess, the term deep learning simply means that we have “more” hidden layers :). So how does a neural network produce predictions?

A neural network generates a prediction by passing the inputs through all of its layers, up to the output layer. This process is called feedforward. As you can see from the figure above, we “feed” the network the input x, compute the activations, and pass them layer by layer until we reach the output layer. In supervised tasks such as classification, we usually use a sigmoid activation in the output layer, because its output can be interpreted as a probability. In the figure above, the output layer produces a value of 0.24; since this value is less than 0.5, we say that the prediction y_hat is zero. Then, as in a typical classification task, we have a cost function that measures how close the model’s predictions are to the true labels. In practice, training a neural network simply means minimizing this cost.
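We can define the cost function as follows. A standard choice for a sigmoid output layer, and the one assumed for the rest of this article, is the binary cross-entropy over m training examples:

J(w, b) = −(1/m) Σi [ yi · log(y_hat_i) + (1 − yi) · log(1 − y_hat_i) ]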


Therefore, the goal is to find some combination of w and b such that our cost J is as small as possible. To do this, we rely on two important algorithms: gradient descent and backpropagation.
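To make the feedforward pass and this cost concrete, here is a minimal numpy sketch for a 2-3-1 network like the one in the figure. It is only an illustration, assuming sigmoid activations in both layers and the cross-entropy cost above; the names (sigmoid, feedforward, W1, b1, W2, b2) are mine, not from the article’s GitHub code.

import numpy as np

def sigmoid(z):
    # squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(X, W1, b1, W2, b2):
    # X: (m, 2) inputs; W1: (2, 3); b1: (3,); W2: (3, 1); b2: (1,)
    z1 = X.dot(W1) + b1   # linear step of the hidden layer
    a1 = sigmoid(z1)      # hidden activations
    z2 = a1.dot(W2) + b2  # linear step of the output layer
    y_hat = sigmoid(z2)   # predicted probability, shape (m, 1)
    return z1, a1, z2, y_hat

def cost(y_hat, y):
    # binary cross-entropy cost J(w, b), averaged over m examples
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))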

Gradient descent algorithm

Those of you who have been doing machine learning for a while probably already know about gradient descent. Training a neural network is not much different from training any other machine learning model with gradient descent. The only significant difference is the nonlinearity in our network, which makes our cost function non-convex. To build intuition, though, let’s pretend that our cost function is convex (a big bowl), as shown below:


In the figure above, the horizontal axes represent the space of our parameters (the weights and biases), while the cost function J(w, b) is a surface above them. The red circle is our initial cost, at some initial values of the weights and biases. To minimize the cost, we now know we have to move in the steepest downhill direction. But how do we know which way to go? Should we increase or decrease the parameter values? We could do a random search, but that would take a long time and obviously be computationally expensive. There is a better way to figure out where to move the learnable parameters. Calculus tells us that, at a given point, the gradient vector points in the direction of steepest ascent. Therefore, we will use the gradient of the cost function with respect to the weights and biases. Now, let’s simplify things by looking only at the cost with respect to the weights, as shown in the figure below.



The figure above plots the value of our cost function against the value of the weight. The black circle is our initial cost. Recall that the gradient of a function with respect to a variable can be positive, zero, or negative; a negative slope means the curve slopes downward, and vice versa. Now, since our goal is to minimize the cost, we need to move the weight in the direction opposite to the gradient of the cost function. This update process can be written as follows:
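w := w − α · (∂J/∂w)

(The same rule applies to the bias b.)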


Here α is the step size, or learning rate, which we multiply by the partial derivative of the cost with respect to the learnable parameter. So what does α do? The gradient tells us the direction in which the function rises most steeply, but it does not tell us how far we should move in that direction. That is what α is for: it is a hyperparameter that controls the step size, that is, how far we move in a given direction. Choosing a good value for the learning rate is important, because it greatly affects two things: the speed of the learning algorithm and whether we converge to an optimum at all. In practice, you may want to use adaptive learning rate algorithms such as Momentum, RMSProp, Adam, and so on. A researcher at AYLIEN has written a very nice article on various optimization and adaptive learning rate algorithms.

Backpropagation

In the previous section, we discussed the gradient descent algorithm, the optimization algorithm we use as the learning algorithm for deep neural networks. Recall that using gradient descent means we need to find the gradient of the cost function with respect to our learnable parameters w and b. In other words, we need to compute the partial derivatives of the cost function with respect to w and b. However, if we look at the cost function J (as in Figure 12 below), there is no direct relationship between J and either w or b.


Only when we trace back from the output layer (the layer that produced y_hat) to the input layer do we see the indirect relationship between J and both w and b, as shown in Figure 13 below:


Now you can see that, in order to find the gradients of the cost with respect to w and b, we need the partial derivatives of the cost with respect to all the intermediate variables, such as a (the activation) and z (the linear computation wx + b). This is where backpropagation comes in. Backpropagation is essentially a repeated application of the chain rule of calculus, and it is probably the most efficient way to find the gradients of the cost with respect to all the learnable parameters of a neural network. In this article, I will walk you through computing the gradient of the cost function J with respect to w2, the weights of the second layer of the network. For simplicity, we will use the architecture shown in Figure 8, which has one hidden layer with three hidden neurons.
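Concretely, with z2 = w2 · a1 + b2 and y_hat = σ(z2), the chain rule factors this gradient into pieces we can compute one at a time:

∂J/∂w2 = (∂J/∂y_hat) · (∂y_hat/∂z2) · (∂z2/∂w2)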


To find the rate of change of y_hat with respect to z2, we need to differentiate the sigmoid activation function with respect to z.
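For the sigmoid, this derivative takes a particularly convenient form:

∂y_hat/∂z2 = σ(z2) · (1 − σ(z2)) = y_hat · (1 − y_hat)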


Now, once you have the value of the partial derivative of J with respect to w2, you can update w2 using the formula shown in Figure 11 in the previous section. Basically, we repeat the same computation for all the weights and biases until we reach the smallest possible cost.
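To tie backpropagation and gradient descent together, here is a minimal numpy sketch of one training step for the same 2-3-1 network. It reuses sigmoid and feedforward from the earlier sketch and assumes the binary cross-entropy cost, under which the output-layer error conveniently simplifies to y_hat − y; again, this is an illustration, not the article’s actual GitHub code.

def backprop_step(X, y, W1, b1, W2, b2, alpha=0.1):
    # forward pass (feedforward is defined in the earlier sketch)
    z1, a1, z2, y_hat = feedforward(X, W1, b1, W2, b2)
    m = X.shape[0]
    # output layer: with sigmoid + cross-entropy, dJ/dz2 = (y_hat - y) / m
    dz2 = (y_hat - y) / m
    dW2 = a1.T.dot(dz2)                  # dJ/dW2, shape (3, 1)
    db2 = dz2.sum(axis=0)                # dJ/db2
    # hidden layer: chain rule back through W2 and the sigmoid derivative
    dz1 = dz2.dot(W2.T) * a1 * (1 - a1)  # dJ/dz1, shape (m, 3)
    dW1 = X.T.dot(dz1)                   # dJ/dW1, shape (2, 3)
    db1 = dz1.sum(axis=0)                # dJ/db1
    # gradient-descent updates: w := w - alpha * dJ/dw
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2

Repeating this step until the cost stops decreasing is all that “training” means here.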

Solving the XOR problem with a neural network

Great! I think we have now covered almost everything we need to build a neural network model, or even a deep learning model, that will help us solve the XOR problem. While writing this article, I built a simple neural network model with a single hidden layer containing a varying number of hidden neurons. The example network I used is shown below. I have also plotted some of the decision boundaries my model generated with different numbers of neurons. As you’ll see, having more neurons makes our model more complex and thereby creates more complex decision boundaries.



But which is the better option: more neurons, or a deeper network, meaning more layers? In theory, the main benefit of a very deep network is that it can represent very complex functions. Specifically, by using a deeper architecture we can learn features at many different levels of abstraction, from simple ones such as edges (at the lower layers) to very complex features (at the deeper layers).

However, using deeper networks is not always helpful in practice. The biggest problem we encounter when training them is the vanishing gradient problem: in very deep networks the gradient signal often goes to zero quickly, which makes gradient descent unbearably slow.

More specifically, during gradient descent, as we backpropagate from the last layer to the first, we multiply by the weight matrix at each step, so the gradient can decrease exponentially to zero or, in rare cases, grow exponentially and “explode” to very large values.

So, to end this lengthy article, here are the key points, briefly summarized:

  • Intuitively, neural networks introduce nonlinearity into our models and can be used to classify complex, non-linearly separable data.

  • The perceptron is an elegant algorithm that underpins many of the most advanced algorithms in machine learning, including deep learning.

  • Intuitively, deep learning simply means using neural networks with more hidden layers. Of course, there are many variations, such as convolutional neural networks, recurrent neural networks, and so on.

  • Activation functions are a very important part of neural networks, and yes, you should know about them.

  • So far, gradient descent combined with backpropagation is the best recipe we have for training (deep) neural networks.

  • Having more hidden layers does not necessarily improve the performance of our model; in fact, deeper models suffer from the well-known vanishing gradient problem.


