Eight neural network architectures every machine learning practitioner should know

Earlier we shared a simple visualization of what a neural network is.

Today we continue with eight important neural network architectures and a brief explanation of their principles and scope of application.

Among learning resources for machine learning, Geoffrey Hinton’s course Neural Networks for Machine Learning is required viewing for many. Hinton is widely regarded as the godfather of deep learning, and the course covers a great deal of ground. Programmer James Le took the course and shared his study notes, which summarize the eight neural network architectures Hinton explains. If you are studying machine learning or just getting started, don’t miss this summary. If you like this kind of study note, bookmark it and give it a thumbs up, and we will publish more articles like it.

In general, neural network architectures fall into three categories:

One: feedforward neural network

Feedforward networks are the most common type of neural network in practical applications. The first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. If there is more than one hidden layer, we call the network “deep”. These networks compute a series of transformations that change the similarities between cases. The activities of the neurons in each layer are a nonlinear function of the activities in the layer below.
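As a minimal sketch of that layer-by-layer computation, the snippet below pushes an input through two layers. The tanh nonlinearity, the function names and the toy weights are all assumptions chosen for illustration, not something the course prescribes.

```python
import math

def layer(inputs, weights, biases):
    """One layer: each unit applies tanh to a weighted sum of the layer below."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Feed the input through each (weights, biases) pair in turn."""
    activity = x
    for weights, biases in layers:
        activity = layer(activity, weights, biases)
    return activity

# Hypothetical toy network: 2 inputs -> 3 hidden units -> 1 output.
toy_layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
output = forward([1.0, 2.0], toy_layers)
```

Adding more (weights, biases) pairs to the list makes the network deeper without changing the forward function.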

Two: recurrent network

A recurrent network has directed cycles in its connection graph, which means that by following the arrows you can sometimes get back to where you started. Recurrent networks can have complicated dynamics, which makes them hard to train. They are more biologically realistic than feedforward networks.

There is currently a lot of interest in finding efficient ways to train recurrent networks. Recurrent neural networks are a very natural way to model sequential data. They are equivalent to very deep networks with one hidden layer per time step, except that they use the same weights at every time step and receive input at every time step. They can retain information in their hidden state for a long time, but it is very hard to train them to use this potential.
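The “deep network with shared weights” view can be sketched as a loop over time steps. This is only an illustration: the scalar input, the tanh units and the names rnn_forward, w_in and w_rec are assumptions made here.

```python
import math

def rnn_forward(inputs, w_in, w_rec, h0):
    """Run a tiny recurrent net over a sequence and return every hidden state.

    The same w_in and w_rec are reused at every time step, which is what
    makes the unrolled network a deep net with tied weights.
    """
    h = h0
    states = []
    for x in inputs:  # one "layer" of the unrolled network per time step
        h = [math.tanh(w_in[j] * x + sum(w_rec[j][k] * h[k] for k in range(len(h))))
             for j in range(len(h))]
        states.append(h)
    return states

# Hypothetical weights for two hidden units driven by a scalar input.
w_in = [0.5, -0.3]
w_rec = [[0.1, 0.2], [-0.2, 0.1]]
states = rnn_forward([1.0, 0.0, 0.0], w_in, w_rec, h0=[0.0, 0.0])
```

Even though the input is zero after the first step, the later hidden states are nonzero: the hidden state carries information forward in time.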

Three: symmetrically connected networks

These networks are like recurrent networks, but the connections between units are symmetric (they have the same weight in both directions). Symmetric networks are much easier to analyze than recurrent networks. They are also more restricted in what they can do, because they obey an energy function. Symmetrically connected networks without hidden units are called “Hopfield networks”, and those with hidden units are called “Boltzmann machines”.

Let’s take a look at the eight neural network architectures in detail:

1. Perceptron

Perceptrons, the first generation of neural networks, were simple computational models of a single neuron. Frank Rosenblatt popularized them in the early 1960s, and they appeared to have a very powerful learning algorithm. In 1969, Minsky and Papert published a book called Perceptrons that analyzed what perceptrons could do and showed their limitations. Many people thought these limitations applied to all neural network models. However, the perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.

In the standard paradigm for statistical pattern recognition, we first convert the raw input vector into a vector of feature activations, using hand-written programs based on common sense to define the features. Next, we learn how to weight each feature activation to get a single scalar quantity. If this quantity is above some threshold, we decide that the input vector is a positive example of the target class.
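The weighting-and-threshold step can be sketched with the classic perceptron learning rule. Everything here is illustrative: the threshold is folded into a bias term, the function names are made up, and logical AND stands in for a real set of hand-coded features.

```python
def predict(weights, bias, features):
    """Return 1 if the weighted sum of feature activations exceeds the threshold."""
    total = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if total > 0 else 0

def train_perceptron(examples, n_features, epochs=20):
    """Perceptron rule: nudge the weights only when a prediction is wrong."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for features, target in examples:
            error = target - predict(weights, bias, features)  # -1, 0 or +1
            weights = [w + error * f for w, f in zip(weights, features)]
            bias += error  # the bias plays the role of a negative threshold
    return weights, bias

# Toy task (assumed for illustration): the target class is logical AND.
and_examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_examples, n_features=2)
```

Because AND is linearly separable, the rule settles on weights that classify all four cases correctly within a few passes.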

The standard perceptron architecture is a simple feedforward network: inputs are fed into the neuron, processed, and turned into an output. In the usual diagram, the network is read from the bottom up: inputs come in at the bottom and the output comes out at the top.

However, perceptrons do have limitations. If you are allowed to choose the features by hand, and you use enough of them, you can do almost anything. For binary input vectors, we can have a separate feature unit for each of the exponentially many binary vectors, and so we can make any possible discrimination on binary input vectors. But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.

This result is devastating for perceptrons, because the whole point of pattern recognition is to recognize patterns despite transformations such as translation. Minsky and Papert’s “group invariance theorem” says that the part of a perceptron that learns cannot learn to do this if the transformations form a group. To deal with such transformations, a perceptron needs multiple feature units to recognize transformations of informative sub-patterns. So the tricky part of pattern recognition must be solved by the hand-coded feature detectors, not by the learning procedure.

Networks without hidden units are very limited in the input-output mappings they can learn to model. Adding more layers of linear units does not help: the result is still linear. Fixed output nonlinearities are not enough either. What we need is multiple layers of adaptive, nonlinear hidden units. But how can we train such networks? We need an efficient way to adapt all the weights, not just those of the last layer. This is hard. Learning the weights going into hidden units is equivalent to learning features, and that is difficult because nobody tells us directly what the hidden units should do.
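The claim that stacked linear layers are still linear can be checked directly: two weight matrices applied in sequence give the same output as their single matrix product. The matrices below are arbitrary values picked for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def apply(matrix, vector):
    """Apply a linear layer (no bias, no nonlinearity) to a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

w1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 linear hidden units
w2 = [[0.5, -1.0, 2.0]]                      # 3 hidden units -> 1 output
x = [4.0, -2.0]

two_layer_output = apply(w2, apply(w1, x))
collapsed = matmul(w2, w1)                   # a single equivalent 1x2 weight matrix
one_layer_output = apply(collapsed, x)
```

Both routes give the same number, so the hidden layer adds nothing unless its units are nonlinear.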

2. Convolutional neural network

Machine learning research has long focused heavily on object detection, and an earlier article here covered how to extract image features with a convolutional neural network.

There are several factors that make it difficult to identify objects:

Segmentation: Real scenes are cluttered with other objects. It is hard to tell which pieces go together as parts of the same object, and parts of an object can be hidden behind other objects.
Lighting: The intensities of the pixels are determined as much by the lighting as by the objects.
Deformation: Objects can deform in a variety of non-affine ways. A handwritten digit, for example, can have a large loop or just a cusp.
Affordances: Object classes are often defined by how the objects are used. Chairs, for example, are things designed for sitting on, so they come in a wide variety of physical shapes.
Viewpoint: Changes in viewpoint cause changes in the image that standard learning methods cannot cope with. Information hops between input dimensions (that is, pixels).

Imagine a medical database in which a patient’s age sometimes hops to the input dimension that normally codes for weight. To apply machine learning, we would first want to eliminate this dimension hopping.

The replicated feature approach is currently the dominant way for neural networks to solve the object detection problem. It uses many different copies of the same feature detector at different positions. Detectors could also be replicated across scale and orientation, but that is tricky and expensive. Replication greatly reduces the number of free parameters that have to be learned. The approach uses several different feature types, each with its own map of replicated detectors, which also allows each patch of the image to be represented in several ways.

So what do replicated feature detectors achieve?

Equivariant activities: Replicated features do not make the neural activities invariant to translation; the activities are equivariant.
Invariant knowledge: If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.
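A one-dimensional toy version makes the first point concrete: when the same detector is applied at every position, shifting the input shifts the response by the same amount rather than leaving it unchanged. The two-weight edge detector and the signals below are hypothetical.

```python
def replicated_detector(signal, kernel):
    """Apply the same small detector (shared weights) at every position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1.0, -1.0]                     # hypothetical edge detector, 2 shared weights
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]       # same pattern, two positions to the right

response = replicated_detector(signal, kernel)
shifted_response = replicated_detector(shifted, kernel)
```

Here shifted_response is response moved two positions to the right. Only two weights are learned no matter how long the signal is, and taking the maximum over positions (a crude form of pooling) gives the same value for both inputs.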

In 1998, Yann LeCun and colleagues developed a very good recognizer for handwritten digits called LeNet. It used backpropagation in a feedforward network with many hidden layers, many maps of replicated units in each layer, pooling of the outputs of nearby replicated units, a wide network that could cope with several characters at once even if they overlapped, and a clever way of training a complete system, not just a recognizer. This approach was later formalized under the name convolutional neural network. A fun fact: this network was used to read about 10% of the checks in North America.

Convolutional neural networks can be used for all kinds of object recognition tasks, from handwritten digits to 3D objects. However, recognizing real objects in color photographs downloaded from the Internet is much more complicated than recognizing handwritten digits: there are a hundred times as many classes and a hundred times as many pixels, each image is a two-dimensional picture of a three-dimensional scene, scenes are cluttered and need to be segmented, and each image contains multiple objects. So will the same type of convolutional neural network work?

Then came the famous ImageNet image recognition competition, built on a huge data set with about 1.2 million high-resolution training images. Test images are presented with no initial annotation (that is, no segmentation or labels), and algorithms must produce labels specifying which objects are present in the images. Leading computer vision groups from Oxford, INRIA, XRCE and other institutions tried some of the best existing computer vision methods on this data set. Typically, computer vision systems were complicated multi-stage systems whose early stages were hand-tuned by optimizing a few parameters.

The winner of the 2012 competition, Alex Krizhevsky, developed a very deep convolutional neural network now known as AlexNet. Its architecture had seven hidden layers, not counting some max-pooling layers. The early layers were convolutional and the last two layers were fully connected. The activation functions in every hidden layer were rectified linear units, which train much faster and are more expressive than logistic units. The network also used competitive normalization to suppress hidden activities when nearby units have stronger activities, which helps with variations in intensity.

A few techniques significantly improved the generalization of this neural network:

Train on random 224 × 224 patches taken from the 256 × 256 images, together with their horizontal reflections, to get more training data. At test time, combine the opinions from 10 different patches: the four corners plus the center, each with its horizontal reflection.
Use dropout to regularize the weights in the fully connected layers, which contain most of the parameters. Dropout means that half of the hidden units in a layer are randomly removed for each training example. This stops hidden units from relying too much on other hidden units.
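Both tricks are easy to sketch. The helper names below are hypothetical, images are plain 2D lists, and the dropout function only shows the random masking during training (any rescaling at test time is left out), so treat this as an illustration rather than the original implementation.

```python
import random

def ten_crop_boxes(image_size=256, crop_size=224):
    """Return (top, left, flipped) for 4 corners + center, each with and without a flip."""
    far = image_size - crop_size          # 32 for the sizes quoted above
    mid = far // 2                        # 16: offset of the center crop
    corners_and_center = [(0, 0), (0, far), (far, 0), (far, far), (mid, mid)]
    return [(top, left, flipped)
            for flipped in (False, True)
            for top, left in corners_and_center]

def crop(image, top, left, size, flipped=False):
    """Cut a size x size patch out of a 2D list and optionally mirror it horizontally."""
    rows = [row[left:left + size] for row in image[top:top + size]]
    return [row[::-1] for row in rows] if flipped else rows

def dropout(activities, rng, drop_prob=0.5):
    """Zero each hidden activity independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activities]

boxes = ten_crop_boxes()
tiny_image = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = crop(tiny_image, 1, 1, 2)
mirrored = crop(tiny_image, 1, 1, 2, flipped=True)
dropped = dropout([1.0] * 1000, random.Random(0))
```

At test time, the network’s predictions on the ten patches described by boxes would be averaged.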

As for hardware, Alex used a very efficient implementation of convolutional networks on two Nvidia GTX 580 GPUs. GPUs are very good at matrix multiplication and also have very high memory bandwidth. This allowed him to train the network in about a week and made it quick to combine the results from 10 patches at test time. As chips get cheaper and data sets get bigger, big neural networks will improve faster than older computer vision systems.

3. Recurrent neural network

To understand recurrent neural networks, we need a brief overview of sequence modeling. When applying machine learning to sequences, we often want to turn an input sequence into an output sequence that lives in a different domain, for example turning a sequence of sound pressures into a sequence of word identities. When there is no separate target sequence, we can get a teaching signal by trying to predict the next term in the input sequence. The target output sequence is then the input sequence advanced by one step. This seems much more natural than trying to predict one pixel in an image from the other pixels, or one patch of an image from the rest of the image. Predicting the next term in a sequence blurs the distinction between supervised and unsupervised learning: it uses methods designed for supervised learning, but it does not require a separate teaching signal.
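In code, building that teaching signal is just a one-step shift of the sequence against itself. The helper name next_item_pairs is made up for this sketch.

```python
def next_item_pairs(sequence):
    """Pair each item with the item that follows it: (input, target)."""
    return list(zip(sequence[:-1], sequence[1:]))

pairs = next_item_pairs("hello")
# pairs == [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
```

No labels were supplied from outside: the targets come from the data itself.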

Recurrent neural networks can store information in their hidden state for a long time. They are powerful because they combine two properties:

A distributed hidden state that allows them to store a lot of information about the past efficiently.
Nonlinear dynamics that allow them to update their hidden state in complicated ways.

With enough neurons and time, a recurrent neural network can compute anything that can be computed by your computer. So what kinds of behavior can recurrent neural networks exhibit? They can oscillate, they can settle to point attractors, and they can behave chaotically. They can also potentially learn to implement many small programs that run in parallel and interact to produce very complicated effects.

However, the computational power of recurrent neural networks also makes them very hard to train, largely because of the exploding and vanishing gradient problem. What happens to the magnitude of the gradients as we backpropagate through many layers? If the weights are small, the gradients shrink exponentially; if the weights are large, the gradients grow exponentially.
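A deliberately oversimplified sketch shows the exponential effect: if every layer multiplies the backpropagated signal by the same scalar weight (nonlinearities ignored, which is an assumption made only for this illustration), fifty layers are enough to crush or blow up the gradient.

```python
def backpropagated_scale(weight, steps):
    """Scale factor picked up by a gradient passed back through `steps` linear
    layers that all share the same scalar weight."""
    scale = 1.0
    for _ in range(steps):
        scale *= weight
    return scale

small = backpropagated_scale(0.5, 50)   # shrinks toward zero
large = backpropagated_scale(1.5, 50)   # blows up
```

In a real network the scalar is replaced by weight matrices and activation derivatives, but the repeated multiplication is the same.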

Typical feedforward neural networks can cope with these exponential effects because they have only a few hidden layers. In a recurrent neural network trained on long sequences, however, the gradients can easily explode or vanish. Even with good initial weights, it is very hard to detect that the current target output depends on an input from many time steps ago, so recurrent neural networks have difficulty dealing with long-range dependencies.

There are four effective ways to learn recurrent neural networks:

Long short-term memory: Build the recurrent neural network out of small modules that are designed to remember values for a long time.
Hessian-free optimization: Deal with the vanishing gradient problem by using a sophisticated optimizer that can detect directions with a tiny gradient but even smaller curvature.
Echo state networks: Initialize the input -> hidden, hidden -> hidden and output -> hidden connections very carefully, so that the hidden state has a huge “reservoir” of weakly coupled oscillators that can be selectively driven by the input.
Good initialization with momentum: Initialize as in echo state networks, but then learn all of the connections using momentum.

4. Long short-term memory network

In 1997, Hochreiter and Schmidhuber built what is now known as the long short-term memory (LSTM) network, which solved the problem of getting a recurrent neural network to remember things for a long time. They designed a memory cell using logistic and linear units with multiplicative interactions. Information gets into the cell whenever its “write” gate is on, and it stays in the cell as long as its “keep” gate is on. Information is read from the cell by turning on its “read” gate.
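The write, keep and read behavior can be sketched with a single scalar cell. This follows the simplified description above rather than the full published LSTM equations: the three controls are passed in as numbers between 0 and 1, whereas in a real network they would be logistic units computed from the inputs and the hidden state. The function name is hypothetical.

```python
def memory_cell_step(state, value, write, keep, read):
    """One time step of a simplified memory cell.

    write, keep and read are control activations in [0, 1]; they interact
    multiplicatively with the incoming value and the stored state.
    """
    state = keep * state + write * value
    return state, read * state

state = 0.0
state, out1 = memory_cell_step(state, 1.7, write=1, keep=0, read=0)  # store 1.7
state, out2 = memory_cell_step(state, 9.9, write=0, keep=1, read=0)  # hold it
state, out3 = memory_cell_step(state, 9.9, write=0, keep=1, read=1)  # read it out
state, out4 = memory_cell_step(state, 9.9, write=0, keep=0, read=1)  # erased
```

The stored value survives the second step untouched because keep is 1 and write is 0, which is the mechanism that lets information persist over many time steps.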

Reading cursive handwriting is a natural task for a recurrent neural network. The input is usually a sequence of (x, y, p) coordinates of the tip of the pen, where p indicates whether the pen is up or down, and the output is a sequence of characters. In 2009, Graves and Schmidhuber showed that recurrent neural networks with LSTM were the best systems at the time for reading cursive writing. In brief, they used a sequence of small images as input rather than pen coordinates.

5. Hopfield network

Recurrent networks of nonlinear units are generally very hard to analyze. They can behave in many different ways: settle to a stable state, oscillate, or follow chaotic trajectories that cannot be predicted far into the future. A Hopfield network is composed of binary threshold units with recurrent connections between them. In 1982, John Hopfield realized that if the connections are symmetric, there is a global energy function. Each binary “configuration” of the whole network has an energy, and the binary threshold decision rule causes the network to settle to a minimum of this energy function. A neat way to make use of this type of computation is to use memories as energy minima for the neural network. Using energy minima to represent memories gives a content-addressable memory: an item can be accessed by knowing only part of its content, and the memory is robust against hardware damage.
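A small sketch of content-addressable recall, under a few assumptions made for illustration: units take the values +1 and -1, biases are omitted, and the weights come from the simple Hebbian storage rule. The function names are hypothetical.

```python
def energy(state, weights):
    """Global energy of a configuration of +1/-1 units (biases omitted)."""
    n = len(state)
    return -sum(weights[i][j] * state[i] * state[j]
                for i in range(n) for j in range(i + 1, n))

def store(patterns, n):
    """Hebbian storage rule: symmetric weights, no self-connections."""
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]

def settle(state, weights, sweeps=5):
    """Binary threshold updates, one unit at a time; energy never increases."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            total = sum(weights[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if total >= 0 else -1
    return state

memory = [1, -1, 1, -1, 1, -1]
weights = store([memory], n=6)
corrupted = [1, -1, 1, -1, -1, -1]          # one unit flipped
recalled = settle(corrupted, weights)
```

Starting from the corrupted pattern, the updates roll downhill in energy and end at the stored memory.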

Each time we memorize a configuration, we hope to create a new energy minimum. But what if two nearby minima merge into a single minimum at an intermediate location? This limits the capacity of a Hopfield network. So how can we increase that capacity? Physicists love this question, because the mathematics they already know might help explain how the brain works. Many papers were published in physics journals about Hopfield networks and their storage capacity. Eventually, Elizabeth Gardner figured out that there is a much better storage rule that uses the full capacity of the weights. Instead of trying to store the vectors in one shot, she cycled through the training set many times and used the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector. Statisticians call this technique “pseudo-likelihood”.

There is another computational role for Hopfield networks. Instead of using the network to store memories, we can use it to construct interpretations of sensory input. The input is represented by the visible units, the interpretation by the states of the hidden units, and the badness of the interpretation by the energy.

6. Boltzmann machine

A Boltzmann machine is a type of stochastic recurrent neural network. It can be seen as the stochastic, generative counterpart of the Hopfield network. It was one of the first neural networks capable of learning internal representations, and it is able to represent and solve difficult combinatorial problems.

The goal of the Boltzmann machine learning algorithm is to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors.

In 2012, Salakhutdinov and Hinton proposed an efficient mini-batch learning procedure for Boltzmann machines:

For the positive phase, first initialize the hidden probabilities to 0.5, then clamp a data vector on the visible units, and then update all the hidden units in parallel with mean-field updates until convergence. After the network has converged, record PiPj for every connected pair of units and average this over all the data in the mini-batch.
For the negative phase, first keep a set of “fantasy particles”, each of which has a value that is a global configuration. Then sequentially update all the units in each fantasy particle a few times. For every connected pair of units, average SiSj over all the fantasy particles.
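The reason two phases are needed can be written down compactly. In standard notation (the symbols are assumptions introduced here: s_i is the binary state of unit i, w_ij the symmetric weight between units i and j, and v a training vector), the gradient of the log probability is a difference of two pairwise statistics:

```latex
\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}}
  = \langle s_i s_j \rangle_{\mathbf{v}} - \langle s_i s_j \rangle_{\text{model}},
\qquad
\Delta w_{ij} \propto \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}
```

The positive phase estimates the first average with the data clamped on the visible units, and the negative phase estimates the second average from the model running freely.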

In a general Boltzmann machine, the stochastic updates of the units need to be sequential. There is a special architecture that allows alternating parallel updates, which are much more efficient: there are no connections within a layer and no skip-layer connections. This makes the updates in the mini-batch procedure more parallel. The architecture is called a deep Boltzmann machine (DBM), and it is really a general Boltzmann machine with a lot of missing connections.

A simpler special case is the restricted Boltzmann machine (RBM), which dates back to Smolensky’s 1986 “Harmonium” and was later popularized by Hinton and colleagues. An RBM restricts the connectivity of the network to make inference and learning easier: there is only one layer of hidden units, with no connections between hidden units. In an RBM, it takes only one step to reach thermal equilibrium when the visible units are clamped.

An efficient mini-batch learning procedure for restricted Boltzmann machines is:

For the positive phase: first clamp a data vector on the visible units, then compute the exact value of ViHj for all pairs of a visible and a hidden unit. For every connected pair of units, average ViHj over all the data in the mini-batch.
For the negative phase: also keep a set of “fantasy particles”, then update each fantasy particle a few times using alternating parallel updates. For every connected pair of units, average ViHj over all the fantasy particles.
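The two phases can be sketched for a tiny restricted Boltzmann machine as follows. This is an illustrative sketch, not the authors’ code: the function names, learning rate and number of Gibbs steps are assumptions, and bias updates are omitted to keep it short.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, b_hid):
    """p(h_j = 1 | v): hidden units are independent given the visible units."""
    return [sigmoid(b_hid[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_hid))]

def visible_probs(h, W, b_vis):
    """p(v_i = 1 | h): visible units are independent given the hidden units."""
    return [sigmoid(b_vis[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_vis))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def pair_average(visibles, hiddens):
    """Average v_i * h_j over a list of (visible, hidden) vectors."""
    n = len(visibles)
    return [[sum(v[i] * h[j] for v, h in zip(visibles, hiddens)) / n
             for j in range(len(hiddens[0]))] for i in range(len(visibles[0]))]

def rbm_update(batch, particles, W, b_vis, b_hid, rng, lr=0.1, gibbs_steps=2):
    # Positive phase: clamp the data; one pass gives the exact hidden probabilities.
    positive = pair_average(batch, [hidden_probs(v, W, b_hid) for v in batch])
    # Negative phase: alternating parallel updates of each fantasy particle.
    for _ in range(gibbs_steps):
        hiddens = [sample(hidden_probs(v, W, b_hid), rng) for v in particles]
        particles = [sample(visible_probs(h, W, b_vis), rng) for h in hiddens]
    negative = pair_average(particles, [hidden_probs(v, W, b_hid) for v in particles])
    new_W = [[W[i][j] + lr * (positive[i][j] - negative[i][j])
              for j in range(len(b_hid))] for i in range(len(b_vis))]
    return new_W, particles

rng = random.Random(0)
W0 = [[0.0, 0.0] for _ in range(3)]          # 3 visible units, 2 hidden units
new_W, new_particles = rbm_update([[1, 0, 1], [1, 1, 0]], [[0, 0, 0], [1, 1, 1]],
                                  W0, [0.0] * 3, [0.0] * 2, rng)
```

The returned particles would be carried over to the next mini-batch, so the fantasy particles persist across updates.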

7. Deep belief network (DBN)

Backpropagation is considered the standard method in artificial neural networks for calculating the error contribution of each neuron after a batch of data is processed. However, there are several major problems with using backpropagation:

It requires labeled training data, while almost all data is unlabeled.
The learning time does not scale well, which means that learning is very slow in networks with multiple hidden layers.
It can get stuck in poor local optima, so for deep networks the result is far from optimal.

To overcome the limitations of backpropagation, researchers have considered unsupervised learning approaches. These keep the efficiency and simplicity of using a gradient method to adjust the weights, but use it to model the structure of the sensory input. In particular, the weights are adjusted to maximize the probability that a generative model would have generated the sensory input. The question is, what kind of generative model should we learn? Could it be an energy-based model like a Boltzmann machine? A causal model made up of idealized neurons? Or a hybrid of the two?

A belief network is a directed acyclic graph composed of stochastic variables. With a belief network, we get to observe some of the variables, and we would like to solve two problems: 1) the inference problem: infer the states of the unobserved variables; 2) the learning problem: adjust the interactions between variables to make the network more likely to generate the training data.

Early graphical models relied on experts to define the graph structure and the conditional probabilities. Those graphs were sparsely connected, so researchers initially focused on doing correct inference rather than on learning. For neural networks, learning is central. Even so, there are neural network versions of belief networks.

There are two types of generative neural network composed of stochastic binary neurons: 1) energy-based models: we connect binary stochastic neurons using symmetric connections to get a Boltzmann machine; 2) causal models: we connect binary stochastic neurons in a directed acyclic graph to get a sigmoid belief network. This article will not go into detail about these two types of networks.

8. Deep autoencoder

Finally, let’s discuss deep autoencoders. They look like a very good way to do nonlinear dimensionality reduction, for several reasons: they provide flexible mappings in both directions; the learning time is linear in the number of training cases; and the final encoding model is compact and fast. However, deep autoencoders turned out to be very difficult to optimize with backpropagation: with small initial weights, the backpropagated gradient dies. We now have better ways to optimize them: use unsupervised layer-by-layer pretraining, or just initialize the weights carefully, as in echo state networks.

For the pretraining task, there are three different types of shallow autoencoder:

RBM as an autoencoder: When trained with one-step contrastive divergence, an RBM tries to make its reconstructions look like the data, so it behaves much like an autoencoder, but one that is strongly regularized by using binary activities in the hidden layer. When trained with maximum likelihood, RBMs are not like autoencoders. The stack of RBMs used for pretraining can be replaced with a stack of shallow autoencoders.
Denoising autoencoders: These add noise to the input vector by setting some of its components to 0. The network is still required to reconstruct these components, so the denoising autoencoder must extract features that capture correlations between inputs. Pretraining is very effective if a stack of denoising autoencoders is used.
Contractive autoencoders: Another way to regularize an autoencoder is to make the activities of the hidden units as insensitive as possible to the inputs; they cannot simply ignore the inputs, because they have to reconstruct them. We achieve this by penalizing the squared gradient of each hidden activity with respect to the inputs. Contractive autoencoders work very well for pretraining, and their codes tend to have the property that only a small subset of the hidden units are sensitive to changes in the input.
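The last two regularizers are simple to write down. The sketch below assumes logistic hidden units and uses made-up function names; it shows only the input corruption and the penalty term, not a full training loop.

```python
import math
import random

def corrupt(x, rng, drop_prob=0.3):
    """Denoising autoencoder input noise: set a random subset of components to 0."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def hidden(x, W, b):
    """Logistic hidden activities h_j = sigmoid(b_j + sum_i W[j][i] * x_i)."""
    return [1.0 / (1.0 + math.exp(-(bj + sum(w * xi for w, xi in zip(row, x)))))
            for row, bj in zip(W, b)]

def contractive_penalty(x, W, b):
    """Sum of squared gradients dh_j/dx_i, using dh_j/dx_i = h_j * (1 - h_j) * W[j][i]."""
    return sum((hj * (1.0 - hj)) ** 2 * sum(w * w for w in row)
               for hj, row in zip(hidden(x, W, b), W))

# Hypothetical weights: 2 inputs -> 2 hidden units.
W = [[0.5, -1.0], [2.0, 0.3]]
b = [0.1, -0.2]
x = [0.4, 0.7]
noisy_x = corrupt(x, random.Random(1))
penalty = contractive_penalty(x, W, b)
```

During training, a denoising autoencoder would encode noisy_x but be scored on reconstructing x, while a contractive autoencoder would add penalty (times some coefficient) to its reconstruction error.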

In short, there are now many different ways to pretrain features layer by layer. For data sets that do not have a huge number of labeled cases, pretraining helps subsequent discriminative learning. For very large labeled data sets, initializing the weights for supervised learning with unsupervised pretraining is not necessary, even for deep networks. Pretraining was the first good way to initialize the weights of deep networks, but there are now other ways. If we make the networks much larger, however, we will need pretraining again!

That was a lot to take in at once, so here is a brief recap.

Neural networks are one of the most beautiful programming paradigms ever invented. In traditional programming, we tell the computer what to do, breaking big problems down into many small, precisely defined tasks that the computer can easily perform. In a neural network, by contrast, we do not tell the computer how to solve our problem. Instead, it learns from observational data and figures out its own solution to the problem at hand.

Today, deep neural networks and deep learning achieve outstanding performance on many important problems in computer vision, speech recognition and natural language processing. They are deployed on a large scale by companies such as Google, Microsoft and Facebook.

All the handouts, research papers and programming assignments for Hinton’s Coursera courses are also available on GitHub.

I hope this article will help you learn the core concepts of neural networks!