1. Introduction

Deep learning is very popular these days, and new neural network architectures keep appearing. It is hard to keep up with them as they arrive. Just knowing all the abbreviations (DCIGN, BiLSTM, DCGAN… anyone?) is already overwhelming.

So here is a list that sorts through them. Most are artificial neural networks, but some are completely different beasts. Although these architectures are all different and each has its own purpose, once I drew them as node graphs their underlying relationships started to become clear.

One problem with drawing these architectures as node graphs is that the graphs do not show how the architectures are actually used. For example, variational autoencoders (VAE) look just like autoencoders (AE), but their training processes are quite different. The trained models differ even more in how they are used: VAEs are generators, which produce new samples from inserted noise; AEs simply map whatever input they receive to the most similar training sample they “remember.”

Before looking at how neurons and layers are connected in the different models, let’s go step by step through how the different kinds of neuron nodes work.

1.1 Neurons

Marking different types of neurons with different colors makes it easier to tell the network architectures apart, but most of these neurons work in much the same way. A detailed explanation of the basic neuron structure is given below:

The basic neural network cell is fairly simple; it is the kind found in conventional feedforward architectures. The cell is connected to other neurons through weighted connections, and it can be connected to every neuron in the previous layer.

Each connection has its own weight, which usually starts out as a random value (how to initialize the weights is an important topic in itself, since it directly affects training and ultimately the performance of the whole model). A weight can be negative, positive, very small, very large, or zero. The value of every neuron connected to this cell is multiplied by the weight of its connection, and the results are summed.

A bias is then added to this sum. The bias can prevent the output from getting stuck at zero and can speed up certain operations, reducing the number of neurons needed to solve a problem. The bias is also a number, sometimes a constant (usually -1 or 1) and sometimes a variable. The total is then fed into an activation function, and the output of the activation function is the output of the neuron.
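The computation described in this section (multiply, sum, add a bias, apply an activation) can be sketched in a few lines. This is only an illustration: the function name `neuron_output`, the choice of `tanh` as the activation, and the example numbers are assumptions, not something the article prescribes.

```python
import math

def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return activation(total + bias)

# Three incoming values, each with its own weight (weights may be negative or zero).
y = neuron_output([0.5, -1.0, 2.0], [0.8, 0.0, -0.3], bias=1.0)

# With all inputs at zero, the bias alone still produces a non-zero output.
y_bias_only = neuron_output([0.0, 0.0], [1.0, 1.0], bias=1.0)
```

Swapping `math.tanh` for another function changes the activation without touching the rest of the cell.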

1.2 Convolutional Cells

Convolutional cells are very similar to feedforward neurons, except that they are connected to only part of the previous layer. They are not connected to arbitrary neurons but to all neurons within a certain range, which lets them preserve spatial information. This makes them very useful for data with a lot of local structure, such as images and audio (but mostly images).

1.3 Deconvolutional Cells

Deconvolutional cells are just the opposite: they decode spatial information by being locally connected to the next layer. Both types of cell usually have many copies that are trained independently; each copy has its own weights but is connected in exactly the same way. These copies can be thought of as sitting in separate networks that all have the same structure. Both types are essentially ordinary neurons; they are just used differently.

1.4 Pooling and Interpolating Cells

Pooling and interpolating cells are often used together with convolutional cells. They are not really neurons; they only perform a simple operation.

Pooling cells receive the outputs of other neurons and decide which values pass through and which do not. For images, this can be understood as shrinking the picture, much like zooming out in image software: the image gets smaller, you can no longer see every pixel, and the pooling function decides which pixels to keep and which to discard.

Interpolating cells do the opposite: they take in some information and map it out to more. The extra information is made up, like zooming in on a low-resolution image. Interpolation is not the only inverse of pooling, but it is common because it is fast and simple to implement. Pooling and interpolating cells relate to each other much as convolutional and deconvolutional cells do.
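As a rough sketch of the two operations, the example below shrinks a small image with 2 x 2 max pooling and then enlarges it again with nearest-neighbour interpolation. Both function names and the specific pooling and interpolation rules are assumptions chosen for simplicity; other rules (average pooling, bilinear interpolation) are equally valid.

```python
def max_pool_2x2(image):
    """Keep the largest value of every non-overlapping 2x2 block."""
    return [
        [max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, len(image[0]) - 1, 2)]
        for r in range(0, len(image) - 1, 2)
    ]

def upsample_nearest_2x(image):
    """Repeat every pixel in a 2x2 block (nearest-neighbour interpolation)."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(image)            # 4x4 -> 2x2: [[4, 2], [2, 8]]
restored = upsample_nearest_2x(pooled)  # 2x2 -> 4x4, but the discarded detail is gone
```

The round trip returns an image of the original size, yet only the values that survived pooling are left, which is why interpolation is described as making information up.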

1.5 Mean and Standard Deviation Cells (probabilistic cells that always appear in pairs)

These cells describe a probability distribution. The mean is the average value, and the standard deviation describes how far the values spread around the mean. For example, a probabilistic cell used for images could hold information on how much red there is in a particular pixel: the mean might be 0.5 and the standard deviation 0.2. To sample from these cells, you feed the two values into a Gaussian random number generator. Values near 0.5, such as those between 0.4 and 0.6, are quite likely; the farther a value is from 0.5, the less likely it is to be generated. These cells are generally fully connected to the previous or the next layer, and they have no bias.
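Sampling from such a pair of cells might look like the sketch below, which uses the example values from the text (mean 0.5, standard deviation 0.2). The helper name `sample_from_cells`, the fixed seed, and the two comparison intervals are assumptions made for illustration.

```python
import random

def sample_from_cells(mean, std, rng):
    """Draw one value from the Gaussian described by a mean/std cell pair."""
    return rng.gauss(mean, std)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_from_cells(0.5, 0.2, rng) for _ in range(10000)]

average = sum(samples) / len(samples)
# Share of samples close to the mean versus the same-width band far from it.
near = sum(0.4 <= s <= 0.6 for s in samples) / len(samples)
far = sum(0.9 <= s <= 1.1 for s in samples) / len(samples)
```

Comparing `near` with `far` shows the point made above: values far from the mean are still possible, just much rarer.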

1.6 Recurrent Cells

Recurrent cells have connections not only between layers but also through time. Each cell stores its previous value internally. They are updated like basic cells, but with extra weights: one connected to the cell’s own previous value and, in most cases, others connected to the previous values of the other cells in the same layer. The weight between the current value and the stored previous value works much like volatile memory (such as RAM), from which it inherits two properties:

  • First, it holds a particular state.
  • Second, if it is not continually fed (given input), the state vanishes.

The previous value has already passed through an activation function, and at every update it is passed through the activation function again together with the other weighted inputs, so information is lost continually. In fact, the retention rate is so low that after only four or five iterations almost all of the earlier information is gone.
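The fading of the stored value can be seen in a toy example with a single recurrent cell. The weights (`w_in = 1.0`, `w_rec = 0.5`) and the `tanh` activation are assumptions picked for illustration; with other weights the state fades faster or slower.

```python
import math

def recurrent_step(x, previous, w_in=1.0, w_rec=0.5):
    """New value from the current input and the stored previous value."""
    return math.tanh(w_in * x + w_rec * previous)

value = 0.0
trace = []
for x in [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]:   # one input, then silence
    value = recurrent_step(x, value)
    trace.append(value)
```

Each entry of `trace` is smaller than the one before it: once the cell stops being fed, the memory of the first input drains away within a handful of steps.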

1.7 Long Short Term Memory Cells

LSTM cells were designed to overcome the rapid loss of information in recurrent cells.

An LSTM cell is a logic circuit whose design is borrowed from computer memory cells. Whereas a recurrent cell stores two states, an LSTM cell stores four: the current and previous values of the output, and the current and previous values of the state of the “memory cell.” Each LSTM cell has three gates (an input gate, an output gate, and a forget gate) as well as the regular input.

Each gate has its own weight, which means that connecting to this type of cell requires four weights (instead of one). The gates work like flow gates, not fence gates: they can let everything through, only part of it, or nothing at all.

A gate works by multiplying the incoming information by a value between 0 and 1, which is stored in the gate. The input gate thus determines how much of the input is added to the cell value. The output gate determines how much of the output value is visible to the rest of the network. The forget gate is connected not to the previous value of the output but to the previous value of the memory cell, and it determines how much of the last memory cell state is retained. Because this loop is not connected to the output and contains no activation function, much less information is lost.
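A minimal scalar sketch of one LSTM step is shown below, following the commonly used formulation (sigmoid gates, `tanh` on the regular input). The function and parameter names, and the extreme bias values used to force the gates open or closed, are assumptions made so the effect is easy to see.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell. p maps each part to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input_gate"))    # how much new input enters the memory
    f = sigmoid(pre("forget_gate"))   # how much of the old memory is kept
    o = sigmoid(pre("output_gate"))   # how much of the memory is exposed
    g = math.tanh(pre("candidate"))   # the regular input
    c = f * c_prev + i * g            # memory update: no activation in this loop
    h = o * math.tanh(c)              # output seen by the rest of the network
    return h, c

params = {
    "input_gate": (0.0, 0.0, -100.0),   # closed: ignore new input
    "forget_gate": (0.0, 0.0, 100.0),   # open: keep the old memory
    "output_gate": (0.0, 0.0, 100.0),   # open: expose the memory
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 0.7
for x in [1.0, -1.0, 0.5, 2.0, -3.0]:
    h, c = lstm_step(x, h, c, params)
```

With the input gate closed and the forget gate open, the memory value `c` survives all five steps unchanged, in contrast to a plain recurrent cell, whose state fades within a few updates.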

1.8 Gated Recurrent Units (Cells)

GRU cells are a variant of LSTM cells. They also use gates to limit the loss of information, but only two of them: an update gate and a reset gate. This makes them cheaper to build and faster to run, since they use fewer connections throughout.

There are essentially two differences between LSTM and GRU:

  • First, GRU cells have no hidden cell state protected by an output gate.
  • Second, GRU cells combine the input gate and the forget gate into a single update gate. The idea is that if you want to let in new information, you can forget some old information (and vice versa).
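The trade-off between old and new information can be sketched with a scalar GRU step. The names and parameter values are assumptions, and so is the convention that the update gate `z` weights the new candidate while `1 - z` weights the old state (some references swap the two).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU cell. p maps each part to (w_x, w_h, bias)."""
    def pre(name, h):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b

    z = sigmoid(pre("update_gate", h_prev))   # balance of old state vs. new candidate
    r = sigmoid(pre("reset_gate", h_prev))    # how much old state feeds the candidate
    candidate = math.tanh(pre("candidate", r * h_prev))
    return (1.0 - z) * h_prev + z * candidate  # no separate memory cell

params = {
    "update_gate": (0.0, 0.0, 0.0),    # z = 0.5: half old, half new
    "reset_gate": (0.0, 0.0, 100.0),   # r close to 1: use the full old state
    "candidate": (1.0, 0.0, 0.0),
}
h = gru_step(x=1.0, h_prev=0.2, p=params)
```

Because a single value `z` controls both terms, taking in more of the candidate automatically means keeping less of the previous state.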

1.9 Layers

The simplest way to connect neurons into a network is to connect every neuron to every other neuron. This is how Hopfield networks and Boltzmann machines are wired. Of course, this means that the number of connections grows quadratically with the number of neurons, but the network also becomes more expressive. This is what is called completely (or fully) connected.

Over time, it turned out that dividing a network into layers is very effective. A layer is defined as a group of neurons that are not connected to each other, only to neurons in other layers. This concept is used, for example, in Restricted Boltzmann Machines. Nowadays, using neural networks means using layers, in any number. One of the more confusing terms is “fully connected” or “completely connected” when it is used to mean that every neuron in one layer is connected to every neuron in another layer, because truly fully connected networks are quite rare.

1.10 Convolutionally Connected Layers

Convolutionally connected layers are even more constrained than fully connected layers: each neuron is connected only to nearby neurons in the adjacent layer. Images and sound contain a huge amount of information if they are fed one-to-one into a network (for example, one neuron per pixel). The idea of convolutional connections comes from the observation that spatial information is probably the important thing to preserve. That turned out to be a good guess, since most image and speech applications based on neural networks now use this kind of connection. It is also much cheaper than a fully connected layer. In essence, convolutional connections act as an importance filter, deciding which tightly grouped packets of information matter; they are also very useful for dimensionality reduction.

Beyond these spatially arranged connections, another option is of course to connect neurons at random. There are two main variants:

  • First, allow only a fraction of all possible connections.
  • Second, connect only a fraction of the neurons between layers.

Random connections reduce the computational load of a network linearly, which helps in large networks where fully connected layers run into performance problems. A sparser layer with more neurons works better in some cases, especially when a lot of information needs to be stored but not much needs to be exchanged (similar to how convolutionally connected layers work, but randomized). Very sparsely connected networks (1% or 2%) are also used, as in ELMs, ESNs and LSMs. This is especially true of spiking networks: the more connections a neuron has, the less energy each weight carries, which means fewer propagating and repeating patterns.

1.11 Time Delayed Connections

Time delayed connections are connections between neurons (usually within the same layer, or even from a neuron to itself) that carry information not from the previous layer but from a previous state of the layer. This allows temporally related information (time, sequence, or order) to be stored. These connections are often reset manually from time to time to clear the state of the network. The main difference from regular connections is that their values keep changing, even when the network is not being trained.

The following figure shows the small networks described above and their connections. I use this diagram as a reference when I am stuck on which neuron should connect to which, and in which layer (especially when working with LSTM and GRU cells):

Obviously, compiling a complete list is impractical, as new architectures are invented all the time. The purpose of this list is just to give you an overview of the field of artificial intelligence. For each architecture drawn as a node graph, I have written a very, very brief description. You may find these descriptions useful, since there will always be some architecture you are not familiar with.

It is worth noting that while most abbreviations are generally accepted, some conflict. RNN sometimes stands for recursive neural network, but most of the time it refers to recurrent neural networks. That is not all: in many places RNN is used for any recurrent architecture, including LSTMs, GRUs, and even the bidirectional variants. AE suffers from the same problem: VAEs, DAEs, and similar structures are sometimes simply called AEs. Many abbreviations also vary in the number of “N”s at the end, because the same architecture can be called a convolutional neural network or just a convolutional network, giving both CNN and CN.

2. Feedforward Neural Network (FFNN)

Feedforward neural networks (FF or FFNN) and perceptrons (P) are very straightforward. Information flows from the front to the back (input and output, respectively).

Neural networks are usually described in terms of layers, where a layer is a set of parallel input, hidden, or output cells. Neurons within a single layer are not connected to each other, and two adjacent layers are normally fully connected (each neuron in one layer is connected to each neuron in the other). The simplest somewhat practical network has two input neurons and one output neuron, which is enough to model a logic gate. FFNNs are usually trained with backpropagation, given paired data sets (the “input data” and the “output we expect”).

This is called supervised learning. Its opposite is unsupervised learning, in which we give only the input and let the network look for patterns in the data. The error that is backpropagated is usually some variation of the difference between the network’s current output and the expected output (such as MSE, or just the plain difference). Given enough hidden neurons, a network can in theory always model the relationship between the input and the output. In practice, FFNNs on their own are of limited use, but they are often combined with other networks to form new architectures.
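The two-input, one-output logic gate mentioned above is small enough to train by hand. The sketch below uses the classic perceptron learning rule rather than backpropagation (a deliberate simplification, since a single step-activation neuron has no hidden layer to propagate through); the function names, the integer weights, and the learning rate of 1 are assumptions.

```python
def predict(weights, bias, inputs):
    """Step activation on the weighted sum."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20):
    """Perceptron learning rule with integer weights and a learning rate of 1."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # supervised signal
            weights = [w + error * x for w, x in zip(weights, inputs)]
            bias += error
    return weights, bias

# Input data paired with the output we expect (an AND gate).
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
outputs = [predict(weights, bias, inputs) for inputs, _ in and_gate]
```

After training, `outputs` reproduces the AND truth table. A gate such as XOR is not linearly separable, so it would need a hidden layer and a method like backpropagation.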

Reference: Rosenblatt, Frank. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65.6 (1958): 386.

3. Radial Basis Function Network (RBF)

Radial basis function (RBF) networks are feedforward neural networks that use radial basis functions as their activation functions. There is nothing more to it. That is not to say they have no uses, but most FFNNs with other activation functions do not get a name of their own. It mostly comes down to when they were invented.

References: Broomhead, David S., and David Lowe. Radial basis functions, multi-variable functional interpolation and adaptive networks. No. RSRE-MEMO-4148. ROYAL SIGNALS AND RADAR ESTABLISHMENT MALVERN (UNITED KINGDOM), 1988.

4. Hopfield Network (HN)

A Hopfield network (HN) is a network in which every neuron is connected to every other neuron.

It is like a thoroughly tangled plate of spaghetti, in which every node plays every role: each node is an input neuron before training, a hidden neuron during training, and an output neuron afterwards.

The network is trained by setting the neurons to the desired pattern and then computing the weights. After that, the weights no longer change. Once the network has been trained on one or more patterns, it will always converge to one of the learned patterns, because it is stable only in those states. Note that this is not necessarily the desired state (unfortunately, it is not a magic black box). The network stabilizes in part because its total “energy” or “temperature” is reduced gradually during training. Each neuron has an activation threshold that scales with this temperature; when the sum of the neuron’s inputs exceeds the threshold, the neuron takes one of two states (usually -1 or 1, sometimes 0 or 1).

The neurons can be updated all at once (synchronously) or one at a time. Once every neuron has been updated and none of them changes any more, the network is stable (annealed) and is said to have converged. Networks of this type are often called “associative memory” because they converge to the stored state most similar to the input. A human who sees half a table can picture the other half; likewise, given half noise and half a table, the network converges to the whole table.
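A small associative memory along these lines can be sketched as follows. It assumes states of -1 and 1, a Hebbian rule for computing the weights, a fixed threshold of 0 (no temperature schedule), and neuron-by-neuron updates; the function names and the two stored patterns are made up for the example.

```python
def train_hopfield(patterns):
    """Hebbian rule: weights are computed once from the stored patterns."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:                      # no self-connections
                    weights[i][j] += p[i] * p[j]
    return weights

def recall(weights, state, max_sweeps=10):
    """Update neuron by neuron until a full sweep changes nothing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i, row in enumerate(weights):
            field = sum(w * s for w, s in zip(row, state))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break                               # the network has converged
    return state

stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hopfield(stored)
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]           # first pattern with one unit flipped
result = recall(weights, noisy)
```

Here `result` equals the first stored pattern: the corrupted input is pulled back to the closest memory. Storing too many or too similar patterns would produce spurious stable states, which is the "not necessarily the desired state" caveat in practice.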

References: Hopfield, John J. “Neural Networks and Physical Systems with Emergent Collective computational Abilities.” Proceedings of the National Academy of Sciences 79.8 (1982): 2554-2558.

5. Markov Chain (MC)

Markov chains (MC), or discrete time Markov chains (DTMC), are in a sense the predecessors of BMs and HNs. They can be understood as follows: starting from the node I am at now, what is the probability of moving to each neighboring node? They are memoryless (the so-called Markov property): each state depends only on the state immediately before it. Although they are not really neural networks, they resemble them and form the theoretical basis of BMs and HNs. Like BMs, RBMs, and HNs, MCs are not always counted as neural networks. They are not always fully connected, either.
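A Markov chain is little more than a table of transition probabilities plus a rule for stepping. The two-state weather chain below is entirely hypothetical (state names, probabilities, and helper names are all assumptions) and is only meant to show the memoryless step.

```python
import random

# Hypothetical two-state chain; each row lists P(next state | current state).
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current, rng):
    """The next state depends only on the current one (the Markov property)."""
    options = transitions[current]
    return rng.choices(list(options), weights=list(options.values()))[0]

def walk(start, steps, rng):
    """Follow the chain for a number of steps and return the visited states."""
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path

path = walk("sunny", 1000, random.Random(0))
sunny_share = path.count("sunny") / len(path)
```

Note that `next_state` never looks at anything except `current`; the history in `path` is kept only for inspection.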

Hayes, Brian. “First Links in the Markov chain.” American Scientist 101.2 (2013): 252.

Boltzmann machine (BM)

Boltzmann machines (BM) are very similar to Hopfield networks, except that some neurons are marked as input neurons and the rest remain hidden.

After a full network update, the input neurons become the output neurons. The network starts with random weights and learns through backpropagation or, more recently, through contrastive divergence (in which a Markov chain is used to determine the gradient between two information gains).

Compared to an HN, the neurons of a BM mostly have binary activation patterns. As its training with MCs suggests, a BM is a stochastic network. A BM is trained and run in much the same way as an HN: the input neurons are set to certain clamped values, after which the network is set free. While free, the neurons can take any value, and we repeatedly go back and forth between the input and hidden neurons. Activation is controlled by a global temperature; lowering the temperature lowers the energy of the neurons, and this lower energy causes their activation patterns to stabilize. At the right temperature, the network reaches an equilibrium.

References: Hinton, Geoffrey E., and Terrence J. Sejnowski. “Learning and Relearning in Boltzmann Machines.” Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1 (1986): 282-317.

8. Restricted Boltzmann Machine (RBM)

Restricted Boltzmann machines (RBM) are remarkably similar to BMs, and therefore also to HNs.

The biggest difference is that RBMs are more usable because they are more restricted. Instead of connecting every neuron to every other neuron, they only connect different groups of neurons to each other: no input neuron is connected directly to another input neuron, and no hidden neuron is connected directly to another hidden neuron.

RBMs can be trained like FFNNs with a twist: instead of passing data forward and then backpropagating the error, you pass the data forward and then pass it backward (back to the first layer). After that, you train with forward and back propagation.

References: Smolensky, Paul. Information Processing in dynamical Systems: Foundations of harmony theory. No. CU-CS-321-86. COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, 1986.

9. Autoencoder (AE)

Autoencoders (AE) are somewhat similar to FFNNs; they are more a different use of FFNNs than a fundamentally different architecture.

The basic idea of an autoencoder is to encode information automatically (in the sense of compressing, not encrypting), hence the name. The whole network has an hourglass shape: the hidden layers in the middle are smaller than the input and output layers on either side. Autoencoders are always symmetric around the middle layer or layers (one or two, depending on whether the number of layers is odd or even). The smallest layer is always in the middle, where the information is most compressed (the chokepoint of the network). Everything before the middle is the encoding part, everything after the middle is the decoding part, and the middle itself is the code.

Autoencoders can be trained with backpropagation: an input is fed in, and the error is the difference between the input and the output. Autoencoders can also be built with symmetric weights, so that the weights of the encoding part are the same as the weights of the decoding part.

References: Bourlard, Hervé, and Yves Kamp. “Auto-association by Multilayer Perceptrons and Singular Value Decomposition.” Biological Cybernetics 59.4-5 (1988): 291-294.

10. Sparse Autoencoder (SAE)

Sparse autoencoders (SAE) are in a way the opposite of autoencoders. Instead of teaching a network to represent a lot of information in a smaller space, we encode the information in a larger space. So instead of narrowing in the middle and then expanding back to the input size, the network widens in the middle. Networks of this type can be used to extract many small features from a data set.

If we trained a sparse autoencoder in the same way as an ordinary autoencoder, we would almost always end up with a useless identity network (input = output, without any transformation or decomposition). To prevent this, a sparsity driver is fed back along with the input. The sparsity driver can take the form of a threshold filter, so that only errors above a certain level are backpropagated and used for training, while the other errors are set to zero for that pass. In a way this resembles spiking neural networks, in which not all neurons fire all the time.

Ranzato, Marc’Aurelio, Christopher Poultney, Sumit Chopra, and Yann LeCun. “Efficient Learning of Sparse Representations with an Energy-Based Model.” Proceedings of NIPS. 2007.

11. Variational Autoencoder (VAE)

Variational autoencoders (VAE) have the same architecture as AEs but are “taught” something else: an approximated probability distribution of the input samples. This makes them more closely related to BMs and RBMs.

They do, however, rely on Bayesian mathematics for probabilistic inference and independence, and on a re-parametrisation trick to obtain this different representation. The inference and independence parts are intuitive, but they rest on fairly complex mathematics. The basic idea is to take influence into account. If one thing happens in one place and something else happens somewhere else, the two are not necessarily related. If they are not related, error propagation should take that into account. This is a useful approach because neural networks are (in a sense) very large graphs, so it helps if you can rule out the influence of some nodes on others as you go into deeper layers.

Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

12. Denoising Autoencoder (DAE)

Denoising autoencoders (DAE) are autoencoders whose input data has noise added to it (for example, making an image grainier).

The error is computed in the same way as for an ordinary autoencoder, so the output of the network is compared with the original, noise-free input. This encourages the network to learn broader features rather than details, because small details change constantly with the noise, and a model that learns them tends to perform poorly.

Vincent, Pascal, et al. “Extracting and composing robust features with denoising autoencoders.” Proceedings of the 25th International Conference on Machine Learning. ACM, 2008. machinelearning.org/archive/icm…

13. Deep Belief Network (DBN)

Deep belief network (DBN) is the name given to what is essentially a stack of RBMs or VAEs.

These networks have been shown to be effectively trainable layer by layer, where each AE or RBM only has to learn to encode the output of the previous layer. This technique is also known as greedy training, where greedy means reaching a decent (but possibly not globally optimal) solution by repeatedly taking a locally optimal one. DBNs can be trained with contrastive divergence or backpropagation, and they learn to represent the data as a probabilistic model, just like regular RBMs or VAEs. Once trained, or converged to a stable state through unsupervised learning, the model can be used to generate new data. If trained with contrastive divergence, it can even classify existing data, because the neurons have been taught to look for different features of the data.

Bengio, Yoshua, et al. “Greedy layer-wise training of deep networks.” Advances in Neural Information Processing Systems 19 (2007): 153. nips.cc/paper/3048-…

14. Convolutional Neural Networks (CNN)

Convolutional neural networks (CNN), or deep convolutional neural networks (DCNN), are quite different from most other networks. They are mainly used for image processing but can also handle other kinds of input, such as audio. A typical use case is classification: you feed the network an image and it outputs a class. If you give it a picture of a cat, it outputs “cat”; if you give it a picture of a dog, it outputs “dog.”

CNNs usually start with an input “scanner,” which does not try to parse all of the input data at once. For example, for a 200 x 200 pixel image, you would not want a layer with 40,000 nodes. Instead, you create a scanning input layer of, say, 20 x 20 nodes and feed it the first 20 x 20 pixels of the image (usually starting in the upper left corner). Once that part of the image has been processed (and possibly used for training), you move on to the next 20 x 20 pixels by shifting the scanner along the image, usually one pixel at a time, although the step size can be configured.

Note that you are not moving the scanner 20 pixels at a time (or whatever the scanner width is), and you are not cutting the image into 20 x 20 blocks; you are sliding the scanner over the image. This input (a 20 x 20 block of pixels) is then fed into convolutional layers rather than ordinary layers, and in these layers the nodes are not fully connected. Each node is connected only to its neighboring nodes (how close depends on the implementation, but usually no more than a few).
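The difference between sliding and slicing is easy to check by counting window positions. The sketch below also applies a tiny kernel to a 3 x 3 image in the same sliding fashion (without flipping the kernel, as most CNN implementations do). The function names, the square-image assumption, and the example kernel are illustrative only.

```python
def window_positions(image_size, window_size, stride=1):
    """Top-left corners visited while sliding a square window over a square image."""
    last = image_size - window_size
    return [(row, col)
            for row in range(0, last + 1, stride)
            for col in range(0, last + 1, stride)]

def convolve_valid(image, kernel):
    """Slide the kernel over the image and take a weighted sum at every position."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
         for c in range(size)]
        for r in range(size)
    ]

positions = window_positions(200, 20)           # sliding one pixel at a time
blocks = window_positions(200, 20, stride=20)   # slicing into chunks, for contrast

feature_map = convolve_valid(
    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]],
    [[1, 0],
     [0, 1]],
)
```

Sliding a 20 x 20 window one pixel at a time over a 200 x 200 image visits 181 x 181 overlapping positions, whereas cutting the image into chunks gives only 10 x 10.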

These convolutional layers tend to shrink as they get deeper, mostly by easily divisible factors of the input size (for example, 20 neurons followed by 10 neurons followed by 5). Powers of two (32, 16, 8, 4, 2, 1) are also very common, because by definition they can be halved cleanly and completely. Besides convolutional layers, CNNs usually also contain pooling layers.

Pooling is a way to filter out detail. A common technique is max pooling, where we take, say, 2 x 2 pixels and pass on the one with the largest value. To apply a CNN to audio, the audio clip is cut into segments and fed in segment by segment. In real-world implementations, an FFNN is often attached to the end of the CNN to process the data further, which allows for highly non-linear abstractions.

LeCun, Yann, et al. “Gradient-based Learning Applied to Document Recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324. yann.lecun.com/exdb/publis…

15. Deconvolution Network (DN)

Deconvolutional networks (DN), also called inverse graphics networks, are reversed convolutional neural networks.

Imagine feeding a network the word “cat” and training it to produce cat-like pictures by comparing what it generates with real pictures of cats. Like regular CNNs, DNs can be combined with FFNNs, but this is about where the line is drawn for inventing new abbreviations. They could be called deep deconvolutional networks, but you could argue that attaching an FFNN to the front of a DN gives a different architecture from attaching one to the back, and that each deserves its own name. Note that in most applications the network is not fed text-like input but a binary classification vector: for example, <0, 1> is a cat, <1, 0> is a dog, and <1, 1> is a cat and a dog.
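The binary input vectors from the example can be produced with a few lines. The class order (dog first, cat second) is chosen to match the vectors quoted in the text; the names `CLASSES` and `encode` are assumptions.

```python
CLASSES = ["dog", "cat"]   # position 0 is "dog", position 1 is "cat"

def encode(labels):
    """Binary vector with a 1 for every class that is present."""
    return [1 if name in labels else 0 for name in CLASSES]

cat = encode({"cat"})            # [0, 1]
dog = encode({"dog"})            # [1, 0]
both = encode({"cat", "dog"})    # [1, 1]
```

Because each position is independent, the same scheme can describe any combination of classes, which a single text label cannot do as directly.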

The pooling layers commonly found in CNNs are often replaced by the corresponding inverse operations, mainly interpolation and extrapolation with biased assumptions (if a pooling layer uses max pooling, the inverse operation can generate new data that is exclusively lower than the maximum).

Zeiler, Matthew D. et al. “Deconvolutional Networks.” Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010. www.matthewzeiler.com/pubs/cvpr20…

16. Deep Convolutional Inverse Graphics Network (DCIGN)

Deep convolutional inverse graphics networks (DCIGN) have a somewhat misleading name, as they are actually VAEs that use a CNN as the encoder and a DN as the decoder.

These networks try to model “features” as probabilities in the encoding, so that they can learn to produce a picture of a cat and a dog together after having seen only separate pictures of each. Similarly, you could feed one a picture of a cat with your neighbor’s annoying dog next to it and ask it to remove the dog. Demonstrations have shown that these networks can also learn to model complex transformations of images, such as changing the light source or rotating a 3D object. They are usually trained with backpropagation.

Kulkarni, Tejas D., et al. “Deep Convolutional Inverse Graphics Network.” Advances in Neural Information Processing Systems. 2015. arxiv.org/pdf/1503.03…

17. Generative Adversarial Network (GAN)

Generative adversarial networks (GAN) come from a different breed of networks: they are twins, two networks working together.

A GAN can consist of any two networks (although usually a combination of FFs and CNNs), one tasked with generating content and the other with judging it.

The discriminating network receives either training data or content produced by the generating network. How well the discriminating network predicts the source of the data is then used as part of the error for the generating network. This creates a form of competition: the discriminator gets better at telling real data from generated data, and the generator learns to become less predictable to the discriminator. This works in part because even quite complex noise-like patterns are eventually predictable, whereas generated content with features similar to the input data is harder to tell apart from it.

Training GANs is challenging because you not only have to train two neural networks (each of which has its own problems), but you also have to keep their dynamics in balance. If either the discriminator or the generator becomes too good compared to the other, the GAN will not converge, because the two are intrinsically pulling apart.

Goodfellow, Ian, et al. “Generative Adversarial Nets.” Advances in Neural Information Processing Systems. 2014. arxiv.org/pdf/1406.26…

18. Recurrent Neural Network (RNN)

Recurrent neural networks (RNN) are feedforward neural networks with a time twist: they are not stateless, because they have connections between passes, that is, connections through time. A neuron receives input not only from the previous layer but also from its own state in the previous pass.

This means that the order in which you feed the input matters: feeding “milk” and then “cookie” may yield different results than feeding “cookie” and then “milk.” One big problem with RNNs is the vanishing (or exploding) gradient problem, where, depending on the activation function used, information is rapidly lost over time, just as very deep FFNNs lose information with depth.
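
The effect of input order can be seen in a minimal NumPy sketch of a recurrent layer. The layer sizes, the random weights, and the two-dimensional one-hot codes for “milk” and “cookie” are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W_in = rng.normal(size=(3, 2))   # input -> hidden weights
W_rec = rng.normal(size=(3, 3))  # hidden -> hidden weights (connection to the previous pass)

def run_rnn(sequence):
    """Feed a sequence one item at a time and return the final hidden state."""
    h = np.zeros(3)
    for x in sequence:
        h = np.tanh(W_in @ x + W_rec @ h)  # new state depends on the input AND the old state
    return h

milk = np.array([1.0, 0.0])
cookie = np.array([0.0, 1.0])

h_milk_first = run_rnn([milk, cookie])
h_cookie_first = run_rnn([cookie, milk])
print(h_milk_first)
print(h_cookie_first)
```

The two final states differ because each pass mixes the current input with the state left behind by the earlier one.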

Intuitively this might not seem like a big deal, because these are just weights and not neuron states, but the weights through time are where the information from the past is stored; if a weight reaches 0 or 1,000,000, the previous state is no longer very informative.
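
A toy calculation shows why extreme weights destroy information over time. The sketch below assumes a single linear recurrent connection with no activation function, which is a simplification of a real RNN; the helper name `carry_through_time` is made up for this example.

```python
def carry_through_time(weight, steps, state=1.0):
    """Multiply a state by the same recurrent weight once per time step."""
    for _ in range(steps):
        state *= weight
    return state

faded = carry_through_time(0.5, 50)     # shrinks toward zero (vanishing)
blown_up = carry_through_time(1.5, 50)  # grows very large (exploding)
print(faded, blown_up)
```

After 50 steps the first value is far below anything useful and the second is in the hundreds of millions, so in both cases the original state can no longer be read off reliably.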

In principle, RNNs can be used in many fields, because most data that has no actual timeline (unlike speech or video) can still be represented as a sequence. An image or a piece of text can be fed in one pixel or one character at a time, so the time-dependent weights describe what came earlier in the sequence, not what happened some number of seconds ago. In general, recurrent networks are a good choice for advancing or completing information, such as autocompletion.

Elman, Jeffrey L. “Finding Structure in Time.” Cognitive Science 14.2 (1990): 179-211. crl.ucsd.edu/~elman/Pape…

19. Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) networks attempt to solve the vanishing/exploding gradient problem by introducing gates and an explicitly defined memory cell.

They are inspired more by circuit design than by any biological memory mechanism. Each neuron has a memory cell and three gates: an input gate, an output gate and a forget gate. These gates protect information by blocking or allowing its flow.

The input gate determines how much information from the previous layer is stored in the current memory cell. The output gate, at the other end, determines how much the next layer gets to know about the state of the current cell. The forget gate may seem strange at first, but sometimes it is useful to forget: if the network is learning a book and a new chapter begins, it may need to forget parts of the previous chapter.
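
The interplay of the three gates can be sketched as a single LSTM time step in NumPy. This follows the commonly used formulation with one stacked weight matrix; the layer sizes, random weights and the name `lstm_step` are assumptions for illustration, not a specific library’s API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x] to four stacked pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0:n])          # input gate: how much new information enters the cell
    f = sigmoid(z[n:2 * n])      # forget gate: how much of the old cell state is kept
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much of the cell the next layer sees
    g = np.tanh(z[3 * n:4 * n])  # candidate values to store
    c = f * c_prev + i * g       # updated memory cell
    h = o * np.tanh(c)           # exposed output
    return h, c

rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden), W, b)
print(h, c)
```

Setting the forget gate close to 1 and the input gate close to 0 leaves the memory cell almost untouched, which is how the network can protect information over many steps.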

LSTMs have been shown to learn complex sequences, such as writing like Shakespeare or composing entirely new music. Note that each gate has its own weight to a memory cell in the previous neuron, so LSTMs typically require more computational resources.

Hochreiter, Sepp, and Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation 9.8 (1997): 1735-1780. Deeplearning.cs.cmu.edu/pdfs/Hochre…

20. Gated Recurrent Unit (GRU)

Gated recurrent units (GRU) are a lightweight variant of LSTMs. They have one gate fewer and are wired slightly differently: instead of an input, an output and a forget gate, they have an update gate and a reset gate.

The update gate determines both how much information from the previous state is kept and how much information from the previous layer is let in. The reset gate functions much like the forget gate of an LSTM, but it sits in a slightly different place. GRUs always send out their full state; they have no output gate. In most cases they behave very similarly to LSTMs, the biggest difference being that GRUs are faster and easier to run (but also slightly less expressive).
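
A single GRU time step can be sketched in NumPy as follows. The sketch uses the convention in which the update gate weights the new candidate (some papers swap the roles of `z` and `1 - z`); the sizes, random weights and the name `gru_step` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step with an update gate z and a reset gate r."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)  # update gate: balance between old state and new candidate
    r = sigmoid(Wr @ hx)  # reset gate: how much of the old state feeds the candidate
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_cand  # the full state is the output (no output gate)

rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3
Wz, Wr, Wh = (rng.normal(scale=0.5, size=(n_hidden, n_hidden + n_in)) for _ in range(3))

h = gru_step(rng.normal(size=n_in), np.zeros(n_hidden), Wz, Wr, Wh)
print(h)
```

Compared with an LSTM step, there is no separate memory cell and one weight matrix fewer, which is where the speed advantage comes from.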

In practice, these advantages and disadvantages tend to cancel each other out: you need a bigger network to regain the expressiveness, which in turn cancels out the performance benefit. In cases where the extra expressiveness is not needed, GRUs can outperform LSTMs.

Chung, Junyoung, et al. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” arXiv preprint arXiv:1412.3555 (2014). Arxiv.org/pdf/1412.35…

21. Neural Turing Machine (NTM)

Neural Turing machines (NTM) can be understood as an abstraction of LSTMs and as an attempt to un-black-box neural networks (to give some insight into what is going on inside them).

Instead of building a memory cell into each neuron, an NTM keeps the memory separate. It attempts to combine the efficiency and permanence of conventional digital storage with the efficiency and expressive power of neural networks. The idea is to have a content-addressable memory bank and a neural network that can read from and write to it. The “Turing” in the name comes from the network being Turing complete: the ability to read, write and change state based on what it reads means it can express anything a universal Turing machine can express.
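
Content-addressable reading can be sketched in a few lines of NumPy. The sketch weights each memory row by a softmax over its cosine similarity to a query key, which is the general idea behind content-based addressing; the function name `content_read`, the `sharpness` parameter and the tiny memory bank are assumptions for this example.

```python
import numpy as np

def content_read(memory, key, sharpness=5.0):
    """Soft read: weight every memory row by its cosine similarity to the key."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = np.exp(sharpness * sims)
    weights /= weights.sum()          # softmax over memory rows
    return weights @ memory, weights  # blended read vector and the attention weights

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

read_vec, weights = content_read(memory, key=np.array([0.0, 0.9, 0.1]))
print(weights)
print(read_vec)
```

Because the read is a smooth weighting rather than a hard lookup, the whole operation stays differentiable and the controller network can be trained with backpropagation.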

Graves, Alex, Greg Wayne, and Ivo Danihelka. “Neural Turing Machines.” arXiv preprint arXiv:1410.5401 (2014). arxiv.org/pdf/1410.54…

22. BiRNN, BiLSTM, BiGRU

Bidirectional recurrent neural networks (BiRNN), bidirectional long short-term memory networks (BiLSTM) and bidirectional gated recurrent units (BiGRU) are not shown on the chart because they look exactly the same as their unidirectional counterparts.

The difference is that these networks are connected not only to the past but also to the future. For example, a unidirectional LSTM might be trained to predict the word “fish” by being fed the letters one by one, with the recurrent connections through time remembering the previous values. A BiLSTM is also fed the next letter in the sequence on the backward pass, which gives it access to future information. This trains the network to fill in gaps rather than to extend a sequence: instead of expanding an image beyond its border, it can fill a hole in the middle of the image.
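
The idea of combining a pass over the past with a pass over the future can be sketched with two plain recurrent layers in NumPy. The layer sizes, random weights and function names are assumptions for illustration; a BiLSTM or BiGRU would replace the simple `tanh` update with gated cells.

```python
import numpy as np

def rnn_states(sequence, W_in, W_rec):
    """Run a simple recurrent layer and return the hidden state at every position."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x in sequence:
        h = np.tanh(W_in @ x + W_rec @ h)
        states.append(h)
    return np.stack(states)

def birnn_states(sequence, fwd, bwd):
    """Concatenate a left-to-right pass with a right-to-left pass at each position."""
    forward = rnn_states(sequence, *fwd)
    backward = rnn_states(sequence[::-1], *bwd)[::-1]  # re-align to the original order
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
fwd = (rng.normal(scale=0.5, size=(3, 2)), rng.normal(scale=0.5, size=(3, 3)))
bwd = (rng.normal(scale=0.5, size=(3, 2)), rng.normal(scale=0.5, size=(3, 3)))
sequence = rng.normal(size=(5, 2))

states = birnn_states(sequence, fwd, bwd)
print(states.shape)  # (5, 6)
```

At every position the first three values summarise what came before and the last three summarise what comes after, which is what lets the network fill in a gap from both sides.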

Schuster, Mike, and Kuldip K. Paliwal. “Bidirectional Recurrent Neural Networks.” IEEE Transactions on Signal Processing 45.11 (1997): 2673-2681. www.di.ufpe.br/~fnj/RNA/bi…

23. Deep Residual Network (DRN)

Deep residual networks (DRN) are very deep FFNNs with extra connections that pass information from one layer not only to the next layer but also to a later one (usually two to five layers further on).

Rather than trying to find a mapping from the input to the output, the network is forced to learn a mapping from the input to the output plus the input. In essence, it adds an identity function to the result, carrying the earlier input over as fresh input to a later layer. These networks have been shown to learn well even at depths of around 150 layers, far more than the two to five layers one would normally expect to train. However, there is evidence that these networks are in essence just RNNs without an explicit time structure, and they are often compared to LSTMs without gates.
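
The “output plus input” idea fits in a few lines of NumPy. The block below is a minimal sketch with two small dense layers and a ReLU; real residual networks typically use convolutions and normalisation, and the name `residual_block` and the toy weights are assumptions for this example.

```python
import numpy as np

def residual_block(x, W1, W2):
    """Compute F(x) + x, where F is two dense layers with a ReLU in between."""
    hidden = np.maximum(0.0, W1 @ x)
    return W2 @ hidden + x  # skip connection: the input is added back to the result

x = np.array([1.0, -2.0, 0.5])
zeros = np.zeros((3, 3))

# With all-zero weights the block reduces to the identity function.
print(residual_block(x, zeros, zeros))  # same values as x
```

Because a block that has learned nothing still passes its input through unchanged, stacking many such blocks does not degrade the signal the way stacking plain layers can.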

He, Kaiming, et al. “Deep Residual Learning for Image Recognition.” arXiv preprint arXiv:1512.03385 (2015). arxiv.org/pdf/1512.03…

24. Echo State Network (ESN)

Echo state networks (ESN) are yet another type of (recurrent) network.

They differ in that the connections between neurons are random (there is no neat organization into layers) and in how they are trained. Instead of feeding in data and backpropagating the error, you feed in the input, forward it, update the neurons for a while, and observe the output over time. The input and output layers play slightly unconventional roles here: the input layer primes the network, and the output layer acts as an observer of the activation patterns that unfold over time. During training, only the connections between the observer and the hidden units are changed.
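
A minimal NumPy sketch of this training scheme is shown below. It assumes a small random reservoir scaled to a spectral radius of 0.9, a sine wave as toy data, and an ordinary least-squares readout; all sizes, names and constants are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res = 50
W_in = rng.uniform(-0.5, 0.5, size=n_res)            # random input weights, never trained
W_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))  # random recurrent weights, never trained
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))  # scale spectral radius to 0.9

def run_reservoir(inputs):
    """Feed the inputs forward and record the reservoir state at every step."""
    h = np.zeros(n_res)
    states = []
    for u in inputs:
        h = np.tanh(W_in * u + W_res @ h)
        states.append(h)
    return np.array(states)

signal = np.sin(np.arange(300) * 0.2)
states = run_reservoir(signal[:-1])  # the input primes the network
targets = signal[1:]                 # task: predict the next value

washout = 50  # ignore the first steps while the reservoir settles
W_out, *_ = np.linalg.lstsq(states[washout:], targets[washout:], rcond=None)

pred = states[washout:] @ W_out
mse = float(np.mean((pred - targets[washout:]) ** 2))
print(mse)
```

Only `W_out`, the observer’s weights, is fitted; the random input and reservoir weights stay exactly as they were initialised.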

Jaeger, Herbert, and Harald Haas. “Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication.” Science 304.5667 (2004): 78-80. Pdfs.semanticscholar.org/8922/17bb82…

25. Extreme Learning Machine (ELM)

Extreme learning machines (ELM) are essentially FFNNs with random connections.

They look very similar to LSMs and ESNs, but they are neither recurrent nor spiking. They also do not use backpropagation. Instead, they start with random weights and train the weights in a single step with a least-squares fit (the lowest error across all functions). This makes ELMs less expressive, but also much faster to train than networks that use backpropagation.
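
The single-step training can be sketched in NumPy as follows. The sketch assumes one random hidden layer with `tanh` units and fits only the output weights with `np.linalg.lstsq`; the toy regression task (a sine curve) and all sizes are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: learn y = sin(x) on [-3, 3].
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X[:, 0])

n_hidden = 40
W = rng.normal(size=(1, n_hidden))  # random hidden weights, never trained
b = rng.normal(size=n_hidden)       # random hidden biases, never trained

H = np.tanh(X @ W + b)                         # hidden-layer activations
beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # one least-squares fit of the output weights

mse = float(np.mean((H @ beta - y) ** 2))
print(mse)
```

There is no training loop at all: the only learned parameters are the output weights `beta`, obtained in one linear solve.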

Cambria, Erik, et al. “Extreme Learning Machines [Trends & Controversies].” IEEE Intelligent Systems 28.6 (2013): 30-59. www.ntu.edu.sg/home/egbhua…

26. Liquid State Machine (LSM)

Liquid state machines (LSM) look very similar to ESNs.

The difference is that LSMs are a type of spiking neural network: sigmoid activations are replaced by threshold functions, and each neuron is also an accumulating memory cell. When a neuron is updated, its value is not set to the sum of its neighbors’ outputs but is added to its own running value. Once the threshold is reached, the neuron releases its energy to other neurons. This creates a spike-like pattern in which nothing happens for a while until a threshold is suddenly reached.
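
The accumulate-then-fire behaviour of a single neuron can be sketched in plain Python. This is a simplified integrate-and-fire unit with no leak and a reset to zero after each spike; both simplifications, and the name `integrate_and_fire`, are assumptions for this illustration.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Accumulate incoming values; emit a spike (1) and reset when the threshold is reached."""
    potential = 0.0
    spikes = []
    for current in inputs:
        potential += current  # the neuron adds to its own running value
        if potential >= threshold:
            spikes.append(1)  # fire and release the stored energy
            potential = 0.0
        else:
            spikes.append(0)  # stay silent until the threshold is reached
    return spikes

print(integrate_and_fire([0.4, 0.4, 0.4, 0.1, 0.2, 0.9]))  # [0, 0, 1, 0, 0, 1]
```

The output is silent for several steps and then spikes, which is the pulse-like pattern described above.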

Maass, Wolfgang, Thomas Natschläger, and Henry Markram. “Real-Time Computing without Stable States: A New Framework for Neural Computation Based on Perturbations.” Neural Computation 14.11 (2002): 2531-2560. Web.archive.org/web/2012022…

27. Support Vector Machine (SVM)

Support vector machines (SVM) find optimal solutions to classification problems.

Traditionally, they could only deal with linearly separable data: say, deciding which picture shows Garfield and which shows Snoopy, with no other outcome possible.

Training an SVM can be pictured as follows: plot all the data (Garfields and Snoopys) on a 2D graph, then find a line that separates the two kinds of data points, so that all Snoopys lie on one side and all Garfields on the other. Then move the line to the position that maximizes the distance between the line and the nearest data points on both sides. To classify new data, plot the new point on the graph and check which side of the line it falls on (the Snoopy side or the Garfield side).
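
The procedure can be sketched for the linear, two-dimensional case in NumPy. The sketch below uses plain subgradient descent on the regularised hinge loss, which is one simple way to approximate the maximum-margin line (production SVM solvers use more refined optimisers); the toy coordinates, labels and hyperparameters are made up for this example.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the regularised hinge loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        inside = margins < 1  # points that are misclassified or too close to the line
        grad_w = lam * w - (y[inside, None] * X[inside]).sum(axis=0) / len(X)
        grad_b = -y[inside].sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: label -1 stands for "Garfield", +1 for "Snoopy".
X = np.array([[-2.0, -1.0], [-1.5, -2.0], [-1.0, -1.5],
              [1.0, 1.5], [2.0, 1.0], [1.5, 2.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

w, b = train_linear_svm(X, y)

def classify(point):
    """Return +1 or -1 depending on which side of the line the point falls."""
    return int(np.sign(np.asarray(point) @ w + b))

print(classify([3.0, 3.0]), classify([-3.0, -2.0]))
```

The learned line is `w @ x + b = 0`; classifying a new point only requires checking the sign of `w @ x + b`.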

With the kernel trick, SVMs can be taught to classify n-dimensional data. This amounts to plotting the points in, say, 3D space, which lets the SVM distinguish between Snoopy, Garfield and Simon, or in even higher dimensions to separate even more cartoon characters. SVMs are not always regarded as neural networks.

Cortes, Corinna, and Vladimir Vapnik. “Support-Vector Networks.” Machine Learning 20.3 (1995): 273-297. image.diku.dk/imagecanon/…

28. Kohonen Network (KN)

Finally, a look at Kohonen networks (KN), also known as self-organising (feature) maps (SOM, SOFM).

KNs use competitive learning to classify data without supervision. The network is given an input and assesses which of its neurons matches that input best. That neuron is then adjusted to match the input even better, dragging its neighboring neurons along with it. How far a neighbor moves depends on its distance from the best matching unit. KNs are also sometimes not considered neural networks.
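
One update of this competitive learning rule can be sketched in NumPy. The sketch assumes a one-dimensional chain of four units and a Gaussian neighbourhood function; the learning rate, radius, toy weights and the name `som_step` are choices made for this illustration.

```python
import numpy as np

def som_step(weights, x, lr=0.5, radius=1.0):
    """One update of a 1-D self-organising map. weights has shape (n_units, dim)."""
    # Best matching unit: the neuron whose weight vector is closest to the input.
    bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Neighbours are pulled along less strongly the further they sit from the BMU on the grid.
    grid_dist = np.abs(np.arange(len(weights)) - bmu)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    return weights + lr * influence[:, None] * (x - weights), bmu

weights = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
x = np.array([1.2, 0.9])

new_weights, bmu = som_step(weights, x)
print(bmu)          # 1
print(new_weights)
```

The best matching unit moves halfway toward the input, while the units further along the chain move by progressively smaller fractions.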

Kohonen, Teuvo. “Self-Organized Formation of Topologically Correct Feature Maps.” Biological Cybernetics 43.1 (1982): 59-69. Cioslab.vcu.edu/alg/Visuali…

Original author: AI100. Source: https://www.toutiao.com/i6432188985530909186