Original text: theaisummer.com/Deep-Learni…
By Sergios Karagiannakos
Because the article is fairly long, it has been split into two parts.
Link to previous article:
A Brief Review of Deep Learning Algorithms (I)
The contents of this article are as follows:
- What is deep learning?
- Neural Networks
- Feedforward Neural Networks (FNN)
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN)
- Recursive Neural Networks
- AutoEncoders
- Deep Belief Networks and Restricted Boltzmann Machines
- Generative Adversarial Networks
- Transformers
- Graph Neural Networks
- Natural language Processing based on deep learning
- Word Embedding
- Sequence Modeling
- Computer vision based on deep learning
- Localization and Object Detection
- Single Shot Detectors (SSD)
- Semantic Segmentation
- Pose Estimation
The previous article covered the first six sections, from the definition of deep learning through neural networks, feedforward neural networks, convolutional neural networks, recurrent neural networks and recursive neural networks. This article introduces the remaining algorithms and the two major application areas.
7. AutoEncoders
Autoencoders [11] are usually applied as unsupervised algorithms, mainly for dimensionality reduction and compression. The trick is to make the output equal the input; in other words, the network tries to reconstruct its own input.
An autoencoder consists of an encoder and a decoder. The encoder takes an input and encodes it as a vector in a low-dimensional hidden space; the decoder is responsible for decoding that vector back into the original input. The structure is shown in the figure below:
From the figure above, we can see that a lower-dimensional feature representation of the input can be read from the middle of the network (the code in the figure); this is exactly the dimensionality reduction and compression step.
In addition, the same idea can be used to reproduce slightly different, or even improved, versions of the input data, which is useful for training data augmentation, data denoising, and so on.
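To make the encode-then-reconstruct idea concrete, here is a minimal numpy sketch of an autoencoder's forward pass. The weights are random rather than learned, and all names and dimensions are illustrative; a real model would train `W_enc` and `W_dec` by minimizing the reconstruction error with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions: 784-dim input (e.g. a flattened 28x28 image),
# compressed to a 32-dim code in the hidden "bottleneck".
input_dim, code_dim = 784, 32

# Randomly initialized weights, for illustration only.
W_enc = rng.normal(0, 0.01, (code_dim, input_dim))
W_dec = rng.normal(0, 0.01, (input_dim, code_dim))

def encode(x):
    return sigmoid(W_enc @ x)      # low-dimensional code

def decode(code):
    return sigmoid(W_dec @ code)   # reconstruction of the input

x = rng.random(input_dim)          # a fake "image"
code = encode(x)
x_hat = decode(code)

# The training objective: make x_hat match x.
reconstruction_error = np.mean((x - x_hat) ** 2)
print(code.shape, x_hat.shape, reconstruction_error)
```

The 32-dimensional `code` is the compressed representation described above.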
8. Deep Belief Networks and Restricted Boltzmann Machines
Restricted Boltzmann machines [12] are stochastic neural networks with generative capability; that is, they can learn a probability distribution over their inputs. Compared with other networks, their biggest characteristic is that they have only visible (input) and hidden layers, with no output layer.
In the forward pass of training, an input is passed in and a corresponding feature representation is generated; in the backward pass, the original input is reconstructed from this feature representation (the process is very similar to an autoencoder, but realized in a single network). The network structure is shown in the figure below:
Multiple restricted Boltzmann machines (RBMs) can be stacked together to form a deep belief network [13]. They look very similar to fully connected layers, but differ in how they are trained: a deep belief network is trained greedily, one pair of layers at a time, following the RBM training procedure.
Recently, however, deep belief networks and restricted Boltzmann machines have been used less and less because of the emergence of generative adversarial networks (GANs) and variational autoencoders.
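The forward feature extraction and backward reconstruction described above can be sketched with one step of contrastive divergence (CD-1), the standard RBM training rule. This is a minimal sketch with toy sizes; variable names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible bias
b_h = np.zeros(n_hidden)    # hidden bias
lr = 0.1

v0 = rng.integers(0, 2, n_visible).astype(float)  # a binary input

# Forward pass: infer the hidden feature representation.
p_h0 = sigmoid(v0 @ W + b_h)
h0 = (rng.random(n_hidden) < p_h0).astype(float)  # sample hidden units

# Backward pass: reconstruct the visible units from the features.
p_v1 = sigmoid(h0 @ W.T + b_v)
p_h1 = sigmoid(p_v1 @ W + b_h)

# CD-1 update: push the model statistics toward the data statistics.
W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
b_v += lr * (v0 - p_v1)
b_h += lr * (p_h0 - p_h1)
```

Repeating this loop over many inputs makes the reconstructions `p_v1` increasingly faithful to the data distribution.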
9. Generative Adversarial Networks
Generative adversarial networks [14] were proposed by Ian Goodfellow in 2014, based on a simple but elegant idea: if you wanted to generate image data, how would you do it?
Create two models: train the first to generate fake data (the generator), train the second to distinguish real data from fake (the discriminator), and train them together in competition with each other.
As training progresses, the generator becomes better and better at producing image data; its ultimate goal is to fool the discriminator. The discriminator becomes better and better at telling real data from fake; its ultimate goal is not to be fooled. The end result is a generator that produces remarkably realistic fake data. The network structure is shown below:
Applications of generative adversarial networks include video games, astronomical images, fashion and more. Basically, anywhere there is image data, GANs can potentially be used; the famous Deep Fakes, for example, are produced with generative adversarial networks.
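The competition between the two networks can be written as a single minimax objective, as in the original GAN paper, where G is the generator, D is the discriminator, and z is the noise input to G:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)]
  + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
```

The discriminator D maximizes this value (classifying real and fake correctly), while the generator G minimizes it (trying to fool D).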
10. Transformers
Transformers [15] are also relatively new; they are mainly applied to language tasks and are gradually replacing recurrent neural networks. They are built on the attention mechanism, which forces the network to focus on particular data points.
Instead of using complex LSTM units, the attention mechanism weights different parts of the input according to their importance. The attention mechanism [16] is essentially a weighting layer whose purpose is to prioritize specific parts of the input by adjusting their weights, while down-weighting the parts that are less important.
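The weighting idea can be sketched with scaled dot-product attention, the form used inside Transformers. This is a minimal numpy sketch with toy sizes; in a real Transformer, Q, K and V are learned linear projections of the token representations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how well its key matches the query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # query-key similarities
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

# 4 tokens with 8-dimensional representations (toy sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(X, X, X)  # self-attention
```

Each row of `weights` says how much each token attends to every other token; the output mixes the value vectors accordingly.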
Transformers consist of several stacked encoders (the encoding layers), several stacked decoders (the decoding layers), and many attention layers (self-attention and encoder-decoder attention), as shown below:
Transformers are mainly used for ordered sequence data, such as natural language processing tasks including machine translation and text summarization. At present, BERT and GPT-2 are two of the best pre-trained language systems, used across many natural language processing tasks; both are based on Transformers.
11. Graph Neural Networks
Generally speaking, unstructured data is not a good fit for deep learning algorithms. Yet in many real-world applications the data is unstructured but can be organized in graph form: social networks, chemical compounds, knowledge graphs, spatial data, and so on.
The goal of graph neural networks [17] is to model graph data, that is, to capture the relationships between nodes in a graph and produce a numerical representation of it, similar to an embedding vector. That representation can then be fed into other machine learning models for all kinds of tasks, such as clustering, classification, and so on.
Natural language processing based on deep learning
Word Embedding
Word embeddings capture the semantic and syntactic similarities between words by converting words into numerical vector representations. This is necessary because neural networks only accept numeric data, so words and text must be encoded as numbers.
- Word2Vec [18] is the most commonly used method. It tries to learn an embedding vector by predicting the current word from its context (CBOW) or the context from the current word (skip-gram). Word2Vec is in fact a two-layer neural network whose inputs and outputs are words, fed in using one-hot encoding. In CBOW the input is the neighbouring words and the output is the target word, while in skip-gram the inputs and outputs are reversed: the input is the word and the outputs are its context words.
- GloVe [19] is another model that extends Word2Vec with matrix factorization methods such as Latent Semantic Analysis, which have been shown to work well for global text statistics but cannot capture local context information. Combining Word2Vec with matrix factorization exploits the advantages of each.
- FastText [20] is an algorithm proposed by Facebook that uses character-level representations instead of whole words.
- Contextual word embeddings replace Word2Vec with a recurrent neural network used to predict the next word in a sequence. This approach captures long-term dependencies between words, and each vector contains information about both the current word and the words before it. The most famous version is ELMo [21], a two-layer bidirectional LSTM network.
- Attention mechanisms [22] and Transformers, as introduced earlier, are gradually replacing RNNs; they give weight to the most relevant words and fade out the unimportant ones.
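Whichever method produces them, embeddings make word similarity measurable: related words end up with nearby vectors. A minimal sketch with hypothetical 4-dimensional vectors (real models like Word2Vec learn hundreds of dimensions, and these particular numbers are invented for illustration):

```python
import numpy as np

# Hypothetical toy embeddings; real ones are learned from text.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.8, 0.9, 0.2, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    """Similarity of direction between two vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_royal, sim_fruit)  # related words score higher
```

Cosine similarity is the standard way to compare embedding vectors because it ignores vector length and compares only direction.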
Sequence Modeling
Sequence modeling is an integral part of natural language processing, appearing in many common applications such as machine translation [23], speech recognition, autocompletion and sentiment classification. Sequence models process sequential inputs, such as all the words of a document.
For example, suppose you want to translate a sentence from English into French.
To carry out this translation you need a sequence-to-sequence model (seq2seq) [24]. A seq2seq model consists of an encoder and a decoder. The encoder takes a sequence (the English sentence in this example) as input and produces a representation of it in a hidden space; that representation is fed to the decoder, which outputs a new sequence (the French sentence).
The most common encoder and decoder architectures are recurrent neural networks (mostly LSTMs), because they are very good at capturing long-term dependencies, while Transformer-based models are faster and easier to parallelize. Sometimes convolutional neural networks are also used to improve accuracy.
BERT [25] and GPT-2 [26] are considered two of the best language models at present; they are in fact sequence models based on Transformers.
Computer vision based on deep learning
Localization and Object Detection
Image localization [27] refers to the task of locating an object in an image and marking it with a bounding box; object detection additionally includes classifying the object.
These related tasks are handled by a foundational model (and its upgraded versions) called R-CNN. R-CNN and its successors Fast R-CNN and Faster R-CNN use region proposals together with convolutional neural networks.
In the case of Faster R-CNN, a region proposal module outputs candidate regions, in the form of fixed-size bounding boxes, that may contain target objects. These bounding boxes are then classified and refined by a CNN (such as AlexNet), which determines whether a region contains an object, what class the object belongs to, and corrects the dimensions of the bounding box.
Single Shot Detectors (SSD)
Single-shot detectors, and their most famous example YOLO (You Only Look Once) [28], drop the idea of region proposals entirely; instead they use a set of predefined bounding boxes.
These boxes are passed to a CNN, which predicts a confidence score for each one while detecting and classifying the objects inside them. In the end, for each object only the bounding box with the highest score is retained.
Over the years, several upgraded versions of YOLO have been released (YOLOv2, YOLO9000 and YOLOv3), each improving speed and accuracy.
Semantic Segmentation
A fundamental task in computer vision is to classify every pixel of an image according to its context; this is semantic segmentation [29]. The two most commonly used models in this area are Fully Convolutional Networks (FCNs) and U-Nets.
- **Fully Convolutional Networks (FCN)** use an encoder-decoder structure, i.e. a network containing convolutions and deconvolutions. The encoder downsamples the input image to capture semantic and contextual information, while the decoder upsamples it again to recover spatial information. In this way the image's context can be recovered with low time and space complexity.
- **U-Nets** are based on a distinctive idea: skip connections. Their encoder and decoder are of the same size, and skip connections pass information from the early layers to the final ones, increasing the resolution of the final output.
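A skip connection in this style can be sketched as "upsample the deep decoder features, then concatenate the matching encoder features along the channel axis". This is a minimal numpy sketch with invented toy shapes; a real U-Net uses learned upsampling and convolutions.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (channels, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
# Deep decoder features: many channels, low spatial resolution.
decoder_feat = rng.normal(size=(32, 4, 4))
# Encoder features saved earlier: fewer channels, higher resolution.
encoder_feat = rng.normal(size=(16, 8, 8))

# The skip connection: upsample, then concatenate channels so that
# fine spatial detail from the early layers reaches the output.
fused = np.concatenate([upsample2x(decoder_feat), encoder_feat], axis=0)
print(fused.shape)  # → (48, 8, 8)
```

The concatenated map carries both coarse semantic information (from the decoder) and fine spatial detail (from the encoder).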
Pose Estimation
Pose estimation [30] is the task of locating a person's key points in an image or video, in either 2D or 3D. In 2D we estimate the coordinates (x, y) of each keypoint, while in 3D the coordinates are (x, y, z).
PoseNet [31] is the most commonly used model in this field, and it too is based on a convolutional neural network. The image is fed into the CNN, and then a single-pose or multi-pose algorithm is used to detect poses. Each pose gets a confidence score and a set of keypoint coordinates, and finally only the poses with the highest scores are retained.
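The keypoint-plus-confidence output can be sketched as reading off the peak of one heatmap per keypoint, a common design in CNN-based pose estimators (the heatmaps here are hand-built toys, not real network output):

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps):
    """For each keypoint heatmap, return (x, y) of the peak and its score."""
    results = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        results.append((int(x), int(y), float(hm[y, x])))
    return results

# Two toy 5x5 heatmaps, one per keypoint (e.g. left eye, right eye).
heatmaps = np.zeros((2, 5, 5))
heatmaps[0, 1, 2] = 0.9   # peak at x=2, y=1
heatmaps[1, 3, 4] = 0.7   # peak at x=4, y=3

print(keypoints_from_heatmaps(heatmaps))  # → [(2, 1, 0.9), (4, 3, 0.7)]
```

Each tuple is a keypoint location with its confidence score, matching the description above.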
Conclusion
That is all: a very brief introduction to some of the algorithms commonly used in deep learning, including convolutional neural networks, recurrent neural networks, autoencoders, generative adversarial networks and the more recently developed Transformers, as well as their main applications in natural language processing and computer vision.
Of course, this article is only a very light popularization of these algorithms and application areas. If you want to go deeper, you can follow the reference links, which describe each specific model in more detail.
References
- en.wikipedia.org/wiki/Deep_l…
- karpathy.github.io/neuralnets/
- brilliant.org/wiki/backpr…
- ruder.io/optimizing-…
- theaisummer.com/Neural_Netw…
- theaisummer.com/Neural_Netw…
- theaisummer.com/Self_drivin…
- theaisummer.com/Sign-Langua…
- www.coursera.org/lecture/nlp…
- theaisummer.com/Bitcon_pred…
- theaisummer.com/Autoencoder…
- towardsdatascience.com/restricted-…
- deeplearning.net/tutorial/DB…
- theaisummer.com/Generative_…
- ai.googleblog.com/2017/08/tra…
- lilianweng.github.io/lil-log/201…
- theaisummer.com/Graph_Neura…
- pathmind.com/wiki/word2v…
- medium.com/@jonathan_h…
- research.fb.com/blog/2016/0…
- allennlp.org/elmo
- blog.floydhub.com/attention-m…
- www.tensorflow.org/tutorials/t…
- blog.keras.io/a-ten-minut…
- github.com/google-rese…
- openai.com/blog/better…
- theaisummer.com/Localizatio…
- theaisummer.com/YOLO/
- theaisummer.com/Semantic_Se…
- theaisummer.com/Human-Pose-…
- github.com/tensorflow/…
- www.fritz.ai/pose-estima…