By Chen Qiang, Tubi Senior Machine Learning Engineer

Preface

To help the team grow, our Tubi China team recently launched TTT (Tubi Talent Time): a talk every two weeks, which may or may not be related to the business, and may or may not be related to technology. We have excellent engineers in many fields. A previous TTT covered ramdisk, a code base that lets you read and write memory as if it were a hard disk by implementing the disk read/write protocol, which greatly speeds up I/O; a colleague further optimized it so that 15 GB of memory can quickly process hundreds of gigabytes of files over the network. I did some research on deep learning in graduate school, and last week I was lucky enough to be invited to give a talk on it. Here is a written summary, which I hope more people can benefit from.

Outline

The goal of this article is to give you an idea of how deep learning works, along with some practical code showing how it is implemented. The code examples are based on Lua and Torch. Deep learning is one method of machine learning, so to understand it better we first review some important machine-learning concepts, then walk through classification and regression examples, and finally introduce the back-propagation algorithm, which is based on the chain rule. Deep learning resources, including books, videos, and frameworks, are listed at the end.

What is deep learning

What is deep learning? The name doesn't matter

According to Wikipedia, machine learning covers a lot of ground: classification, clustering, regression, reinforcement learning, supervised learning, unsupervised learning, and so on. Deep learning is one machine-learning method and can be used to solve many of these problems. Beyond deep learning, we hear many other terms: artificial intelligence, data mining, pattern recognition, neural networks, and so on. In many cases these terms refer to much the same thing, and the names matter less than knowing what these techniques do and how they work. So the important thing is to understand how deep learning works. Before that, let's summarize how machine learning works.

Machine learning: Four important parts

There are four important parts of machine learning: Digital Input, Digital Output, Criterion and Mapping

For data to be stored, the objects we store have to be digitized: the scenery our eyes see can be saved as an RGB image, and sound can be saved as an MP3 file. The same thing can of course be stored in many other formats, each suited to a different purpose. Similarly, to solve a classification or regression problem, we need to digitize both the object being classified (or regressed on) and the target, so that the mapping can conveniently convert the digital input into the digital output. In this process we also need a standard, the Criterion, to measure how good a mapping is, so that we can choose the best mapping according to that Criterion.

The four parts of machine learning by example: recognizing whether an image is a hot dog

For example, classifying pictures to determine whether they are hot dogs.

  • Digital Input: So that the mapping can work with it, the image is represented in RGB as a 3 x height x width matrix
  • Digital Output: To represent the result, a two-dimensional vector is used: [0, 1] means the image is a hot dog and [1, 0] means it is not
  • Criterion: Compare the mapping's result with the actual result position by position using absolute differences; the smaller the total difference, the better the mapping (a small sketch follows this list)
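To make this concrete, here is a minimal Lua/Torch sketch of the three parts above; the image values, its size, and the pretend mapping output are made up purely for illustration.

    -- Digital Input: a 3 x height x width RGB tensor (a tiny, made-up image,
    -- not used further in this sketch)
    require 'torch'

    local height, width = 4, 4
    local image = torch.rand(3, height, width)

    -- Digital Output: [0, 1] means "hot dog", [1, 0] means "not a hot dog"
    local target = torch.Tensor({0, 1})

    -- Pretend result of some mapping applied to the image
    local prediction = torch.Tensor({0.2, 0.8})

    -- Criterion: position-by-position absolute differences; smaller is better
    local cost = torch.abs(prediction - target):sum()
    print(cost)  -- about 0.4; a perfect mapping would give 0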

The four parts of machine learning by example: with an intelligent person as the mapping, Digital Input, Digital Output, and Criterion need little attention

Of course, the choice of Digital Input, Digital Output, and Criterion depends largely on the design of the mapping. If an intelligent person plays the role of the mapping (some companies have actually done this for CAPTCHA-recognition tasks), we do not need any special digitization of the CAPTCHA: we only need to show the picture on a screen, let a person look at it, read off the letters, and record them somewhere. No special digitization of the output is required either.

The four parts of machine learning in the house-price-prediction example

Digital Input for house-price prediction: a house contains a lot of information, so for simplicity three important features are chosen to represent it: beds (the number of beds), baths (the number of bathrooms), and area (the area of the house). Digital Output: the house price, expressed in dollars. Mapping: the price is predicted as a weighted sum of the feature values, where the weights are unknown; choosing different weights gives different mappings. Criterion: the absolute difference between the mapping's result and the real price measures the quality of the mapping. If this is still unclear, you can also refer to Zhihu's "4 Key Parts of Machine Learning". A small sketch of this mapping and criterion follows.
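A minimal Lua/Torch sketch of this example; the weight values and the example house below are made-up numbers, not real data.

    require 'torch'

    -- Hypothetical weights for beds, baths, and area
    local weights = torch.Tensor({50000, 30000, 1000})

    -- Digital Input: beds, baths, area of one house (made up)
    local house = torch.Tensor({3, 2, 120})

    -- Digital Output: the real price in dollars (made up)
    local actualPrice = 350000

    -- Mapping: weighted sum of the feature values
    local predictedPrice = weights:dot(house)

    -- Criterion: absolute difference between predicted and real price
    local cost = math.abs(predictedPrice - actualPrice)
    print(predictedPrice, cost)  -- 330000   20000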

Cost can be expressed as a function of the weights

As you can see, cost can be expressed as a function of the weights. So how do we adjust those weights to reduce the cost? To answer that, we first simplify the cost function so the idea is easier to see.
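Written out for the house-price example above, with w1, w2, w3 naming the three unknown weights, the cost is:

    cost(w1, w2, w3) = |w1 \times beds + w2 \times baths + w3 \times area - price|

For a fixed house (beds, baths, area) and its known price, the only things left to change are the weights, which is why cost is a function of the weights.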

Simplified expression of cost

Here I simplify cost to a function of w1 alone. Just by looking at the plot, it is easy to see that cost reaches its minimum when w1 is -1. Of course, we could randomly try different values of w1 to find the one that minimizes the cost, but if the range of w1 is large and there are many values to enumerate, brute-force enumeration becomes very time-consuming. To compute this automatically and efficiently, we use the fact that we know the derivative of cost with respect to w1: for any given value of w1 we can evaluate the derivative, and if the derivative is positive, reducing w1 a little will also reduce the cost. So cost can be reduced by following the derivative, as shown in the following example.

Find the right number (weight)

w1 is chosen at random; in this example, w1 = 0.5. When w1 = 0.5, the derivative of cost with respect to w1 is 6, which is positive, so decreasing w1 lowers the cost, and repeating this step by step moves w1 toward the desired value of -1.
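Here is a minimal gradient-descent sketch in Lua. The slides do not give the exact cost function, so this assumes cost(w1) = 2 \times (w1 + 1)^2, which is consistent with the numbers above: the minimum sits at w1 = -1 and the derivative at w1 = 0.5 is 6.

    -- Assumed cost function and its derivative (see the note above)
    local function cost(w1)  return 2 * (w1 + 1) ^ 2 end
    local function dcost(w1) return 4 * (w1 + 1)     end

    local w1 = 0.5            -- the randomly chosen starting point
    local learningRate = 0.1

    for step = 1, 50 do
      -- a positive derivative says "decrease w1"; a negative one says "increase w1"
      w1 = w1 - learningRate * dcost(w1)
    end

    print(w1, cost(w1))  -- w1 approaches -1 and cost approaches 0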

Machine learning optimization: A real example

Machine learning optimization: Real-world examples, network details

Here is a simple code example. Click "Read more" to see the slides, which link to the code, or go straight to the code link: t.cn/RgJAQ6B. Digital Input: a two-dimensional vector containing x1 and x2. Digital Output: a single number, the real result. Mapping: w1 \times x1 + w2 \times x2 + b. Criterion: the square of the difference between the mapping's result and the real result is used to measure the quality of the mapping. In Torch, the forward method of a module (or of a Criterion) computes the mapping's result (or the cost), and the backward method computes the derivatives, storing the results in a dedicated place inside the module. For more details, please refer to "Gradient Descent Method and Derivation" in the Zhihu column.
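Below is a minimal Lua/Torch sketch of this forward/backward pattern; it is a simplified stand-in rather than the published code at the link, and the input and target values are made up.

    require 'nn'

    local model = nn.Linear(2, 1)         -- Mapping: w1 * x1 + w2 * x2 + b
    local criterion = nn.MSECriterion()   -- Criterion: squared difference

    local input  = torch.Tensor({1.0, 2.0})  -- Digital Input: x1, x2
    local target = torch.Tensor({3.0})       -- the real result

    -- forward: compute the mapping's result and then the cost
    local output = model:forward(input)
    local cost   = criterion:forward(output, target)

    -- backward: compute the derivatives; Torch stores them inside each module
    local gradOutput = criterion:backward(output, target)
    model:zeroGradParameters()
    model:backward(input, gradOutput)

    -- use the stored derivatives to nudge the weights and reduce the cost
    model:updateParameters(0.01)
    print(cost)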

Deep learning networks of all kinds

In the previous example we saw that, in order to optimize the unknowns in each module and reduce the cost, we need the derivative of the cost with respect to each module and each unknown, because those derivatives guide us in changing the unknowns so that the cost goes down. Deep learning stacks multiple mappings together and applies them to the input one after another. In theory, a stack of mappings has more expressive power than a single mapping, so after optimization the cost can be driven lower; many current networks reach hundreds or even thousands of layers (a small stacking sketch follows). The slides below then show, with an example, how the chain rule of derivatives helps differentiate the cost with respect to the variable parameters in each mapping.
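Here is a minimal sketch of stacking several mappings with nn.Sequential in Torch; the layer sizes are arbitrary choices for illustration.

    require 'nn'

    local model = nn.Sequential()
    model:add(nn.Linear(3, 16))   -- first mapping
    model:add(nn.ReLU())          -- non-linearity between mappings
    model:add(nn.Linear(16, 16))  -- second mapping
    model:add(nn.ReLU())
    model:add(nn.Linear(16, 1))   -- final mapping to the output

    -- forward runs the whole stack; backward applies the chain rule
    -- through every layer, exactly as in the single-mapping example
    local output = model:forward(torch.rand(3))
    print(output)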

The chain rule, the derivative of cost with respect to the variable in f

The chain rule, the f function

The chain rule, the g function

The chain rule, the k function

The chain rule, the criterion function

There are three mappings, f, g, and k. When computing the derivative of cost with respect to a weight w inside f, the chain rule multiplies together: the derivative of f's output with respect to that weight, the derivative of g's output with respect to its input (that is, with respect to f's output), the derivative of k's output with respect to its input (g's output), and the derivative of cost with respect to y', the output of k.
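In symbols, if the prediction is y' = k(g(f(x))) and w is a weight inside f, the relation the slide illustrates is:

    \partial cost / \partial w = (\partial cost / \partial y') \times (\partial y' / \partial g) \times (\partial g / \partial f) \times (\partial f / \partial w)

where \partial y' / \partial g is shorthand for the derivative of k's output with respect to its input, and similarly for the other factors.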

The chain rule, the derivative of cost with respect to the variable in g

Similarly, when computing the derivative of cost with respect to the variables in g, the derivatives of the downstream modules' outputs with respect to their inputs are used in the same way.

In general, as long as a module (a mapping) provides the derivative of its output with respect to its input, as well as the derivative of its output with respect to the numbers inside the module (call them unknowns, weights, or parameters), it can be plugged into our model, become part of it, and the whole model can be optimized with the chain rule. Click "Read more" to see the slides, which link to the code, or open the code link directly: t.cn/RgJ2bly. The so-called "alchemy" of deep learning is the process of designing and creating deep learning networks: trying different mappings (also called modules) in the network, building different network structures, optimizing them with gradient descent, and observing which structure works best. It is like alchemy, combining different raw materials to see which combination produces an unexpected result. The word "alchemy" was mentioned at a conference and has since become popular in the deep learning community.
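As a sketch of what "providing those derivatives" looks like in practice, here is a toy custom Torch module in the style of the "define your own layer" developer documentation linked below. The module name nn.Double is made up; it has no weights of its own and simply doubles its input.

    require 'nn'

    local Double, parent = torch.class('nn.Double', 'nn.Module')

    function Double:updateOutput(input)
      -- forward: the mapping itself
      self.output = input * 2
      return self.output
    end

    function Double:updateGradInput(input, gradOutput)
      -- backward: derivative of the output with respect to the input,
      -- multiplied by the gradient flowing in from the modules above it
      self.gradInput = gradOutput * 2
      return self.gradInput
    end

    -- once defined, it can be stacked with other building blocks
    local model = nn.Sequential()
    model:add(nn.Linear(3, 3))
    model:add(nn.Double())
    print(model:forward(torch.rand(3)))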

The process of creating a network is also a bit like playing with building blocks, where the small mappings are the blocks. Different ways of stacking them create different network structures. You can also create your own blocks, or combine existing blocks into new ones. Learning deep learning is like gaining experience with these blocks: the more you have stacked, the better you know how to design a network that works well in a given situation.

Learn more

Deep learning materials
  • Books: Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  • Video tutorials:
    • Machine learning basics: if you don't have a foundation in machine learning, Andrew Ng's machine learning course on Coursera is highly recommended; its assignments are very well designed.
    • Deep learning: deep learning has had a huge impact on computer vision, and the best course I have found is Stanford's CS231n. The course home page has excellent homework materials, and the lecture videos can be found on youtube.com. If you work through the homework carefully and do it all yourself, you will gain a great deal and understand at least 80% of what a master's student in computer vision knows about applying deep learning.
  • Code exercises: Lua-based Torch can be used for research (many academic authors have published Torch code), and Python-based PyTorch can be used in industry. Both frameworks are elegantly designed.

Resources

[Video] Lecture 4 | Introduction to Neural Networks, Backpropagation, and Neural Networks t.cn/RgpLEoW

[Slides] Lecture 4: Backpropagation and Neural Networks t.cn/RdSvvmz

Torch | Developer Documentation, Define your own layer t.cn/RgpyUCd

Zhihu: Machine Learning and Mathematics — Banana t.cn/RgpyVSZ