4. Classic introductory demo: handwritten digit recognition (MNIST)
While the usual starting point for programming is the Hello World program, the starting point for deep learning is MNIST, a program that recognizes handwritten numbers in 28-by-28-pixel images.
MNIST data and official website:
yann.lecun.com/exdb/mnist/
Deep learning involves a fair amount of mathematics behind the scenes. As a beginner, limited by my own mathematical and technical level, I may not be able to explain the relevant theory accurately, so this article focuses on the "applied" side and does not go into the mathematical principles behind it. Thanks for understanding.
- Load the data
The first step of the program is, of course, to load the data. The dataset obtained earlier consists of two parts: a training set of 60,000 examples (mnist.train) and a test set of 10,000 examples (mnist.test). Each row is a 28×28 image flattened into an array of 784 values; in essence, the 28×28-pixel picture is converted into the corresponding pixel lattice.
For example, the matrix corresponding to an image of a handwritten 1 looks like this:
We have often heard that deep learning on images requires a lot of computing power, usually on expensive, professional GPUs (Nvidia GPUs), and this example already hints at why. A 784-pixel image gives the learning model 784 features, while real photos and images often have hundreds of thousands or millions of pixels, so the number of basic features is of that order of magnitude. Large-scale computation over arrays of that size is very hard to do without strong computing power. This introductory MNIST demo, of course, still runs fairly quickly.
Key code in the demo (reading and loading the data into array objects for later use):
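A minimal sketch of this loading step, assuming the official tutorial's input_data helper is available (in TensorFlow 1.x it ships under tensorflow.examples.tutorials.mnist; in older setups it is a local script next to the demo):

```python
# Sketch of the data-loading step, following the official MNIST tutorial.
# read_data_sets downloads the four MNIST files into "MNIST_data/" if they are
# missing, flattens each 28x28 image into a 784-element float array, and (with
# one_hot=True) encodes each label as a 10-element one-hot vector.
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

print(mnist.train.images.shape)   # (number of training images, 784)
print(mnist.test.labels.shape)    # (number of test images, 10)
```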
- Build the model
Each MNIST image represents a digit from 0 to 9, and what the model ultimately produces is: given a picture, the probability that it represents each digit. For example, the model might judge that a picture of the digit 9 has an 80% probability of being a 9, a 5% probability of being an 8 (because both have a small circle in the upper half), and even lower probabilities of being the other digits.
MNIST’s introductory example uses Softmax Regression, a model that can be used to assign probabilities to different objects.
To get the evidence that a given image belongs to a particular digit class, we take a weighted sum of the image's 784 features (the individual pixel values in the lattice). If a feature (pixel value) is strong evidence that the image does not belong to the class, its weight is negative; conversely, if a feature is favorable evidence that the image does belong to the class, its weight is positive. Similar to the house-price estimation example mentioned earlier, a weight is assigned to each pixel.
Suppose we get a picture and need to calculate the probability that it is 8. The mathematical formula is as follows:
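In the official tutorial's notation (i is the digit class, here 8, and j indexes the 784 pixels), the formula is:

$$\text{evidence}_i = \sum_{j} W_{i,j}\,x_j + b_i, \qquad y = \mathrm{softmax}(\text{evidence})$$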
In the formula, i is the digit to be predicted (8), W_{i,j} are the weights of the 784 features when the predicted digit is 8, b_i is the bias for 8, and x holds the values of the 784 features of the picture. Through this calculation we obtain the total evidence that the picture is an 8, and the softmax function converts the evidence into the probability y. (For the mathematical principle of softmax, please consult the relevant references.)
The previous process can be summarized in one chart (from the official source) as follows:
By multiplying and summing different features X and weights corresponding to different numbers, the distribution probability of each number can be obtained. The value with the maximum probability is considered as the prediction result of our picture.
The above process can be written as an equation as follows:
This equation can be expressed very simply with matrix multiplication; it is equivalent to:
Without expanding the specific values inside, it can be simplified as:
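In the official tutorial's notation, the compact matrix form is:

$$y = \mathrm{softmax}(Wx + b)$$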
If you have learned a bit about matrices in linear algebra, you will find that the matrix representation is actually easier to understand in some cases. If you don't remember much about matrices, that's fine; I will add a linear algebra video later.
With all that said, the key code is just four lines:
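These are the four lines as written in the official softmax regression tutorial (TensorFlow 1.x style; the variable names x, W, b, y follow that tutorial):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])   # a batch of flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))          # weights: 784 values per digit class
b = tf.Variable(tf.zeros([10]))               # one bias per digit class
y = tf.nn.softmax(tf.matmul(x, W) + b)        # predicted probability of each digit
```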
The code above only declares placeholder-like variables and defines how the model is computed; in the real training process the source data is read in batches and continuously fed in so that the model computation actually runs. tf.zeros means the variables are first uniformly initialized to 0 as placeholders. The x data is read from the data files, W and b are constantly changed and updated during training, and y is computed from the values above.
- Loss function and optimization settings
To train our model, we first need to define a metric that measures whether the model is good or bad. This metric is called the cost or loss, and we then try to minimize it. Simply put, we want to minimize the value of the loss: the smaller the loss, the closer our model's results are to the true labels.
The loss function used in the demo is "cross-entropy", whose formula is as follows:
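In the official tutorial's notation:

$$H_{y'}(y) = -\sum_i y'_i \log(y_i)$$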
Here y is our predicted probability distribution and y' is the actual distribution (the labels we feed in); cross-entropy measures how inaccurate our predictions are. TensorFlow holds a graph describing each computation unit, i.e. the computation flow of the whole model, so it can automatically use the backpropagation algorithm to determine how our weights and other variables affect the loss value we want to minimize. TensorFlow then applies the optimization algorithm we set to continuously adjust the variables and reduce the loss.
The demo uses the gradient descent algorithm to minimize the cross-entropy with a learning rate of 0.01. Gradient descent is a simple procedure: TensorFlow just moves each variable a little bit in the direction that decreases the loss.
The corresponding key code is as follows:
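A sketch of this setup in TensorFlow 1.x, using the 0.01 learning rate mentioned above (y_ is an assumed name for the placeholder holding the one-hot labels):

```python
y_ = tf.placeholder(tf.float32, [None, 10])   # true labels as one-hot vectors

# Average cross-entropy over the batch, then minimize it with gradient descent
# at the 0.01 learning rate described above.
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
```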
Remarks:
Cross-entropy: colah.github.io/posts/2015-…
Backpropagation: colah.github.io/posts/2015-…
In the code you will also see the concept of a one-hot vector and a variable named accordingly. It is a very simple thing: an array of 10 elements in which exactly one element is 1 and all the others are 0, used to represent the label of a digit. For example, the label for the digit 3 is: [0,0,0,1,0,0,0,0,0,0].
- Training and model accuracy testing
From the previous steps we have set up the computational "flow graph" of the whole model, and it has become part of the TensorFlow graph. Now we can start the training program. The code below loops 500 times, taking a batch of 50 training samples each time.
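A sketch of that loop, assuming the x, y_, train_step, and mnist objects from the earlier snippets and a TensorFlow 1.x session:

```python
sess = tf.InteractiveSession()          # or tf.Session() with explicit sess.run calls
tf.global_variables_initializer().run()

# 500 iterations, each taking a batch of 50 training samples
for i in range(500):
    batch_xs, batch_ys = mnist.train.next_batch(50)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```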
The training process is essentially driven by the TensorFlow framework, with Python feeding batches of data to the underlying library for processing.
I added two lines of code to the official demo to calculate the accuracy of the current model every 50 iterations. They are not necessary; they just make it easier to observe how the recognition accuracy of the whole model gradually changes.
Of course, the variables involved, such as accuracy, need to be defined beforehand:
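A sketch of those definitions (placed before the training loop; the names follow the official tutorial):

```python
# Compare the predicted digit (argmax of y) with the true digit (argmax of y_),
# then take the mean of the resulting 0/1 values to get the accuracy.
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# The periodic check described above can then go inside the training loop, e.g.:
#     if i % 50 == 0:
#         print(i, sess.run(accuracy,
#                           feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
```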
When training is finished, it is time to verify the accuracy of our model, which works the same way as before:
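For example, a sketch using the same accuracy op on the full test set:

```python
# Evaluate the trained model on the 10,000 test images
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
```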
The result on my machine is as follows (the softmax regression example runs relatively fast); the accuracy reached 0.9252:
When we first start running the official demos, we always want to print out the values of the relevant variables to see what format and state they are in. Printing a tensor directly only shows something like Tensor("Equal:0", shape=(?,), dtype=bool); since the graph is built on placeholders, we must feed it some data before it can display real content. The usual way to print, therefore, is to run the tensor with the current input data, e.g. print(sess.run(y, feed_dict={x: batch_xs, y_: batch_ys})).
In general, a recognition accuracy of 92% is actually rather disappointing. The official MNIST tutorial therefore provides several versions with different models; the CNN (convolutional neural network) version, which is better suited to image processing, reaches over 99% accuracy, although its execution time is also considerably longer.
(Note: CNN_mnist.py is the convolutional neural network version; the Weiyun download URL is given below.)
The feed-forward neural network version of MNIST reaches up to 97%:
Data and source code are shared on Weiyun: url.cn/44aZOpP (Note: downloading from the official foreign site is relatively slow; my copy downloads faster. Once the environment is set up, the .py files inside can be run directly.)
5. A demo combined with a business scenario: predicting whether a user is a Super Member
Based on the content above, we are now familiar with the simple three-layer (input, processing, output) softmax neural network model. So, can this model be applied to one of our concrete business scenarios, and how hard is that? To verify this, I took some data from a live-streaming website and ran this experiment.
- Data preparation
I collected user participation data from a movie-ticket campaign on a live-streaming website, including which buttons were clicked, the mobile platform, the IP address, the participation time, and other information. The user's identity is in fact implicit in this data. For example, some gift packages can only be claimed by Super Members, so if a user clicked that button and claimed the package successfully, it proves that the user must be a Super Member. I simply extracted the features from these seemingly unrelated records as our sample data and labeled each record with whether the user is a Super Member.
Sample data format for training is as follows:
The first column is the QQ number, used only for identification. The second column indicates whether the user is a Super Member and serves as the training label. The following columns are the IP address, a platform flag, and the participation records for the campaign (0 means participation failed, 1 means it succeeded). After some transformation and mapping (to shrink the particularly large numbers), we obtain an array of 11 features:
[0.91666666666666, 0.4392156862745098, 0.984313725490196, 0.7411764705882353, 0.2196078431372549, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
The corresponding Super Member label format is as follows, serving as the label for supervised learning:
Super Member: [0, 1]
Non-super Member: [1, 0]
Here is a special note on why the data needs to be converted in practical applications. On the one hand, mapping and transforming the data helps simplify the data model. On the other hand, it avoids the NaN problem: when values are too large, some exponential and division floating-point operations can produce infinity or other overflows, which in Python become NaN values, and these destroy all subsequent calculations and cause the computation to fail.
For example, in the figure below, the feature values were too large, which caused some intermediate parameters to keep growing during training, eventually producing NaN values and destroying all subsequent results:
The cause of NaN is that complex mathematical calculations produce infinity or infinitesimally small values. In our demo, for example, the NaN mainly came from the softmax calculation:
RuntimeWarning: divide by zero encountered in log
When I first started on the actual business application, it was frustrating to keep getting very strange results, and only after several rounds of investigation did I find out it was a NaN problem. After analyzing the problem carefully, though, it is not impossible to troubleshoot: because NaN is a peculiar value, you can use the check NaN != NaN in code to detect whether NaN appears during training.
The key program code is as follows:
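A sketch of such a check, using the fact that NaN is the only value not equal to itself (the variable names here are assumptions, reusing the ops from the earlier snippets, not the original code):

```python
# Run one training step and also fetch the current loss value.
_, loss_value = sess.run([train_step, cross_entropy],
                         feed_dict={x: batch_xs, y_: batch_ys})

# NaN != NaN, so this condition is True only when the loss has become NaN.
if loss_value != loss_value:
    print("NaN loss detected at batch", i)
```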
Using the approach above, I easily found at which batch of data my deep learning program produced NaN. So, for a lot of raw data, we divide by something to make the values smaller. The official MNIST example does the same thing: it divides the 0-255 pixel color values by 255, turning them all into floating-point numbers no greater than 1.
MNIST also scales down the feature data when processing the pixel features of the original images:
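A sketch of that scaling step (similar to what the tutorial's input_data helper does internally; the images array here is an assumed name):

```python
import numpy as np

# Convert 0-255 pixel values into floats in [0.0, 1.0] before training.
images = images.astype(np.float32)
images = np.multiply(images, 1.0 / 255.0)
```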
The NaN problem bothered me for a while, so I point it out here specifically to help beginners avoid this pitfall.
- The execution result
The training set (6,700 records) and test set (1,000 records) I prepared were not much data, but the final prediction accuracy for Super Membership was 87%. Although the accuracy is not high, which may be related to my training data, the whole exercise did not take much time: only two evenings from data preparation and coding to training and the final result.
For example, the model predicts that the first QQ user has an 82% chance of being a non-super member and a 17.9% chance of being a super member (the prediction is accurate).
Through the above example, we can see that for some simple scenarios this can be implemented quite easily.
6. Other models
- CIFAR-10 demo for image classification (official)
Classification on the CIFAR-10 dataset is an open benchmark problem in machine learning. The task is to classify a set of 32×32 RGB images covering 10 categories: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
This is also one of the most important official demos.
More detailed introduction:
www.cs.toronto.edu/~kriz/cifar…
tensorfly.cn/tfdoc/tutor…
This example takes a long time to execute and requires patience.
My execution process and results on the machine:
cifar10_train.py is used for training:
cifar10_eval.py is used to verify the results:
The recognition rate here is not high because the official model itself does not achieve a particularly high recognition rate:
In addition, when I first ran the official example on January 5th there were still some small problems and it could not run (the latest official version may have fixed them). I suggest using the version I put on Weiyun directly (the log path and the file-reading path in the code need to be adjusted).
Source code download: url.cn/44mRzBh
The Weiyun share does not include the image data for the training and test sets, but if the program detects that these images do not exist, it will download them by itself:
- A small demo to see whether the softmax regression model can learn rules that I set myself. I constructed a series of data using random number generation and let the earlier softmax regression model learn from it, to see whether, after learning from the training set, the model could finally predict with 100% accuracy whether a sample's year value is greater than 5. Both the model and the data are quite simple. The data is constructed as follows: each sample has only two feature dimensions, [year, 1], where year is a random value from 0 to 10 and the constant 1 is included as a distractor. If year is greater than 5, the label is set to [0, 0, 1]; otherwise the label is set to [0, 1, 0]. A sketch of this construction follows.
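A sketch of that data construction (the helper name is my own, not from the original code):

```python
import random

def make_sample():
    year = random.randint(0, 10)                    # year takes a random value in 0..10
    features = [year, 1]                            # the constant 1 is only a distractor
    label = [0, 0, 1] if year > 5 else [0, 1, 0]    # one-hot label: is year > 5?
    return features, label

# e.g. 6,000 synthetic training samples, as described below
train_set = [make_sample() for _ in range(6000)]
```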
I generated 6,000 synthetic training samples to train the model, and it finally achieved 100% prediction accuracy:
Weiyun download (source code): url.cn/44mKFNK
- Learning to write classical Chinese poems with an RNN. The very impressive poem-writing demo mentioned at the beginning was made by a researcher in the United States; it can generate classical poems on a given theme, and the quality of the poems is relatively high. So I also tried to run a model that could write classical poems on my own machine, and later found one based on an RNN. The RNN (Recurrent Neural Network) is one of the most commonly used deep learning models. Based on an external demo, I made some adjustments and got a relatively simple program running that learns to write classical poems.
I had the program write ten poems; its output (classical Chinese verse) was as follows:
In addition, here are a few of the generated poems that I personally think are better written (from an earlier run, not in the picture above):
The model is relatively simple, and its poem-writing level is not as good as the American researcher's demo I introduced earlier, but the basic approach should be similar; his is just done in a more sophisticated way.
In addition, this is a general model that can be trained on different content (classical poems, modern poems, Song ci, English poetry, etc.) to generate corresponding results.
7. Introduction to deep learning
- Artificial intelligence and deep learning technologies are not mysterious; they are more like a new kind of tool: feed it data, and it discovers the patterns behind that data and puts them to use for us.
- A mathematical foundation is important for understanding the mathematics behind the models, but for purely practical purposes it is not necessary to have fully mastered the mathematics first; you can start experimenting and learning ahead of that.
- I deeply felt the shortage of computing resources. Every time I adjusted the program parameters or the training data, it often took many hours to get through a training run, and in some scenarios, such as the poem-writing case, you get nowhere without running even more training data. Personally, I think this is an important constraint on AI development: it makes "debugging" a program very inefficient.
- Chinese documentation is scarce, English documentation is also limited, the open source community keeps updating, and documentation goes stale faster and faster. As a result, getting started can be full of problems, with a lack of well-organized documentation to rely on.
I don't know whether the age of artificial intelligence is really coming, or where it is headed, but without doubt it is a new way of thinking about technology. Exploring and learning this new technology, then looking for where it can be combined with our business scenarios, and finally producing better results for our business, has always been a core purpose for us engineers. On the other hand, new technologies that give development a big boost tend to evolve quickly and become widespread, just as programming itself did, so deep learning applications will be something anyone can use rather than just a gimmick.
For details, see www.tensorfly.cn/ and www.tensorflow.org/