Abstract: In this era where computing power is still a limiting factor, researchers are committed, on the one hand, to studying general-purpose networks for different scenarios and, on the other, to optimizing the way neural networks learn, all in an effort to reduce the computing power that AI requires.

This article is shared from the Huawei Cloud community post "OCR Performance Optimization Series (2): From Neural Network to Silly Putty", original author: HW007.

OCR refers to recognizing printed text in images. Recently we have been optimizing the performance of an OCR model: we rewrote the TensorFlow-based OCR network in CUDA C and ultimately achieved a 5x performance improvement. Through this optimization work I gained a deeper understanding of the general structure of OCR networks and the related optimization methods, and I plan to record this in a series of blog posts as a summary of my recent work and study notes. The first post, "OCR Performance Optimization Series (1): BiLSTM Network Architecture Overview", derived step by step, from the perspective of motivation, how an OCR network based on the Seq2Seq structure is built. In this post we go from neural networks to silly putty.

1. Diving into CNN: a word on the nature of machine learning

Now, let's start from the input in the lower left corner of Figure 1 in "OCR Performance Optimization Series (1)" and walk through the flow of that figure. First, 27 text-fragment images to be recognized are fed in, each of size 32*132. These images are encoded by a CNN network, which outputs 32 initial coding matrices of size 27*384, as shown below:

It is worth noting that the dimension order is adjusted in this step: the input changes from 27*(32*132) to 32*(27*384). You can think of it as stretching and flattening each 32*132 image into one line (1*4224) and then reducing that to 1*384, similar to the example in optimization strategy 1 above, where the amount of computation was reduced from 1024 to 128. How is this dimension reduction from 27*4224 to 27*384 done? The simplest, crudest method is the Y=AX+B model above: just multiply the 27*4224 input by a 4224*384 matrix A trained by feeding it data. Obviously, different A's produce different Y's for the same X, and the drop from 4224 dimensions down to 384 is rather steep. So the scholars resort to the brute-force "human wave" tactic: if one A is not enough, use many. Here 32 A's are used, and this 32 is the sequence length of the LSTM network, proportional to the width of the input image, 132. If the width of the input image grew to 260, 64 A's would be needed.
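To make the shape bookkeeping concrete, here is a minimal NumPy sketch of the Y=AX+B "dimension reducer" just described. The shapes (27 images, 32*132 pixels, 384 output features, 32 A's) come from the text; the matrices are random placeholders rather than trained CNN weights, so this only illustrates the dimension flow.

```python
# A minimal NumPy sketch of the "dimension reducer" Y = AX + B described above.
# Random values stand in for trained weights; only the shapes matter here.
import numpy as np

batch = 27                      # 27 text-fragment images
flat = 32 * 132                 # each 32x132 image flattened to 1x4224
X = np.random.rand(batch, flat)           # 27 x 4224

# One "dimension reducer": a 4224x384 matrix A plus a bias B
A = np.random.rand(flat, 384)
B = np.random.rand(384)
Y = X @ A + B                             # 27 x 384

# "One A is not enough, so use many": 32 reducers give 32 outputs of 27x384,
# which is exactly the 32-step input sequence the LSTM layer expects.
outputs = [X @ np.random.rand(flat, 384) + B for _ in range(32)]
print(Y.shape, len(outputs), outputs[0].shape)   # (27, 384) 32 (27, 384)
```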

Some people may ask: where are these 32 A's you speak of in the CNN? Indeed, they are only my abstraction of what the CNN network does. The point is to give a visual understanding of how the dimensions change during CNN encoding and of the sequence length of the LSTM layer above it, and to see that the LSTM sequence length is really just the number of "dimension reducers". A sharper reader will notice that "dimension reduction" is actually the wrong word: if you put the outputs of the 32 reducers together, you get 32*384=12288, which is much larger than 4224. After passing through the CNN, the dimensionality does not decrease, it increases! Formally this thing is called an "encoder" or "decoder"; "dimension reducer" is a term I made up in this article for the sake of a vivid picture. Whatever you call it, I hope you remember that its essence is just that coefficient matrix A.

Now let's continue with the idea of 32 A's. After the CNN network, the data for each text image has, on the surface, changed from 32*132 to 32*384. The amount of data seems to have grown, but the amount of information has not. It is like me describing OCR at great length here: more text does not necessarily mean more information, though it may make the principles of OCR more readable or interesting. How does the CNN network manage to "keep talking without saying anything new" (adding data without adding information)? Hidden in this is a top-level trick of machine learning modeling called "parameter sharing". A simple example: if a friend happily tells you one day that he has made a new discovery, "one apple costs $5, two apples cost $10, three apples cost $15…", you will probably doubt his IQ; wouldn't it be better to just say "n apples cost 5*n dollars"? The essence of rambling is failing to make an abstract summary and instead piling up examples and redundant data; the essence of not rambling is generalizing rules and experience. Machine learning is like this apple-buying example: you expect the machine to generalize rules and experience simply by being given lots of examples. In the OCR example above, this "rule and experience" corresponds to the network structure and model parameters of the whole model. At present, the structure of mainstream model networks still relies on human intelligence, for example CNN networks for image scenarios and RNN networks for sequence scenarios. If the network structure is not chosen correctly, the learned model will not perform well.

In the analysis above, purely in terms of structure, 32 A matrices of size 4224*384 would fully meet the CNN's requirements on the dimensions of the input and output data, but a model trained that way would perform poorly, because 32 independent 4224*384 A matrices contain 32*4224*384=51,904,512 parameters. That means the model has too much freedom when learning and can easily go astray. In the apple example above, stating the fact in fewer words reflects a stronger ability to summarize, because extra words not only add redundancy but may also drag in irrelevant information that interferes later when the rule is applied: if the price of apples can be stated clearly in ten words and you use twenty, chances are the other ten are about things like the weather or the time of day. This is the famous "Occam's Razor": do not make simple things complicated! In machine learning, "parameter sharing" applies this razor mainly to the design and choice of model structure. In short, if 32 A matrices of 4224*384 have too many parameters and leave the model too free while learning, we add constraints to make it less free, forcing it to explain the apple purchase in ten words or fewer. The method is very simple: for example, require the 32 A's to be exactly the same, so that the number of parameters here is only 4224*384, a 32-fold reduction. If a 32-fold reduction is too harsh, relax it a little and do not require the 32 A's to be identical, only very similar.
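The parameter counts above are easy to re-check, and a toy sketch shows what "sharing one A across all 32 positions" looks like; the random matrix below is only a stand-in for a trained one.

```python
# A back-of-the-envelope check of the parameter counts discussed above,
# plus a toy illustration of sharing one A across all 32 positions.
import numpy as np

n_steps, in_dim, out_dim = 32, 4224, 384

separate = n_steps * in_dim * out_dim     # 32 independent A matrices
shared = in_dim * out_dim                 # one A reused 32 times
print(separate, shared, separate // shared)   # 51904512 1622016 32

# With sharing, the same A is applied at every one of the 32 positions:
X = np.random.rand(27, in_dim)            # one batch of flattened images
A_shared = np.random.rand(in_dim, out_dim)
outputs = [X @ A_shared for _ in range(n_steps)]  # 32 outputs, one set of weights
```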

I call this razor trick "you never know how good a model can be until you force it". One might object: even if I allow you to explain the price of an apple in 20 words, that does not rule out the possibility that you are a highly motivated, extreme person who does it in 10. If a single A matrix can cover the problem, the model should learn exactly the same A whether I give it 32 A's or 64 A's. That is true in theory, but it is not realistic at present.

All models can be abstractly expressed as Y=AX, where X is the model's input, Y is its output, and A is the model. Note that A in this paragraph is different from the A above: here it contains both the structure and the parameters of the model. Training a model means solving for A given X and Y. So the question above becomes: is the solution for A unique? First, let me take a step back and assume that in the real world there is a law behind this problem, i.e. that A exists and is unique in the real world, just like a law of physics. Can our model capture this law from a large amount of training data?

Take the mass-energy equation E=M*C^2 from physics as an example. In this model, the structure is E=M*C^2, and the parameter is the speed of light C. This model, proposed by Einstein, can be called a crystallization of human intelligence. If we used AI to solve this problem today, there would be two cases: strong AI and weak AI. Let's start with the most extreme weak AI approach, which is also the current mainstream AI approach, where most of the intelligence is still human: a person first uses human wisdom to discover that the relationship between E and M follows the pattern E=M*C^2, then feeds a lot of E and M data to the machine and lets it learn the parameter C in the model. In this case the solution for C is unique, and only a small amount of E and M data needs to be fed to the machine to solve for C. Clearly, the intelligent part of the work was Einstein's: ruling out all sorts of factors like time, temperature and humidity, determining that E depends only on M, and that E=M*C^2. In machine learning this part of the work is now called "feature selection", which is why many machine learning engineers refer to themselves as "feature engineers".
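As a small illustration of this "weak AI" case, here is a sketch in which the structure E=M*C^2 is fixed by a human and the machine only fits the single parameter C from (M, E) examples by least squares; the data is synthetic, generated from the known speed of light.

```python
# The "weak AI" case: the structure E = M * C^2 is given by a human,
# and the machine only fits the single parameter C from (M, E) examples.
import numpy as np

C_TRUE = 3.0e8                             # speed of light, m/s
M = np.random.rand(100) * 10.0             # a few mass samples (kg)
E = M * C_TRUE**2                          # energies implied by E = M * C^2

# Least-squares fit of the single unknown c^2 in E = M * c^2
c_squared = np.sum(M * E) / np.sum(M * M)
C_fit = np.sqrt(c_squared)
print(f"fitted C = {C_fit:.3e} m/s")       # ~3.000e+08
```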

In contrast, the expectation for strong AI is that we feed the machine lots of data, such as energy, mass, temperature, volume, time, speed and so on, and the machine tells us that energy depends only on mass, that the relationship is E=M*C^2, and that the value of the constant C is 3.0*10^8 m/s. Here the machine learns not only the parameters of the model but also its structure. The first step toward achieving this is to find a generalized model that can be shaped to describe any structure in the world, just as silly putty can be molded into all kinds of shapes. In the field of AI, this silly putty is the neural network, which is why many theoretical AI books and courses like to start by demonstrating the descriptive power of neural networks, proving that they are the silly putty of AI. Once we have the putty, the next question is how to knead it, and that is where the difficulty lies. Not every real-world problem can be perfectly described by a mathematical model, and even if it could, nobody knows what the model looks like before it is found; so how do you get the machine to knead into shape something whose shape you yourself do not know? The only way is to feed the machine lots of examples, saying "it should be able to walk, to fly", and so on. In fact this problem has no unique solution: the machine may give you a bird, or it may give you a little cockroach. There are two reasons. One is that you cannot feed every possible example to the machine; there are always black swans. The other is that feeding too many examples demands too much computing power, which is also why neural networks, although proposed very early, have only become popular in the last few years.

After the discussion in the paragraph above, I hope you now have a more intuitive understanding of model structure and model parameters in machine learning. We know that if the structure of a model is designed by human intelligence and only the parameters are learned by the machine, then, provided the structure is accurate, we only need to feed the machine very little data, as in the E=M*C^2 case above, and the model may even have a nice analytical solution! But there was only one Einstein, so more often it is ordinary people like us feeding the machine lots of examples, hoping it can knead out shapes we do not even know, let alone shapes with such nice properties as analytic solutions. Those familiar with the process of training machine learning models know that the machine kneads the silly putty with stochastic gradient descent: put plainly, it kneads a little, checks whether what it has shaped satisfies the training data you fed it, and if not, kneads a little more, not stopping until your requirements are met. This shows that the machine's motivation for kneading is that what it has produced does not yet meet your requirements, and it proves the machine is a very lazy thing: once it has described the price of apples in 20 words, it has no motivation to describe it in 10. So when you yourself do not know what you want the machine to make, it is best not to give it too much freedom; otherwise it will make something very complicated for you that meets your requirements but does not work very well, because it violates the razor principle. In machine learning models, the more parameters there are, the greater the freedom and the more complex the structure, and the more likely the phenomenon of "over-fitting". Therefore many classic networks use "parameter sharing" tricks to reduce the number of parameters: CNN networks share parameters through convolution, and LSTM introduces a slowly changing control matrix C to achieve parameter sharing.
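To make the "knead a little, check, knead again" description concrete, here is a minimal gradient-descent loop fitting a toy model y=w*x; it uses plain full-batch descent rather than the stochastic variant, purely to keep the sketch short.

```python
# A minimal gradient-descent loop in the spirit of the "kneading" description:
# adjust the parameter a little, check the error on the training data, repeat.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 5.0 * x + 0.01 * rng.standard_normal(200)   # the "examples" we feed the machine

w, lr = 0.0, 0.1
for step in range(200):
    err = w * x - y                     # how far the current shape is from the data
    grad = 2 * np.mean(err * x)         # direction in which to "knead"
    w -= lr * grad                      # knead a little
    if np.mean(err**2) < 1e-4:          # stop once the data is satisfied
        break
print(w)                                 # ~5.0
```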

2. A quick test of the blade: LSTM awaits you

After the discussion in the section above, I believe you have picked up some routines for analyzing machine learning models. Finally, let's put them into practice, starting with the bidirectional LSTM.

As shown in the figure above, first look at the input and output of the LSTM network. The most obvious input is the 32 purple 27*384 matrices, and the output is 32 matrices of 27*256, where each 27*256 matrix is pieced together from two 27*128 matrices, produced by the forward LSTM and the reverse LSTM respectively. For simplicity, let's look only at the forward LSTM for now, so the input is 32 matrices of 27*384 and the output is 32 matrices of 27*128. Following the "dimension reducer" routine analyzed above, 32 matrices of 384*128 would be needed here. Following the "parameter sharing" routine, the structure of a real single LSTM unit is shown in the figure below:
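As a quick shape check, the forward and reverse passes each produce 32 matrices of 27*128, and each pair is concatenated into a 27*256 matrix; random arrays stand in for the real LSTM outputs in this sketch.

```python
# Shape bookkeeping for the bidirectional LSTM output described above.
import numpy as np

forward = [np.random.rand(27, 128) for _ in range(32)]    # forward LSTM outputs
backward = [np.random.rand(27, 128) for _ in range(32)]   # reverse LSTM outputs
merged = [np.concatenate([f, b], axis=1) for f, b in zip(forward, backward)]
print(len(merged), merged[0].shape)      # 32 (27, 256)
```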

As can be seen from the figure, a real LSTM unit is not a simple 384*128 matrix A. Instead, the output node H of the previous unit in the LSTM sequence is pulled in and concatenated with the input X to form a 27*512 input, which is multiplied by a 512*512 parameter matrix. The result is then processed together with the control node C output by the previous unit, reducing the 512 dimensions down to 128. Finally two outputs are obtained: a new 27*128 output node H and a new 27*128 control node C. This new H and C are fed into the next LSTM unit and influence its output.
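Below is a sketch of a single forward-LSTM step with the shapes given above, assuming a standard LSTM gate layout (four 128-wide gates packed into one 512*512 matrix); since the figure is not reproduced here, that layout is an assumption, and the weights are random placeholders rather than the trained OCR model.

```python
# One forward-LSTM step with the shapes described above, under a standard
# gate layout: [H, X] is 27x512, one shared 512x512 matrix produces four
# 128-wide gate pre-activations, and C is the control (cell-state) matrix.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(X, H_prev, C_prev, W, b):
    ZX = np.concatenate([H_prev, X], axis=1)        # 27 x (128 + 384) = 27 x 512
    gates = ZX @ W + b                              # 27 x 512, W shared by all 32 steps
    i, f, o, g = np.split(gates, 4, axis=1)         # four 27 x 128 blocks
    C_new = sigmoid(f) * C_prev + sigmoid(i) * np.tanh(g)   # new control node C
    H_new = sigmoid(o) * np.tanh(C_new)                     # new output node H
    return H_new, C_new                             # both 27 x 128

W = np.random.rand(512, 512) * 0.01                 # placeholder, not trained weights
b = np.zeros(512)
H, C = np.zeros((27, 128)), np.zeros((27, 128))
for X in [np.random.rand(27, 384) for _ in range(32)]:   # the 32-step input sequence
    H, C = lstm_step(X, H, C, W, b)
print(H.shape, C.shape)                              # (27, 128) (27, 128)
```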

Here it can be seen that, thanks to the matrices C and H, even though the 512*512 parameter matrix is exactly the same in all 32 units of the LSTM sequence, the H and X fed into each unit differ, so the transformation each unit performs is different; yet because they are all obtained by multiplying by the same 512*512 matrix, they differ while remaining similar, since they follow the same set of rules (the same 512*512 matrix). We can see that the LSTM combines the previous unit's output H with the input X as its input and introduces the control matrix C in order to apply the razor method, share parameters and simplify the model. This network structure also makes the output of the current sequence unit depend on the output of the previous one, which suits the modeling of sequence scenarios such as OCR, NLP, machine translation and speech recognition.

Here we can see that, although the neural network is that piece of silly putty, we cannot feed it all possible data, the machine's computing power cannot support it, or we simply want the machine to learn faster. So in current AI applications we still carefully design the network structure according to the essence of the actual scenario and our own prior knowledge, and then hand the putty, already shaped almost completely by humans, over to the machine. That is why I prefer to say we are still in an era of weak AI. In this era where computing power is still limited, our researchers are committed, on the one hand, to studying general-purpose networks for different scenarios, such as CNN for images and RNN, LSTM and GRU for sequences, and on the other hand, to optimizing how neural networks learn, such as SGD-based optimization algorithms and reinforcement learning training methods, all trying to reduce the computing power that AI requires.

We believe that with the efforts of human beings, the era of strong AI will come.
