CRNN

Preface

This article is mainly based on Bestrivern's blog post on the CRNN+CTC network; it aims to deepen understanding of the network and organize the insights gained along the way.

I. Overview

Commonly used text recognition algorithms mainly have two frameworks:

  • CNN+RNN+CTC(CRNN+CTC)
  • CNN+Seq2Seq+Attention

This article introduces the first method.

CRNN is a convolutional recurrent neural network structure used to solve image-based sequence recognition problems, especially the scene text recognition problem.

The paper treats character recognition as a sequence prediction problem, so it adopts an RNN, a network designed for sequence prediction. After image features are extracted by a CNN, the sequence is predicted by the RNN, and the final result is obtained through a CTC translation (transcription) layer. In short, the structure is CNN+RNN+CTC.

CRNN, short for Convolutional Recurrent Neural Network, is mainly used for end-to-end recognition of variable-length text sequences. Instead of first segmenting individual characters, it transforms text recognition into a time-dependent sequence learning problem, i.e., image-based sequence recognition.

II. CRNN network structure

                                     

The entire CRNN network structure consists of three parts, from bottom to top:

  1. CNN (convolutional layers): a deep CNN extracts features from the input image to obtain a feature map;
  2. RNN (recurrent layers): a bidirectional RNN (BLSTM) predicts over the feature sequence, learning each feature vector in the sequence and outputting a distribution over the predicted labels;
  3. CTC loss (transcription layer): converts the series of label distributions obtained from the recurrent layers into the final label sequence.

What is the difficulty of end-to-end OCR? It is how to handle the alignment of variable-length sequences! CRNN-based OCR actually borrows the idea used in speech recognition to handle variable-length speech sequences. Like a speech recognition problem, OCR can be modeled as a time-dependent word or phrase recognition problem. Training RNNs with Connectionist Temporal Classification (CTC) significantly outperforms traditional algorithms in the speech recognition field, and some researchers have tried to apply the CTC loss function to OCR recognition, with CRNN as a representative algorithm.

The CRNN algorithm takes as input a height-normalized 100×32 image, extracts a feature map with a 7-layer CNN (commonly VGG16-style), and splits the feature map by columns (map-to-sequence); the 512-dimensional feature of each column is fed into a two-layer bidirectional LSTM with 256 units for classification. In the training process, the CTC loss function guides a soft alignment between character positions and class labels.

CRNN borrows the LSTM+CTC modeling approach from speech recognition, the difference being that the LSTM input is replaced: instead of acoustic features from the speech field (MFCC, etc.), it receives image feature vectors extracted by the CNN. The biggest contribution of the CRNN algorithm is combining the strength of CNNs at image feature engineering with the strength of LSTMs at sequence recognition. It not only extracts robust features, but also, by means of sequence recognition, avoids the single-character segmentation and single-character recognition that are very difficult in traditional algorithms; meanwhile, sequence recognition embeds time-dependent context (an implicit use of a corpus). In the training phase, CRNN uniformly scales training images to 100×32 (w × h); in the test phase, to address the drop in recognition rate caused by character stretching, CRNN keeps the aspect ratio of the input image, but the image height must still be 32 pixels. The width of the convolutional feature map dynamically determines the LSTM sequence length. Here is an example.

Now suppose an image is input. To feed features into the recurrent layers, the following is done:

  • The image is first scaled to 32×W×1;
  • After the CNN it becomes 1×(W/4)×512;
  • Then, for the LSTM, set T = (W/4) and D = 512, and the features can be fed into the LSTM.
    • Note this T: it appears again later in the article, so remember what it means.
  • The LSTM has 256 hidden units. After passing through the LSTM, the output becomes a matrix of size T by nclass; after softmax processing, each element of a column vector represents the predicted probability of the corresponding character. Finally, the T per-step predictions are merged, with redundancy removed, into a complete recognition result (see the shape sketch below).
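A minimal shape walk-through of these four steps, assuming an example width W = 160 and a hypothetical nclass = 37 (26 letters + 10 digits + 1 blank):

W, D = 160, 512      # example width; feature depth produced by the CNN
T = W // 4           # number of LSTM time steps
nclass = 37          # hypothetical: 26 letters + 10 digits + 1 blank
# image (1, 32, W) -> CNN -> (512, 1, T) -> sequence of T vectors of size D
# -> LSTM + softmax -> T x nclass posterior probability matrix
print(T, D, nclass)  # 40 512 37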

                                   

1. CNN

Structure diagram of the convolution layer:

There are four max-pooling layers, but the window size of the last two pooling layers changes from 2×2 to 1×2. This means the image height is halved four times (divided by $2^4$) while the width is halved only twice (divided by $2^2$). This is because most text images are short in height and long in width, so their feature maps are also low, wide rectangles. Using a 1×2 pooling window preserves information along the width direction as much as possible, which better suits recognizing narrow characters (such as distinguishing i and l).
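As a quick illustration (reading the 1×2 window as a kernel of height 2 and width 1 in PyTorch terms, which is an assumption; implementations vary in padding), the height is halved while the width is preserved:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))
x = torch.randn(1, 256, 8, 40)   # (batch, channels, H, W)
print(pool(x).shape)             # torch.Size([1, 256, 4, 40])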

CRNN also introduces the BatchNormalization module, which accelerates model convergence and shortens the training process.

The input image is grayscale (single channel); its height is 32, which is fixed, and after the image passes through the CNN the height becomes 1, which is very important. The width here is 160; other values are possible, but the width must be uniform, so the data size input to the CNN is (channel, height, width) = (1, 32, 160).

The output size of CNN is (512, 1, 40). In other words, CNN finally obtains 512 feature maps, each of which has a height of 1 and a width of 40.

Note: the final convolutional layer is a 2×2 convolution with s=1, p=0, which halves the feature map height once more, so the whole CNN scales the image height by $\frac{1}{2^4 \times 2} = \frac{1}{32}$. Therefore the height of the feature map output by the CNN is 1.

assert imgH % 16 == 0, 'imgH has to be a multiple of 16'

In the program, the image height h must be an integer multiple of 16.

assert h == 1, "the height of conv must be 1"

In forward propagation, the height of the feature map obtained from the CNN must be 1.

Finally, the feature map obtained from the CNN is 512×1×40.
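For concreteness, here is a minimal sketch of such a backbone in the style of the common crnn.pytorch implementation. This is an assumption, not necessarily the exact network this article follows; note that with this particular padding, the output width comes out as 41 rather than exactly W/4 = 40, since implementations differ by a column or two in how they pad the 1×2 pooling layers:

import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 64, 3, 1, 1), nn.ReLU(True),
    nn.MaxPool2d(2, 2),                          # 64 x 16 x 80
    nn.Conv2d(64, 128, 3, 1, 1), nn.ReLU(True),
    nn.MaxPool2d(2, 2),                          # 128 x 8 x 40
    nn.Conv2d(128, 256, 3, 1, 1), nn.BatchNorm2d(256), nn.ReLU(True),
    nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(True),
    nn.MaxPool2d((2, 2), (2, 1), (0, 1)),        # 256 x 4 x 41 (1x2-style pooling)
    nn.Conv2d(256, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.ReLU(True),
    nn.Conv2d(512, 512, 3, 1, 1), nn.ReLU(True),
    nn.MaxPool2d((2, 2), (2, 1), (0, 1)),        # 512 x 2 x 42
    nn.Conv2d(512, 512, 2, 1, 0), nn.ReLU(True)  # 512 x 1 x 41 (final 2x2 conv)
)

x = torch.randn(1, 1, 32, 160)   # (batch, channel, H=32, W=160)
print(cnn(x).shape)              # torch.Size([1, 512, 1, 41])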

2. Map-to-Sequence

We cannot directly send the feature map obtained by the CNN into the RNN for training; some adjustment is needed to extract from the feature map the feature vector sequence the RNN requires.

                     

Now we need to extract a sequence of feature vectors from the feature map produced by the CNN model. Each feature vector (a red box in the figure above) is generated on the feature map column by column from left to right; each column contains 512-dimensional features. This means the i-th feature vector is the concatenation of the i-th column's pixels across all feature maps, and together these feature vectors form a sequence.

Because the convolutional layers, max-pooling layers, and activation functions operate on local regions, they are translation invariant. Therefore each column of the feature map (i.e., one feature vector) corresponds to a rectangular region of the original image (called the receptive field), and these rectangular regions have the same left-to-right order as their corresponding columns on the feature map. Each vector in the feature sequence is associated with one receptive field.

Specifically, these 40 sequence vectors correspond to the original image with a stride of 4, and are used to classify the corresponding regions of the original image.

Your own understanding

In this case, each feature vector in the sequence corresponds to a vertical region of the image.

Specifically, the 40 sequence vectors correspond to the original image with a stride of 4 and are used to classify the relevant regions of the original image. This sentence indicates that one sequence vector corresponds to one vertical region: the original width is 160, and moving with a stride of 4 divides it exactly into 40 sequence features.

                                  

This sequence of feature vectors serves as the input to the recurrent layers, and each feature vector is the input to the RNN at one time step.
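A minimal sketch of this map-to-sequence step, assuming the cnn stack and the input x from the example above:

conv = cnn(x)                     # (b, 512, 1, w); w is about 40 for a 32x160 input
b, c, h, w = conv.size()
assert h == 1, "the height of conv must be 1"
conv = conv.squeeze(2)            # (b, 512, w): drop the height dimension
sequence = conv.permute(2, 0, 1)  # (w, b, 512): w time steps of 512-d vectors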

Your own understanding

Why does the CNN feed into an RNN?

Previously it was not very clear what relationship holds among the input features (the sequence) of the recurrent network. In fact, our character sequence is the width of 160 divided evenly into 40 parts; it is a continuous sequence with an ordered before-and-after relationship, so we can use an RNN to extract features over this sequence and complete the classification together with a softmax operation.

3. RNN

Because plain RNNs suffer from vanishing gradients and cannot capture long-range context, CRNN uses LSTM; LSTM's special design allows it to capture long-distance dependencies.

An LSTM is unidirectional and uses only past information. In image-based sequences, however, the contexts in both directions are mutually useful and complementary, so two LSTMs, one forward and one backward, are combined into a bidirectional LSTM. In addition, multiple bidirectional LSTM layers can be stacked; the deep structure allows a higher level of abstraction than a shallow one.

A two-layer bidirectional LSTM with 256 units per layer is adopted here:

Through the step above, 40 feature vectors are obtained, each of length 512. In the LSTM, one feature vector is passed in at each time step for classification; there are 40 time steps here.

We know that a feature vector corresponds to a small rectangular region in the original image, and the goal of the RNN is to predict which character this rectangular region contains: it predicts from the input feature vector the softmax probability distribution over all characters, a vector whose length equals the number of character classes, which serves as the input to the CTC layer.

Because every time step has an input feature vector $x^t$ and outputs a probability distribution $y^t$ over all characters, the output is a posterior probability matrix composed of 40 vectors whose length equals the number of character classes.

Your own understanding

Why is the dimension of the generated posterior probability matrix T by n?

Since the LSTM input at each time step is one feature vector and the output is the prediction for one character, i.e., the probability of each character occurring, the network structure matters here: it is in fact a two-layer bidirectional LSTM followed by a softmax layer. Each time step yields n values, the probabilities of all characters, and there are T time steps in total. So there are T × n values, and the dimension is T by n.

As shown in the figure below:

This posterior probability matrix is then passed into the transcription layer.

The source code for this section is as follows:

self.rnn = nn.Sequential(
            BidirectionalLSTM(512, nh, nh),
            BidirectionalLSTM(nh, nh, nclass))

Then the parameters are set as follows:

nh=256
nclass = len(opt.alphabet) + 1
nc = 1

The implementation of bidirectional LSTM is as follows:

class BidirectionalLSTM(nn.Module):
    def __init__(self, nIn, nHidden, nOut):
        super(BidirectionalLSTM, self).__init__()

        # bidirectional=True doubles the LSTM output size to nHidden * 2
        self.rnn = nn.LSTM(nIn, nHidden, bidirectional=True)
        self.embedding = nn.Linear(nHidden * 2, nOut)

    def forward(self, input):
        recurrent, _ = self.rnn(input)      # (T, b, nHidden * 2)
        T, b, h = recurrent.size()
        t_rec = recurrent.view(T * b, h)    # flatten time and batch for the linear layer

        output = self.embedding(t_rec)      # [T * b, nOut]
        output = output.view(T, b, -1)      # restore [T, b, nOut]
        return output

After the first BidirectionalLSTM (with batch size 256): output = [40 * 256, 256], which view reshapes to [40, 256, 256].

After the second BidirectionalLSTM: output = [40 * 256, nclass], reshaped to [40, 256, nclass].
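As a quick sanity check of these shapes (batch size 256 as in the notes above; nclass = 37 is a hypothetical value), one might run:

import torch
import torch.nn as nn

# assumes the BidirectionalLSTM class defined above
rnn = nn.Sequential(
    BidirectionalLSTM(512, 256, 256),
    BidirectionalLSTM(256, 256, 37),
)
seq = torch.randn(40, 256, 512)   # (T, batch, feature)
print(rnn(seq).shape)             # torch.Size([40, 256, 37]) = (T, batch, nclass)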

4. CTC loss (the key part)

This is the most difficult part of CRNN. This layer is the transcription layer: the process of converting the RNN's per-feature-vector predictions into a label sequence. Mathematically, transcription means finding the label sequence with the highest combined probability given the per-frame predictions.

The difficulty of end-to-end OCR recognition lies in how to handle the alignment of variable-length sequences! OCR can be modeled as a time-dependent text-image problem, and the Connectionist Temporal Classification (CTC) loss function is then used to jointly train the CNN and RNN end to end.

4.1 Sequence merge mechanism

Now we need to translate the sequence output by the RNN into the final recognition result. When the RNN performs classification over time steps, a lot of redundant information inevitably appears; for example, one letter may be recognized twice in a row, so a de-redundancy mechanism is needed.

For example, to recognize the text above, where the RNN has five time steps: ideally t0, t1, t2 should all map to "a" and t3, t4 should all map to "b"; connecting these character outputs gives "aaabb", and merging consecutive repeated characters into one yields the final result "ab".

This seems like a good approach, but there is a problem. For words like "book" and "hello", merging consecutive characters gives "bok" and "helo", which obviously does not work, so CTC has a blank mechanism to solve this problem.

We use the "-" symbol to represent blank. When the RNN outputs a sequence, a "-" is inserted between repeated characters in the text label. For example, if the output sequence is "bbooo-ookk", it is finally mapped to "book".

In other words, decoding first deletes consecutive repeated characters and then deletes all "-" characters from the path; this is called the decoding process, while the encoding is produced by the neural network. By introducing the blank mechanism, the repeated-character problem is solved neatly.

Your own understanding

Encoding: implemented by the neural network

Decoding: remove consecutive duplicate characters, then remove all blanks from the path
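A minimal sketch of this decoding rule (the mapping B): merge consecutive repeats first, then drop the blank:

def ctc_decode(path: str, blank: str = "-") -> str:
    # keep a character only if it differs from its predecessor, then strip blanks
    merged = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(c for c in merged if c != blank)

print(ctc_decode("bbooo-ookk"))   # book
print(ctc_decode("aaabb"))        # ab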

The same text label can correspond to multiple different character alignments; for example, "aa-b", "aabb", and "-abb" all represent the same text ("ab") but align differently with the image. More generally, one text label corresponds to one or more paths.

4.2 Training Stage

In the training stage, we need to obtain the loss function from these probability distribution vectors and the corresponding text labels, so as to train the neural network model. Now let's see how the loss function is obtained.

                                              

As shown in the figure above, for the simplest case of character recognition with sequence length 2, there are two time steps (t0, t1) and three possible characters "a", "b", and "-", so we get two probability distribution vectors. If we decode by taking the maximum-probability path, "--" has the largest probability, i.e., the probability that the real text is empty is 0.6 × 0.6 = 0.36.

However, there are multiple alignments for the character "a": "aa", "a-", and "-a" all represent "a", so the probability of outputting "a" should be the sum of the three:

0.4 × 0.4 + 0.4 × 0.6 + 0.6 × 0.4 = 0.16 + 0.24 + 0.24 = 0.64

So the probability of "a" is higher than the probability of the empty "-"! If the label text is "a", the loss function is computed from the sum of the scores of all possible alignment combinations (paths) of "a" in the image.
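The toy numbers above can be checked in a few lines:

# per-step scores from the figure: p(a) = 0.4, p(-) = 0.6 at both steps
p_a, p_blank = 0.4, 0.6
prob_a = p_a * p_a + p_a * p_blank + p_blank * p_a   # "aa" + "a-" + "-a"
prob_empty = p_blank * p_blank                       # "--"
print(prob_a, prob_empty)                            # 0.64 0.36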

Therefore, for the RNN, given the input probability distribution matrix $y$, with $T$ the sequence length, the total probability of mapping to the label text $l$ is:

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$$

Here $B^{-1}(l)$ denotes the set of all paths that the sequence-to-sequence mapping function $B$ transforms into the text $l$, and $\pi$ is one such path. The probability of each path is the product of the scores of its corresponding characters at each time step.

Similar to ordinary classification, the CTC loss function $O$ is defined as the negative maximum likelihood: we only need to train the network to maximize this probability, and for convenience of computation the logarithm of the likelihood is taken:

$$O = -\sum_{(x, z) \in S} \ln p(z \mid x)$$

By computing the loss function, errors can be back-propagated through the preceding neural network, and the network parameters are updated according to the chosen optimizer, so as to find the characters most likely to correspond to each pixel region.

The mapping transformation and the summation over all possible path probabilities mean that CTC does not require the original input character sequence to be precisely segmented.
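In PyTorch, this training objective is available as the built-in nn.CTCLoss; here is a hedged usage sketch (the blank index 0 and all sizes are assumptions, and many CRNN implementations use warp-ctc bindings instead):

import torch
import torch.nn as nn

T, batch, nclass, target_len = 40, 4, 37, 10
# RNN output: (T, batch, nclass) log-probabilities
log_probs = torch.randn(T, batch, nclass, requires_grad=True).log_softmax(2)
targets = torch.randint(1, nclass, (batch, target_len), dtype=torch.long)
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), target_len, dtype=torch.long)

criterion = nn.CTCLoss(blank=0)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow back through the RNN and CNN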

4.3 Test Phase

Your own understanding

Notice: in the training phase, we compute over all paths (using dynamic programming), but in the test phase, to obtain results quickly, we take the maximum-probability path instead.

In the test phase, the process differs from the training phase: we use the trained neural network to recognize new text images without knowing any text in advance. If we computed, as above, every path of every possible text, then for long time steps and long character sequences the amount of computation would be enormous; that is not a feasible solution.

This part is important: we know that the RNN output at each time step is a probability distribution over all character classes, i.e., a vector containing a score for each character. We take the character with the maximum probability as the output character of that time step, concatenate the outputs of all time steps to get one sequence path, the maximum-probability path, and then obtain the final predicted text by the merge-sequence method introduced above.

In the output stage, after CTC translation, the sequence feature information learned by the network is transformed into the final recognized text, and the whole text image can be recognized.

                                

For example, in the figure above there are 5 time steps and the character classes are "a", "b", and "-" (blank). For each time step's probability distribution we take the character with the largest score, obtaining the sequence path "aaa-b"; first removing adjacent repeated characters gives "a-b", and then removing the blank character gives the final result "ab".
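Putting the two test-phase steps together, reusing the ctc_decode sketch from section 4.1 (the alphabet ordering with blank first is an assumption):

import torch

alphabet = "-ab"                          # blank, then the character classes
probs = torch.rand(5, 3).softmax(1)       # T=5 steps, 3 classes (toy scores)
path = "".join(alphabet[int(i)] for i in probs.argmax(1))
print(path, "->", ctc_decode(path))       # e.g. "aa-ab" -> "ab"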

4.4 Summary

In the prediction process, a standard CNN network extracts features from the text image, then a BLSTM fuses the feature vector sequence to extract context features of the character sequence and obtain a probability distribution for each feature column, and finally the transcription layer (CTC) predicts the text sequence.

BLSTM and CTC are used to learn the context relationships in text images, so as to effectively improve the accuracy of text recognition and make the model more robust.

In the training phase, CRNN uniformly scales the training images to 160×32 (w × h); in the test phase, since character stretching reduces the recognition rate, CRNN keeps the aspect ratio of the input image, but the image height must be uniformly 32 pixels. The width of the convolutional feature map dynamically determines the sequence length (number of time steps) of the LSTM.

V. Supplementary explanation

5.1 Encoding in CRNN

Your own understanding

The blogger's picture sums things up very well, vividly illustrating the CNN + LSTM + softmax process.

        

Suppose there are 26 letters to recognize; the number of classes is then 27 (adding one blank character).

Suppose the CNN output corresponds to 50 sequence steps (as in RNN recognition of handwritten digits). If the sequence is too long, it cannot be trained properly and the recognition results will miss letters; if the sequence is too short, it also cannot be trained properly, and a single step may cover multiple letters.

5.2 CTC

As shown in the figure below, to help readers understand, the RNN structure is simplified to a single unidirectional LSTM layer. The acoustic modeling unit is chosen as the letters {a–z}, the character set of the modeling unit is extended (with blank), and a many-to-one mapping function from the output layer to the final label sequence is defined, so that the RNN output layer can be mapped onto the final label sequence.

Many-to-one mapping function:

  • Consecutive identical characters are deduplicated
  • Blank characters are removed

                      

Therefore, to compute $p(z \mid x)$, we can add up the probabilities of all its corresponding output sequences (i.e., all the "paths" that map to the final label), as shown in the figure below.

As shown in the figure below, the definition of the CTC loss function can be obtained from the conditional independence assumption of the RNN:

Your own understanding (The core part of the principle)

Actually, the $\pi$ shown in the figure above is one path given $x$, and its probability is

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \quad \forall \pi \in L^{T}$$

Since the mapping is many-to-one, several paths map to the same final label, and the probability of the final label equals the sum of the probabilities of all its paths:

$$p(z \mid x) = \sum_{\pi \in B^{-1}(z)} p(\pi \mid x)$$

where $B^{-1}(z)$ is the set of all paths of $z$ under the mapping function $B$.

Since CTC maximizes the likelihood, the product over the training set of $p(z \mid x)$ for each probability output by the RNN, minimizing the loss takes the negative logarithm:

$$L(S) = -\ln \prod_{(x, z) \in S} p(z \mid x) = -\sum_{(x, z) \in S} \ln \sum_{\pi \in B^{-1}(z)} \prod_{t=1}^{T} y_{\pi_t}^{t}$$

Assuming that single-layer LSTM is selected as RNN structure, the final model structure is as follows:

The figure above illustrates the RNN + CTC process well and is worth pondering.

5.3 Dynamic programming simplifies the CTC calculation

Because directly computing $p(z \mid x)$ by brute force has high complexity, the author borrows the forward-backward algorithm from HMMs and uses dynamic programming to solve it.

To picture the search space, let the x-axis represent the time sequence and the y-axis represent the output sequence, and standardize the output sequence by inserting blanks at the beginning, between characters, and at the end. Let $l$ denote the final label and $l'$ its extended form, so $|l'| = 2|l| + 1$. For example: $l$ = apple ==> $l'$ = _a_p_p_l_e_.

                         

Not all paths in the figure are legal; legal paths must obey the following constraints, illustrated in the figures below:

  • Only moves to the right and downward are allowed; no other directions
  • There must be at least one blank between two identical characters
  • Non-blank characters cannot be skipped
  • The path must start at one of the first two characters
  • The path must end at one of the last two characters

                        

                        

                    

                       

Therefore, traversing all legal paths that map to "apple" under the above constraint rules, with final sequence length T=8, all paths for label = "apple" are shown below:

                 

Now, how do we sum the probabilities of these paths? Brute-force traversal? Divide and conquer? The author borrows the forward-backward algorithm from HMMs and uses dynamic programming. The path set can be divided into forward and backward parts, as shown below:

                 

In the figure, $a_3(3) = (a_2(3) + a_2(2)) \cdot y_{\_}^{3}$ means: $a_3(3)$ can be reached from $a_2(3)$ and $a_2(2)$, multiplied by $y_{\_}^{3}$, the probability at the current point. A blank state can only come from itself or from the letter directly above it.

In the figure, $a_5(6) = (a_4(6) + a_4(5)) \cdot y_{p}^{5}$ means: $a_5(6)$ can be reached from $a_4(6)$ and $a_4(5)$, multiplied by $y_{p}^{5}$, the probability at the current point. When two adjacent letters are the same, the sources can only be the blank above and the state itself.

In the figure, $a_4(4) = (a_3(4) + a_3(3) + a_3(2)) \cdot y_{p}^{4}$ means: $a_4(4)$ can be reached from $a_3(4)$, $a_3(3)$, and $a_3(2)$, multiplied by $y_{p}^{4}$, the probability at the current point. When two adjacent letters are different, the sources can be the blank above, the state itself, or a skip over the blank from the previous letter.
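These three cases are exactly the recursion of the forward algorithm. Here is a compact sketch in plain Python, assuming y[t][c] holds the per-step score of character c (illustrative, not an optimized implementation):

def ctc_forward(y, label, blank="-"):
    ext = blank + blank.join(label) + blank          # l', with |l'| = 2|l| + 1
    T, S = len(y), len(ext)
    a = [[0.0] * S for _ in range(T)]
    a[0][0] = y[0][ext[0]]                           # start at one of the
    a[0][1] = y[0][ext[1]]                           # first two states
    for t in range(1, T):
        for s in range(S):
            total = a[t - 1][s]                      # stay in the same state
            if s > 0:
                total += a[t - 1][s - 1]             # come down from s - 1
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += a[t - 1][s - 2]             # skip the blank (only when
            a[t][s] = total * y[t][ext[s]]           # the two letters differ)
    return a[T - 1][S - 1] + a[T - 1][S - 2]         # end at one of the last two

# toy check: uniform scores over {a, b, -}, T = 3
y = [{"a": 1/3, "b": 1/3, "-": 1/3}] * 3
print(ctc_forward(y, "ab"))   # 5/27: the five paths aab, abb, ab-, a-b, -ab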

                   

After the forward probabilities are solved by dynamic programming, they can be used to compute the CTC loss function, as shown below:

The idea of recursion

                   

                   

Similarly, we can define the backward probability and use it to compute the CTC loss function, as shown below:

                                    

                      

The CTC loss function can also be computed by combining the forward and backward probabilities (removing the arrow direction); this step is very important for deriving the gradient of the CTC loss function, as shown below:

Probability of all paths passing through the state p at time $t=4$: $\frac{a_4(4)\,b_4(4)}{y_p^4}$. This combines the forward probability sum $a_4(4)$ and the backward probability sum $b_4(4)$; since both include the score of the current point $y_p^4$, the duplicated factor is divided out, yielding the formula $\frac{a_4(4)\,b_4(4)}{y_p^4}$.

At time $t=4$, traversing all states $s$, the total probability of all reachable paths is:

$$\frac{a_4(4)\,b_4(4)}{y_p^4} + \frac{a_4(5)\,b_4(5)}{y_{\_}^4} + \frac{a_4(6)\,b_4(6)}{y_p^4}$$

         

From the above we obtain the recursion formulas: the CTC loss function can be computed from the forward probabilities, from the backward probabilities, or from the forward and backward probabilities at any time step.

In summary, computing the CTC loss function from the forward probabilities gives the following conclusion (key point):

Computing the CTC loss function from the backward probabilities gives the following conclusion:

Computing the CTC loss function from the forward and backward probabilities at any time step gives the following conclusion:

So far, we have obtained an efficient way to compute the CTC loss; next, we take its derivative.

The red part in the figure below is the core of the derivative of the CTC loss function (there is no need to insist on fully understanding it; focus on the loss computation formulas above):

The derivation of the CTC loss function with respect to the RNN output layer elements is shown below:

Thus, the explanation of CTC Loss in the training process has been completed.

There are two main methods of CTC decoding at test time:

  • CTC Prefix Search Decoding
  • CTC Beam Search Decoding