The previous article, "Tensorflow implements end-to-end OCR: recognizing second-generation ID card numbers", recognized fixed-length 18-digit strings with 98% accuracy. In practical scenarios, however, the length of the string often cannot be determined in advance; in that case, besides training the character-recognition model, one would also have to train a character-segmentation model. This article implements the second method mentioned there: using LSTM+CTC to recognize digit strings of indefinite length.

Environment dependencies

The environment dependencies are basically the same as in the previous article.

Prerequisite knowledge

  1. LSTM (Long Short-Term Memory network): a special RNN architecture that can solve the long-term dependency problem ordinary RNNs cannot. For a detailed introduction, see the translation of Understanding LSTM Networks.

  2. CTC: Connectionist Temporal Classification, suited to time-series problems where the alignment between input features and output labels is uncertain; it optimizes the model parameters and the alignment boundaries end to end. For example, a 32 x 256 image can be split into at most 256 columns, so the maximum input feature length is 256, while the output label is at most 18 characters long; this kind of problem can be optimized with CTC. My intuitive understanding of CTC is this: take a 32 x 256 image whose label is the digit string "123". The image is split by columns (CTC optimizes this segmentation), each segment is classified, and for each independent modeling unit (each column block) we obtain a probability distribution over the digits plus a special character "-" marking segments where no digit can be identified, conditioned on the input feature sequence (the image). From these distributions, the probability P("123") that the label sequence is "123" is computed as the sum of the probabilities of all subsequences that collapse to "123", i.e. subsequences that may contain "-" and consecutive repeats of '1', '2' and '3', as shown in the figure below (a small numeric sketch follows the figure):


The sum of probabilities of all subsequences
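
To make the path-summing idea concrete, here is a toy sketch (my own illustration, not part of the article's code): collapse() implements the CTC mapping that merges consecutive repeats and then removes blanks, and P("123") is accumulated over every per-column path that collapses to "123". The four column distributions are made-up numbers.

import itertools

def collapse(path, blank='-'):
    # Merge consecutive repeats, then drop blanks
    merged = [c for c, _ in itertools.groupby(path)]
    return ''.join(c for c in merged if c != blank)

# Made-up per-column distributions over {'1', '2', '3', '-'} for a 4-column "image"
cols = [
    {'1': 0.6, '2': 0.1, '3': 0.1, '-': 0.2},
    {'1': 0.2, '2': 0.5, '3': 0.1, '-': 0.2},
    {'1': 0.1, '2': 0.2, '3': 0.5, '-': 0.2},
    {'1': 0.1, '2': 0.1, '3': 0.6, '-': 0.2},
]

p_123 = 0.0
for path in itertools.product('123-', repeat=len(cols)):
    if collapse(path) == '123':
        prob = 1.0
        for col, c in zip(cols, path):
            prob *= col[c]
        p_123 += prob
print('P("123") = %.6f' % p_123)  # sums paths such as 123-, -123, 1123, 1223, ...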

This article uses the CTC wrapper built into the TF framework, tf.nn.ctc_loss; our final goal is to minimize ctc_loss, which is officially defined as follows:

ctc_loss(
    labels,
    inputs,
    sequence_length,
    preprocess_collapse_repeated=False,
    ctc_merge_repeated=True,
    time_major=True
)

inputs: [max_time_step, batch_size, num_classes] by default; when time_major=False, the shape is [batch_size, max_time_step, num_classes]. The overall data flow is: image_batch -> [batch_size, max_time_step, num_features] -> lstm -> [batch_size, max_time_step, cell.output_size] -> reshape -> [batch_size*max_time_step, num_hidden] -> affine projection AW+b -> [batch_size*max_time_step, num_classes] -> reshape -> [batch_size, max_time_step, num_classes] -> transpose -> [max_time_step, batch_size, num_classes]

In this article the input image size is (32, 256), so num_features is 32 and max_time_step is 256 (the maximum number of column segments). Here cell.output_size == num_hidden; the values of num_hidden and num_classes are defined as constants below.

labels: the labels of the OCR recognition results, stored as a sparse matrix; this is explained in the training-data generation section below.

sequence_length: a 1-D tensor [max_time_step, ..., max_time_step] of length batch_size, where every value is max_time_step.

Therefore, all we need to do is convert the label strings (the expected OCR results), the image data, and the sequence lengths into labels, inputs, and sequence_length respectively.

Implementation

Define some constants

# Define some constants
# Image size, 32 x 256
OUTPUT_SHAPE = (32, 256)

num_epochs = 10000

# LSTM network
num_hidden = 64
num_layers = 1

obj = gen_id_card()
num_classes = obj.len + 1 + 1  # 10 digits + blank + ctc blank

# Learning rate
INITIAL_LEARNING_RATE = 1e-3
DECAY_STEPS = 5000
REPORT_STEPS = 100
LEARNING_RATE_DECAY_FACTOR = 0.9  # the learning rate decay factor
MOMENTUM = 0.9

DIGITS = '0123456789'
BATCHES = 10
BATCH_SIZE = 64
TRAIN_SIZE = BATCHES * BATCH_SIZE

Training data set generation

Training data generation is basically the same as in the previous article; the only change is an added option to generate strings of random length. The corresponding method is as follows:

def gen_text(self, is_ran=False):
    text = ''
    vecs = np.zeros((self.max_size * self.len))
    # Randomly pick a string length between 1 and max_size
    if is_ran == True:
        size = random.randint(1, self.max_size)
    else:
        size = self.max_size
    for i in range(size):
        c = random.choice(self.char_set)
        vec = self.char2vec(c)
        text = text + c
        vecs[i * self.len:(i + 1) * self.len] = np.copy(vec)
    return text, vecs
# Generate a training batch
def get_next_batch(batch_size=128):
    obj = gen_id_card()
    # (batch_size, 256, 32)
    inputs = np.zeros([batch_size, OUTPUT_SHAPE[1], OUTPUT_SHAPE[0]])
    codes = []
    for i in range(batch_size):
        # Generate an image containing a random-length string
        image, text, vec = obj.gen_image(True)
        # np.transpose: (32*256,) => (32, 256) => (256, 32)
        inputs[i, :] = np.transpose(image.reshape((OUTPUT_SHAPE[0], OUTPUT_SHAPE[1])))
        codes.append(list(text))
    # If the strings are "12" and "1", then targets is [['1','2'], ['1']]
    targets = [np.asarray(i) for i in codes]
    # Convert targets into a sparse matrix
    sparse_targets = sparse_tuple_from(targets)
    # (batch_size,), every sequence_length is 256
    seq_len = np.ones(inputs.shape[0]) * OUTPUT_SHAPE[1]
    return inputs, sparse_targets, seq_len
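
A quick sketch of what the batch generator returns (run in the context of the code above; the shapes follow from the constants):

inputs, sparse_targets, seq_len = get_next_batch(4)
print(inputs.shape)        # (4, 256, 32)
print(seq_len)             # [ 256.  256.  256.  256.]
print(sparse_targets[2])   # dense_shape, e.g. [ 4 11] depending on the longest label in the batch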

So what is a sparse matrix? Here is the definition from Baidu Baike:

A matrix in which zero elements far outnumber non-zero elements, and in which the non-zero elements are distributed irregularly, is called a sparse matrix.

It is easy to see why the training labels for OCR form a sparse matrix. Suppose we generate batch_size = 64 samples, each a digit string of length 1 to 18; this yields a (64, 18) matrix in which the positions holding digits are the non-zero elements and the rest are zeros. Because the label lengths vary, the non-zero elements are distributed irregularly, and the sparse label stores position information as well as the digit values.

Here is how to convert targets into a sparse matrix in Tensorflow:

def sparse_tuple_from(sequences, dtype=np.int32):
    """Create a sparse representation of x.
    Args:
        sequences: a list of lists of type dtype where each element is a sequence
    Returns:
        A tuple with (indices, values, shape)
    """
    indices = []
    values = []
    for n, seq in enumerate(sequences):
        indices.extend(zip([n] * len(seq), xrange(len(seq))))
        values.extend(seq)

    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    shape = np.asarray([len(sequences), np.asarray(indices).max(0)[1] + 1], dtype=np.int64)

    return indices, values, shape

Here indices is an int64 matrix of the coordinates of the non-zero elements, values holds the data values at those coordinates, and dense_shape gives the size of the corresponding dense matrix.

For example, indices = [[0,0],[0,1],[1,0]], values = [1,2,1] and dense_shape = [2,2] (a batch of 2 label sequences, maximum length 2) correspond to the dense tensor:

[[1, 2],
 [1, 0]]
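
As a quick round-trip check (a sketch assuming the TF 1.x API used throughout this article), the tuple produced by sparse_tuple_from can be wrapped in a tf.SparseTensor and densified back:

import numpy as np
import tensorflow as tf

indices, values, shape = sparse_tuple_from([np.asarray([1, 2]), np.asarray([1])])
st = tf.SparseTensor(indices, values, shape)
with tf.Session() as sess:
    print(sess.run(tf.sparse_tensor_to_dense(st, default_value=0)))
    # [[1 2]
    #  [1 0]]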

Along with the method that converts a list of sequences into a sparse matrix, we of course also need the reverse: a method that converts a sparse matrix back into a list of sequences:

def decode_sparse_tensor(sparse_tensor):
    decoded_indexes = list()
    current_i = 0
    current_seq = []
    for offset, i_and_index in enumerate(sparse_tensor[0]):
        i = i_and_index[0]
        if i != current_i:
            decoded_indexes.append(current_seq)
            current_i = i
            current_seq = list()
        current_seq.append(offset)
    decoded_indexes.append(current_seq)
    result = []
    for index in decoded_indexes:
        result.append(decode_a_seq(index, sparse_tensor))
    return result

def decode_a_seq(indexes, spars_tensor):
    decoded = []
    for m in indexes:
        str = DIGITS[spars_tensor[1][m]]
        decoded.append(str)
    return decoded
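
A usage sketch tying the two helpers together, round-tripping the earlier example (the values are indices into DIGITS = '0123456789'):

st = sparse_tuple_from([np.asarray([1, 2]), np.asarray([1])])
print(decode_sparse_tensor(st))  # [['1', '2'], ['1']]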

Build a network and start training

With data preparation complete, we can build the LSTM+CTC training model. How to implement an LSTM in TF is not explained in detail here; readers are invited to look it up themselves.

def get_train_model():
    # [batch_size, max_time_step, num_features], where num_features = OUTPUT_SHAPE[0] = 32
    inputs = tf.placeholder(tf.float32, [None, None, OUTPUT_SHAPE[0]])
    # Sparse matrix of labels
    targets = tf.sparse_placeholder(tf.int32)
    # 1-D vector of sequence lengths, [batch_size,]
    seq_len = tf.placeholder(tf.int32, [None])

    # Define the LSTM network
    cell = tf.contrib.rnn.LSTMCell(num_hidden, state_is_tuple=True)
    stack = tf.contrib.rnn.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
    outputs, _ = tf.nn.dynamic_rnn(cell, inputs, seq_len, dtype=tf.float32)

    shape = tf.shape(inputs)
    # batch_s is batch_size, max_timesteps is 256
    batch_s, max_timesteps = shape[0], shape[1]

    # [batch_size*max_time_step, num_hidden]
    outputs = tf.reshape(outputs, [-1, num_hidden])
    W = tf.Variable(tf.truncated_normal([num_hidden, num_classes], stddev=0.1), name="W")
    b = tf.Variable(tf.constant(0., shape=[num_classes]), name="b")

    # [batch_size*max_timesteps, num_classes]
    logits = tf.matmul(outputs, W) + b
    # Reshape back to [batch_size, max_timesteps, num_classes]
    logits = tf.reshape(logits, [batch_s, -1, num_classes])
    # Transpose to time-major: [max_timesteps, batch_size, num_classes]
    logits = tf.transpose(logits, (1, 0, 2))

    return logits, inputs, targets, seq_len, W, b
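
A quick shape sanity check for the model above (a sketch that assumes the constants and functions defined earlier; the batch size of 2 is arbitrary):

logits, inputs, targets, seq_len, W, b = get_train_model()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    fake = np.zeros([2, OUTPUT_SHAPE[1], OUTPUT_SHAPE[0]], dtype=np.float32)
    out = sess.run(logits, {inputs: fake, seq_len: [OUTPUT_SHAPE[1]] * 2})
    print(out.shape)  # (256, 2, 12): time-major [max_time_step, batch_size, num_classes]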

Training model

def train():
    global_step = tf.Variable(0, trainable=False)
    learning_rate = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                               global_step,
                                               DECAY_STEPS,
                                               LEARNING_RATE_DECAY_FACTOR,
                                               staircase=True)
    logits, inputs, targets, seq_len, W, b = get_train_model()

    # targets is a sparse matrix
    loss = tf.nn.ctc_loss(labels=targets, inputs=logits, sequence_length=seq_len)
    cost = tf.reduce_mean(loss)

    #optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=MOMENTUM).minimize(cost, global_step=global_step)
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss, global_step=global_step)

    # ctc_beam_search_decoder keeps the k most probable paths at each step;
    # the greedy alternative, ctc_greedy_decoder, keeps only the single most probable one
    decoded, log_prob = tf.nn.ctc_beam_search_decoder(logits, seq_len, merge_repeated=False)
    acc = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32), targets))

    init = tf.global_variables_initializer()

    def report_accuracy(decoded_list, test_targets):
        original_list = decode_sparse_tensor(test_targets)
        detected_list = decode_sparse_tensor(decoded_list)
        true_numer = 0
        if len(original_list) != len(detected_list):
            print("len(original_list)", len(original_list),
                  "len(detected_list)", len(detected_list),
                  " test and detect length doesn't match")
            return
        print("T/F: original(length) <-------> detected(length)")
        for idx, number in enumerate(original_list):
            detect_number = detected_list[idx]
            hit = (number == detect_number)
            print(hit, number, "(", len(number), ") <-------> ", detect_number, "(", len(detect_number), ")")
            if hit:
                true_numer = true_numer + 1
        print("Test Accuracy:", true_numer * 1.0 / len(original_list))

    def do_report():
        test_inputs, test_targets, test_seq_len = get_next_batch(BATCH_SIZE)
        test_feed = {inputs: test_inputs,
                     targets: test_targets,
                     seq_len: test_seq_len}
        dd, log_probs, accuracy = session.run([decoded[0], log_prob, acc], test_feed)
        report_accuracy(dd, test_targets)

    def do_batch():
        train_inputs, train_targets, train_seq_len = get_next_batch(BATCH_SIZE)
        feed = {inputs: train_inputs, targets: train_targets, seq_len: train_seq_len}
        b_loss, b_targets, b_logits, b_seq_len, b_cost, steps, _ = session.run(
            [loss, targets, logits, seq_len, cost, global_step, optimizer], feed)
        print(b_cost, steps)
        if steps > 0 and steps % REPORT_STEPS == 0:
            do_report()
            save_path = saver.save(session, "ocr.model", global_step=steps)
        return b_cost, steps

    with tf.Session() as session:
        session.run(init)
        saver = tf.train.Saver(tf.global_variables(), max_to_keep=100)
        for curr_epoch in xrange(num_epochs):
            print("Epoch.......", curr_epoch)
            train_cost = train_ler = 0
            for batch in xrange(BATCHES):
                start = time.time()
                c, steps = do_batch()
                train_cost += c * BATCH_SIZE
                seconds = time.time() - start
                print("Step:", steps, ", batch seconds:", seconds)

            train_cost /= TRAIN_SIZE

            train_inputs, train_targets, train_seq_len = get_next_batch(BATCH_SIZE)
            val_feed = {inputs: train_inputs,
                        targets: train_targets,
                        seq_len: train_seq_len}
            val_cost, val_ler, lr, steps = session.run(
                [cost, acc, learning_rate, global_step], feed_dict=val_feed)

            log = "Epoch {}/{}, steps = {}, train_cost = {:.3f}, train_ler = {:.3f}, " \
                  "val_cost = {:.3f}, val_ler = {:.3f}, time = {:.3f}s, learning_rate = {}"
            print(log.format(curr_epoch + 1, num_epochs, steps, train_cost, train_ler,
                             val_cost, val_ler, time.time() - start, lr))
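
One thing worth noting: the acc tensor above is really a label error rate (lower is better; 0.0 means a perfect match), because tf.edit_distance normalizes the edit distance by the length of the true label. A small sketch of its behavior:

hyp = tf.SparseTensor([[0, 0], [0, 1]], [1, 2], [1, 2])               # predicted "12"
truth = tf.SparseTensor([[0, 0], [0, 1], [0, 2]], [1, 2, 3], [1, 3])  # ground truth "123"
with tf.Session() as sess:
    print(sess.run(tf.edit_distance(hyp, truth)))  # [0.33333334]: 1 edit / true length 3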

Training results

By epoch 80, accuracy on the 64 test samples had reached 64%:


Test results after the 80th epoch

By epoch 100, accuracy on the 64 test samples had reached 100%, and it stayed essentially at 100% in the subsequent epochs:


The test accuracy is 100%

Afterword

Finally, the complete code is hosted on my GitHub.

The image data generated for training is ideal and noise-free, which is why accuracy reaches 100% within 100 epochs. In real applications the images may contain noise such as stray line segments or isolated dots; readers can add some noise to the generated training set themselves to test how well the model trains. A simple sketch of one such corruption follows.
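
For instance, a salt-and-pepper corruption could be applied to each generated image before it enters the batch (an illustrative numpy-only sketch; the function name and noise ratio are my own choices, not from the article's code):

import numpy as np

def add_salt_pepper(image, ratio=0.02):
    # Flip a random fraction of pixels to black (0) or white (255); illustrative only
    noisy = image.copy()
    mask = np.random.rand(*noisy.shape) < ratio
    noisy[mask] = np.random.choice([0, 255], size=int(mask.sum()))
    return noisy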

The strings generated in this article cover only the 10 classes 0 to 9. If we later add the 26*2 upper- and lower-case English letters, or the 3,500+ common Chinese characters, can the model still recognize the strings well as the number of classes keeps growing? How fast does it converge?

While writing the sample code for this article I mostly referred to other people's code and models, and there are many underlying principles I have not yet understood. I am recording notes here first; the mathematical models and formulas behind them are left for later study.

Reference links

  1. Understanding LSTM Networks (translation)
  2. tensorFlow_LSTM_CTC_OCR
  3. Tensorflow LSTM+CTC/warpCTC usage details