The previous article introduced the basic NER task. This article continues with CRF, and finally uses BERT to implement the NER task.
1. CRF
Let’s start with two schematic diagrams.
Figure 1 shows BiLSTM, the model introduced in the previous article, and Figure 2 shows BiLSTM+CRF. Comparing the two figures, it is easy to see that in Figure 2 there are also path connections between labels: this is the CRF layer. The role of the CRF here is to model the transition probabilities between labels and then, among all possible tag sequences, select an optimal one (called the optimal path in probabilistic graphical models). For example, in a part-of-speech tagging task, a noun is more likely to follow an adjective, so the model tends to choose a noun after an adjective.
The BiLSTM+CRF network feeds the BiLSTM output, i.e., the score of each tag for every word (note that it is not recommended to pass the BiLSTM output through sigmoid, tanh, or softmax first), into the CRF layer. Inside the CRF, a matrix A of shape [tag_size, tag_size] is first randomly initialized, where tag_size is the number of tags, so A_ij represents the transition probability from tag i to tag j. This matrix is learned during training.
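As a minimal sketch of the idea (the tag set and sizes here are made up for illustration, not taken from the original project), the transition matrix is just a randomly initialized [tag_size, tag_size] table whose entries are later learned:

```python
import numpy as np

tags = ["B", "I", "O"]                       # made-up tag set for illustration
tag_size = len(tags)

A = np.random.randn(tag_size, tag_size)      # randomly initialized, then learned during training
b, i = tags.index("B"), tags.index("I")
print("transition score B -> I:", A[b, i])   # A[i, j]: score of moving from tag i to tag j
```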
With this matrix we can calculate a score for a tag sequence.
For example, let X be the sentence "Liu Yuan was admitted to Tsinghua University" and y be the label sequence of X. With T the length of the sentence, the score of a tag sequence is

$$s(X, y) = \sum_{t=2}^{T} A_{y_{t-1}, y_t} + \sum_{t=1}^{T} P_{t, y_t}$$

where $A_{y_{t-1}, y_t}$ is the score of transferring from the label at time t-1 to the label at time t, taken from the label transition matrix A, and $P_{t, y_t}$ is the BiLSTM output for label $y_t$ at time t. So, it's that simple: the score of a tag sequence is obtained by simple addition. But note that $s(X, y)$ is a score, not a probability. The following formula is needed to convert the score into a probability:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$
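A small NumPy sketch of this addition, with an invented emission matrix P (standing in for the BiLSTM output) and an invented tag sequence, just for illustration:

```python
import numpy as np

def sequence_score(P, A, y):
    """s(X, y): sum of emission scores P[t, y_t] plus transition scores A[y_{t-1}, y_t]."""
    emission = sum(P[t, y[t]] for t in range(len(y)))
    transition = sum(A[y[t - 1], y[t]] for t in range(1, len(y)))
    return emission + transition

T, tag_size = 5, 3                        # made-up: 5 words, 3 tags
P = np.random.randn(T, tag_size)          # stand-in for the BiLSTM output (one score per word per tag)
A = np.random.randn(tag_size, tag_size)   # stand-in for the transition matrix
y = [0, 1, 1, 2, 0]                       # some tag sequence
print(sequence_score(P, A, y))            # a plain score, not a probability
```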
Here $Y_X$ is the set of all possible tag sequences for the current sentence $X$, and the denominator is the sum of $e^{s(X, \tilde{y})}$ over all of them. If $X$ contains 10 words and the task has three tags, there are $3^{10}$ possible tag sequences. Now let's look at training. Every sentence $X$ in the training data has an answer sequence $y$; we work out the score of the answer sequence $y$, use softmax to get its probability, and finally solve by maximum likelihood estimation, i.e., minimize the following loss function:

$$L = -\log p(y \mid X) = -\log \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$
Simplify the formula:

$$L = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)$$
Here $s(X, y)$ is the score of the answer tag sequence, which is easy to calculate; the trouble is how to compute $\log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$, the term that sums over every possible sequence. Actually, we don't have to worry too much about it, because the total value of all sequences up to time $t$ can be computed from the total value of all sequences up to time $t-1$, so the sequences never need to be enumerated explicitly. Here's a quick explanation: suppose $X$ has only two words, and split the score of a sequence $(y_1, y_2)$ as $s = s_1 + s_2$, where $s_1 = P_{1, y_1}$ covers time 1 and $s_2 = A_{y_1, y_2} + P_{2, y_2}$ covers time 2. Plugging this in gives

$$\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} = \sum_{y_2} \sum_{y_1} e^{s_1} e^{s_2},$$

and the inner sum over $y_1$ is built exactly from the values of all sequences at time 1. So to solve for the whole term, we only need to accumulate the total value of the sequences at each time step.
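A minimal sketch of this step-by-step accumulation, using the same kind of made-up P and A as above; on a small example it matches brute-force enumeration of all sequences:

```python
import itertools
import numpy as np

def log_total_score(P, A):
    """log of the sum over all tag sequences of exp(score), accumulated one time step at a time."""
    T, tag_size = P.shape
    alpha = P[0].copy()                                   # per ending tag: log-sum of all length-1 prefixes
    for t in range(1, T):
        # cand[i, j] = alpha[i] + A[i, j] + P[t, j]  (previous tag i -> current tag j)
        cand = alpha[:, None] + A + P[t][None, :]
        m = cand.max(axis=0)
        alpha = m + np.log(np.exp(cand - m).sum(axis=0))  # stable log-sum-exp over previous tags
    return alpha.max() + np.log(np.exp(alpha - alpha.max()).sum())

# Made-up example: 4 words, 3 tags.
T, tag_size = 4, 3
P = np.random.randn(T, tag_size)                          # stand-in for the BiLSTM outputs
A = np.random.randn(tag_size, tag_size)                   # stand-in for the transition matrix

def seq_score(y):                                         # same additive score as before
    return sum(P[t, y[t]] for t in range(T)) + sum(A[y[t - 1], y[t]] for t in range(1, T))

# Brute force over all 3**4 sequences agrees with the step-by-step version.
brute = np.log(sum(np.exp(seq_score(y)) for y in itertools.product(range(tag_size), repeat=T)))
print(np.isclose(log_total_score(P, A), brute))           # True
```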
This is the approach of the paper Bidirectional LSTM-CRF Models for Sequence Tagging, which places a probabilistic graphical model (CRF) on top of a bidirectional LSTM.
2. The code
Using CRF in TensorFlow is convenient, because TensorFlow fully encapsulates it and a single method call does the work.
log_likelihood, trans = tf.contrib.crf.crf_log_likelihood(
    inputs=logits,                  # BiLSTM output: a score for each tag of each token, [batch_size, seq_len, tag_num]
    tag_indices=self.labels,        # the real tag of each token
    sequence_lengths=self.lengths)  # the sentence length of each sample
The log_likelihood returned by the method is the log-likelihood of each sample (its negative is the loss), and trans is the learned label transition matrix. When building train_op, note that gradient descent should be applied to tf.reduce_mean(-log_likelihood):
train_op = tf.train.AdamOptimizer(learning_rate).minimize(tf.reduce_mean(-log_likelihood))
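Putting the pieces together, here is a rough TF 1.x sketch of how the logits might be produced and fed into the CRF loss. The placeholders, dimensions, and variable names are assumptions for illustration, not the original project's code:

```python
import tensorflow as tf  # TensorFlow 1.x, where tf.contrib.crf is available

# Assumed placeholders and sizes, for illustration only.
word_embeddings = tf.placeholder(tf.float32, [None, None, 100])   # [batch, seq_len, emb_dim]
labels = tf.placeholder(tf.int32, [None, None])                   # gold tag id per token
lengths = tf.placeholder(tf.int32, [None])                        # real length of each sentence
num_tags, lstm_dim = 7, 128

# BiLSTM encoder.
cell_fw = tf.nn.rnn_cell.LSTMCell(lstm_dim)
cell_bw = tf.nn.rnn_cell.LSTMCell(lstm_dim)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, word_embeddings, sequence_length=lengths, dtype=tf.float32)
bilstm_out = tf.concat([out_fw, out_bw], axis=-1)                  # [batch, seq_len, 2*lstm_dim]

# Project to per-token tag scores; no sigmoid/tanh/softmax, the raw scores go into the CRF.
logits = tf.layers.dense(bilstm_out, num_tags)                     # [batch, seq_len, num_tags]

# CRF loss and training op, as described above.
log_likelihood, trans = tf.contrib.crf.crf_log_likelihood(
    inputs=logits, tag_indices=labels, sequence_lengths=lengths)
loss = tf.reduce_mean(-log_likelihood)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```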
#### prediction stage
Prediction means finding the tag sequence with the highest score among all tag sequences. The Viterbi algorithm is used for this, and TensorFlow provides a wrapper for it as well (a usage sketch follows the parameter description below).
decode_tags, best_score = tf.contrib.crf.crf_decode(
    potentials=logits, transition_params=trans, sequence_length=self.lengths)
Input:

- logits: the output of BiLSTM, i.e., the score of each tag for every word.
- transition_params: the label transition probability matrix trained by the CRF.
- sequence_length: the sentence length of each sample to predict.

Returns:

- decode_tags: the predicted optimal tag sequence.
- best_score: the score corresponding to the predicted optimal tag sequence.
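A hedged sketch of how the decode op might be used at prediction time, continuing the TF 1.x sketch above (it reuses logits, trans, lengths, and word_embeddings from there); the dummy batch and feed setup are assumptions:

```python
import numpy as np

decode_tags, best_score = tf.contrib.crf.crf_decode(
    potentials=logits, transition_params=trans, sequence_length=lengths)

# Dummy batch just to show the feed: 2 sentences padded to 6 tokens, 100-dim embeddings.
batch_embeddings = np.random.randn(2, 6, 100).astype(np.float32)
batch_lengths = np.array([6, 4], dtype=np.int32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    pred_tags, pred_scores = sess.run(
        [decode_tags, best_score],
        feed_dict={word_embeddings: batch_embeddings, lengths: batch_lengths})
# pred_tags: [batch, seq_len] best tag ids; positions beyond each real length are padding.
```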
3. BERT-BiLSTM-CRF
An earlier article, BERT in Detail (hands-on), introduced how to use BERT; you can refer to it. For BERT-BiLSTM-CRF, the output of BERT can be regarded as word vectors, so we only need to replace the original word-vector part with BERT.
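A minimal sketch of this replacement, assuming modeling.py from the google-research/bert repo is on the import path and using made-up placeholders and sizes; only the source of the embeddings changes, while the BiLSTM-CRF part stays as sketched earlier:

```python
import tensorflow as tf        # TensorFlow 1.x
import modeling                # modeling.py from the google-research/bert repo, assumed on the path

# Assumed placeholders for illustration.
input_ids = tf.placeholder(tf.int32, [None, None])     # WordPiece ids
input_mask = tf.placeholder(tf.int32, [None, None])    # 1 for real tokens, 0 for padding
segment_ids = tf.placeholder(tf.int32, [None, None])
labels = tf.placeholder(tf.int32, [None, None])
lengths = tf.reduce_sum(input_mask, axis=-1)            # real length of each sentence
num_tags = 7                                            # made-up tag count

bert_config = modeling.BertConfig.from_json_file("bert_config.json")  # path is an assumption
bert = modeling.BertModel(
    config=bert_config, is_training=True,
    input_ids=input_ids, input_mask=input_mask, token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

# BERT's per-token output plays the role of the original word vectors...
embeddings = bert.get_sequence_output()                 # [batch, seq_len, hidden_size]

# ...and everything downstream (BiLSTM + CRF) is unchanged.
cell_fw = tf.nn.rnn_cell.LSTMCell(128)
cell_bw = tf.nn.rnn_cell.LSTMCell(128)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, embeddings, sequence_length=lengths, dtype=tf.float32)
logits = tf.layers.dense(tf.concat([out_fw, out_bw], axis=-1), num_tags)
log_likelihood, trans = tf.contrib.crf.crf_log_likelihood(
    inputs=logits, tag_indices=labels, sequence_lengths=lengths)
loss = tf.reduce_mean(-log_likelihood)
```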
The source code has been committed to Git.