1. The principle of CRF
1.1 CRF by example
Simply put, a CRF (conditional random field) is a probabilistic graphical model that scores whether adjacent variables in the graph satisfy a set of feature functions. The example below applies a CRF to merchant identification: given the merchant text as input, the model outputs fields such as address, name, keywords, and business scope, tagged with the BIOS scheme as follows:
- Transition feature function: $t(y_{i-1}, y_i, x, i)$
- State feature function: $s(y_i, x, i)$
The transition feature function $t$ takes four arguments, and the state feature function $s$ takes three:
- $x$, the sentence to be tagged
- $i$, the position of the current (i-th) word in the sentence
- $y_i$, the label assigned to the i-th word
- $y_{i-1}$, the label assigned to the (i-1)-th word
Each feature function outputs 0 or 1: 0 means the candidate label sequence does not match the feature, 1 means it does. $\lambda$ and $\mu$ are the weights of the transition feature function $t$ and the state feature function $s$, respectively.
1.2 Taking the merchant identification task above as an example
- When the merchant KEYWORDS are followed by the BUSINESS scope, we can give a positive score; the corresponding transition feature function (I-KEYWORDS → B-BUSINESS) is
  $t(y_{i-1}=\text{KEYWORDS}, y_i=\text{BUSINESS}, x, i) = 1$
- To tag "Meijia" as KEYWORDS, we can give a positive score; the corresponding state feature function fires when $x_i$ is "Meijia":
  $s(y_i=\text{KEYWORDS}, x, i) = 1$ (both are sketched in code below)
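Below is a minimal Python sketch of these two hand-written features; the toy sentence, helper names, and weights are made up for illustration and are not part of any library.

```python
def t_keywords_to_business(y_prev, y, x, i):
    """Transition feature: fires (returns 1) when a KEYWORDS tag is followed by a BUSINESS tag."""
    return 1 if (y_prev == "KEYWORDS" and y == "BUSINESS") else 0

def s_meijia_is_keywords(y, x, i):
    """State feature: fires when the current word is 'Meijia' and it is tagged KEYWORDS."""
    return 1 if (x[i] == "Meijia" and y == "KEYWORDS") else 0

x = ["Meijia", "nail-care"]       # toy input sentence (hypothetical)
y = ["KEYWORDS", "BUSINESS"]      # a candidate label sequence

# Weighted score of the sequence, cf. the lambda/mu weights above
lam, mu = 1.0, 1.0
score = sum(lam * t_keywords_to_business(y[i-1], y[i], x, i) for i in range(1, len(x)))
score += sum(mu * s_meijia_is_keywords(y[i], x, i) for i in range(len(x)))
print(score)   # both features fire once -> 2.0
```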
1.3 Parameterizing the above process
Turning these scores into probabilities with the softmax function gives

$$P(y|x) = \frac{1}{Z(x)} \exp\left(\sum_{i,k}\lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l}\mu_l s_l(y_i, x, i)\right)$$

Combining the transition and state feature functions into a single set of feature functions $f_k$ with weights $w$, the formula above can be written as

$$P(y|x) = \frac{1}{Z(x)} \exp\left(\sum_{i,k} w_k f_k(y_{i-1}, y_i, x, i)\right)$$

where $Z(x)$ is the normalization term

$$Z(x) = \sum_{y} \exp\left(\sum_{i,k}\lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l}\mu_l s_l(y_i, x, i)\right)$$
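As a quick sanity check of these formulas, the sketch below computes $Z(x)$ and $P(y|x)$ by brute force over every label sequence for a two-word toy example; the label set, sentence, feature functions, and weights are all invented for illustration.

```python
from itertools import product
from math import exp

# Toy feature functions, weights, sentence and label set (all illustrative)
t = lambda y_prev, y, x, i: 1 if (y_prev == "KEYWORDS" and y == "BUSINESS") else 0
s = lambda y, x, i: 1 if (x[i] == "Meijia" and y == "KEYWORDS") else 0
lam, mu = 1.0, 1.0
x = ["Meijia", "nail-care"]
labels = ["KEYWORDS", "BUSINESS", "O"]

def score(y_seq):
    trans = sum(lam * t(y_seq[i - 1], y_seq[i], x, i) for i in range(1, len(x)))
    state = sum(mu * s(y_seq[i], x, i) for i in range(len(x)))
    return trans + state

# Z(x): sum of exp(score) over every possible label sequence y
Z = sum(exp(score(y_seq)) for y_seq in product(labels, repeat=len(x)))

y_seq = ("KEYWORDS", "BUSINESS")
print(exp(score(y_seq)) / Z)   # P(y|x) for this candidate labeling
```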
2. CRF feature construction
The CRF model involves the following two types of feature templates:
2.1 Basic features, commonly used in CRF models, fall into the following four categories (a small extraction sketch follows the list):
- Whether the token is a number
  - English numerals: 1-10
  - Chinese numerals: 一, 二, 三, 四, 五, 六, 七, 八, 九, 十
  - Traditional (financial) Chinese numerals: 壹, 贰, 叁, 肆, 伍, 陆, 柒, 捌, 玖, 拾
- Whether the token is uppercase/lowercase
- Whether the token is at the start/end of the text
- Whether the previous word is lowercase/uppercase; whether the next word is lowercase/uppercase
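Here is a minimal sketch of how such basic features might be extracted for a single token; the function name and feature keys are illustrative choices rather than any library's API.

```python
CHINESE_NUMERALS = set("一二三四五六七八九十")
FINANCIAL_NUMERALS = set("壹贰叁肆伍陆柒捌玖拾")

def basic_features(tokens, i):
    """Basic (non-n-gram) features for the i-th token of a tokenized text."""
    tok = tokens[i]
    feats = {
        "is_digit": tok.isdigit(),
        "is_chinese_numeral": all(c in CHINESE_NUMERALS for c in tok),
        "is_financial_numeral": all(c in FINANCIAL_NUMERALS for c in tok),
        "is_upper": tok.isupper(),
        "is_lower": tok.islower(),
        "is_text_start": i == 0,
        "is_text_end": i == len(tokens) - 1,
    }
    if i > 0:                                  # case of the previous word
        feats["prev_is_lower"] = tokens[i - 1].islower()
        feats["prev_is_upper"] = tokens[i - 1].isupper()
    if i < len(tokens) - 1:                    # case of the next word
        feats["next_is_lower"] = tokens[i + 1].islower()
        feats["next_is_upper"] = tokens[i + 1].isupper()
    return feats

print(basic_features(["CRF", "uses", "十", "features"], 2))
```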
2.2 N-gram features
An n-gram is a sequence of N characters or words; the items are ordered but are not required to be distinct from one another.
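A short sketch of turning character n-grams around a position into CRF features; the window layout and key names are illustrative.

```python
def ngram_features(text, i, n=2):
    """Character n-grams of length n that cover position i, keyed by their offset from i."""
    feats = {}
    for start in range(i - n + 1, i + 1):
        if start >= 0 and start + n <= len(text):
            feats[f"ngram[{start - i}:{start - i + n}]"] = text[start:start + n]
    return feats

print(ngram_features("merchant", 3, n=2))   # the two bigrams that contain the letter 'c'
```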
3. Application of CRF in NER
CRF is widely used in sequence labeling. The following uses the sklearn-crfsuite package to build a CRF sequence labeling model in four steps: data preparation, feature generation, model training, and evaluation. The code can be run directly.
3.1 Data Preparation
Import the required packages:
```python
import sklearn
import scipy.stats
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
```
Download the CoNLL 2002 data using NLTK:
```python
import nltk
nltk.download('conll2002')
```
Output:
```
>>> [nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data] Package conll2002 is already up-to-date!
True
```
Load the conll2002 data:
```python
%%time
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
```
View one piece of data:
```python
train_sents[0]
```
Output:
```
>>> [('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]
```
Usually our data contains only the text and the NER annotations, so we keep just the text and the BIO tags from the data above and view one piece:
```python
train_sents_ner = [[(i[0], i[2]) for i in row] for row in train_sents]
test_sents_ner = [[(i[0], i[2]) for i in row] for row in test_sents]
train_sents_ner[0]
```
Output:
```
>>> [('Melbourne', 'B-LOC'),
 ('(', 'O'),
 ('Australia', 'B-LOC'),
 (')', 'O'),
 (',', 'O'),
 ('25', 'O'),
 ('may', 'O'),
 ('(', 'O'),
 ('EFE', 'B-ORG'),
 (')', 'O'),
 ('.', 'O')]
```
3.2 Feature Generation
Generate features using the template from the official documentation:
```python
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]   # here sent[i][1] is the NER tag; it is not used as a feature
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit()
    }
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper()
        })
    else:
        features['BOS'] = True
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper()
        })
    else:
        features['EOS'] = True
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]
```
What does a transformed feature look like?
```python
# In the first training sample, the features of the third word (Australia) are as follows
sent2features(train_sents_ner[0])[2]
```
Output:
```
>>> {'+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:word.lower()': ')',
 '-1:word.istitle()': False,
 '-1:word.isupper()': False,
 '-1:word.lower()': '(',
 'bias': 1.0,
 'word.isdigit()': False,
 'word.istitle()': True,
 'word.isupper()': False,
 'word.lower()': 'australia',
 'word[-2:]': 'ia',
 'word[-3:]': 'lia'}
```
Convert both the training data and the test data to their feature representations:
```python
X_train = [sent2features(s) for s in train_sents_ner]
y_train = [sent2labels(s) for s in train_sents_ner]
X_test = [sent2features(s) for s in test_sents_ner]
y_test = [sent2labels(s) for s in test_sents_ner]
```
3.3 Model training
```python
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True)
crf.fit(X_train, y_train)
```
Output:
```
>>> CPU times: user 35 s, sys: 21.8 ms, total: 35.1 s
Wall time: 35.1 s
```
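The c1 and c2 arguments are the L1 and L2 regularization strengths; this is presumably why scipy.stats was imported at the top. If you want to tune them, a possible sketch with randomized search, loosely following the sklearn-crfsuite documentation, is shown below; the parameter ranges, n_iter, and cv values are illustrative choices, and depending on your scikit-learn version the CRF estimator may need minor adjustments to work with RandomizedSearchCV.

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer

# Score on all labels except 'O', weighted by support
labels_for_scoring = [l for l in crf.classes_ if l != 'O']
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=labels_for_scoring)

# Illustrative search space for the regularization strengths
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}
rs = RandomizedSearchCV(crf, params_space, cv=3, n_iter=10, scoring=f1_scorer, verbose=1)
rs.fit(X_train, y_train)
print(rs.best_params_, rs.best_score_)
```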
3.4 Model prediction
```python
labels = list(crf.classes_)
labels.remove('O')
labels
```
Output:
```
>>> ['B-LOC', 'B-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']
```
```python
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)
```
Output:
```
>>> 0.7860514251609507
```
```python
# Group the B and I results by entity type
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3))
```
Output:
```
>>>               precision    recall  f1-score   support

       B-LOC      0.800     0.778     0.789      1084
       I-LOC      0.672     0.631     0.651       325
      B-MISC      0.721     0.534     0.614       339
      I-MISC      0.686     0.582     0.630       557
       B-ORG      0.804     0.821     0.812      1400
       I-ORG      0.846     0.776     0.810      1104
       B-PER      0.832     0.865     0.849       735
       I-PER      0.884     0.935     0.909       634

   micro avg      0.803     0.775     0.789      6178
   macro avg      0.781     0.740     0.758      6178
weighted avg      0.800     0.775     0.786      6178
```
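Finally, to connect back to the transition and state feature functions from section 1, the weights a fitted model has learned can be inspected. This sketch uses the `transition_features_` and `state_features_` attributes that, to the best of my knowledge, sklearn-crfsuite exposes on a trained CRF; ranking them with `Counter` follows the style of the official tutorial.

```python
from collections import Counter

# Learned transition weights: a dict mapping (label_from, label_to) -> weight
print("Most likely transitions:")
for (frm, to), w in Counter(crf.transition_features_).most_common(5):
    print(f"{frm} -> {to}: {w:.3f}")

# Learned state-feature weights: a dict mapping (attribute, label) -> weight
print("Strongest state features:")
for (attr, label), w in Counter(crf.state_features_).most_common(5):
    print(f"{attr} / {label}: {w:.3f}")
```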