Notes reprinted on GitHub project: github.com/NLP-LOVE/In…
8. Named entity recognition
8.1 Overview
- Named entities

Some terms in text describe entities: person names, place names, organization names, stock and fund names, medical terms and so on. These are called named entities. Named entities share the following characteristics:
- Unlimited in number. New named entities keep being created, such as names for newly discovered stars or for newborn babies.
- Flexible word formation. The Industrial and Commercial Bank of China, for example, has more than one accepted short form, such as "ICBC".
- Vague category boundaries. Some place names are also institutions in their own right, such as the National Museum.
- Named entity recognition

The task of identifying the boundaries and categories of named entities in a sentence is called named entity recognition. Because of the difficulties above, named entity recognition is mainly treated as a statistical task, supplemented by rules.
Named entities with strong regularity, such as web addresses, e-mail addresses, ISBNs and product codes, can be handled entirely by regular expressions; only the fragments left unmatched need to be passed to the statistical model.
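As a rough illustration of this rule-first idea (a minimal sketch, not HanLP's implementation; the patterns and the `extract_regular_entities` helper are made up for this example):

```python
import re

# Illustrative patterns only; a real system would use more careful ones.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL":   re.compile(r"https?://\S+"),
    "ISBN":  re.compile(r"ISBN(?:-1[03])?:?\s*[\d-]{10,17}[\dX]"),
}

def extract_regular_entities(text):
    """Return (span, label, surface form) for entities matched by regular expressions.
    The remaining, unmatched fragments would be handed to the statistical model."""
    entities = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            entities.append(((m.start(), m.end()), label, m.group()))
    return entities

print(extract_regular_entities("Contact admin@example.com or see ISBN 978-0-00-000000-0"))
```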
Named entity recognition can also be cast as a sequence labeling problem: each word in a named entity is given a {B,M,E,S} tag plus a category, e.g., the words that make up a place name are tagged "B/M/E/S-place name", and so on. Words outside named entity boundaries are uniformly tagged O (Outside). In practice, HanLP makes a simplification: named entities that are not compound words are all tagged S with no category attached, which keeps the tag set and the model smaller.
In fact, named entity recognition can be seen as a fusion of word segmentation and part-of-speech tagging: the boundary of a named entity is determined by {B,M,E,S}, and its category is determined by the category suffix of tags such as B-nt.
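To make the tag scheme concrete, here is a toy sketch of such a conversion (the helper name and the sample sentence are made up; HanLP's own converter works directly on PKU-format corpora, as noted below):

```python
def to_bmes_tags(words, entity_spans):
    """words: a segmented sentence (list of words).
    entity_spans: {(start, end): category} giving which word ranges form named entities.
    Returns one tag per word; words outside any entity get 'O' (Outside)."""
    tags = ["O"] * len(words)
    for (start, end), category in entity_spans.items():
        if end - start == 1:
            tags[start] = "S-" + category  # HanLP would simplify this to plain 'S' with no category
        else:
            tags[start] = "B-" + category
            for i in range(start + 1, end - 1):
                tags[i] = "M-" + category
            tags[end - 1] = "E-" + category
    return tags

words = ["New York, USA", "Modern", "Art", "Museum", "visit"]
print(to_bmes_tags(words, {(0, 4): "ns"}))
# ['B-ns', 'M-ns', 'M-ns', 'E-ns', 'O']
```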
HanLP provides this corpus conversion internally, so users do not need to care about it; they only need to pass in the path of a corpus in PKU format.
8.2 Named entity recognition based on hidden Markov model sequence labeling
We introduced the hidden Markov model earlier; for details, see: 4. Hidden Markov model and sequence labeling
Hidden Markov model named entity recognition code (automatically downloads the PKU corpus): hmm_ner.py
Github.com/NLP-LOVE/In…
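In outline, the script trains a recognizer on the PKU corpus and then tags a pre-segmented, POS-tagged sentence. The sketch below shows that idea only; the class path, method signatures and file paths are assumptions based on the HanLP 1.x Java API exposed through pyhanlp, and the linked hmm_ner.py is the authoritative version:

```python
from pyhanlp import *

# Assumed HanLP 1.x class path; see hmm_ner.py for the exact imports.
HMMNERecognizer = JClass('com.hankcs.hanlp.model.hmm.HMMNERecognizer')

recognizer = HMMNERecognizer()
recognizer.train('pku98/199801-train.txt')  # hypothetical path to the PKU-format training corpus

# recognize() labels a sentence that has already been segmented and POS-tagged
words = ["华北电力公司", "董事长", "谭旭光"]  # North China Electric Power Company / Chairman / Xuguang Tan
pos = ["nt", "n", "nr"]
print(recognizer.recognize(words, pos))
```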
Running the code results in the following:
North China Electric Power Company/nt Chairman/n Xuguang Tan/nr and/c Secretary/n Huarui Hu/nr came to/v New York, USA/ns Modern/ntc Art/n Museum/n visit/v
Here, the organization name "North China Electric Power Company" and the person names "Xuguang Tan" and "Huarui Hu" are all recognized correctly. However, the place name "Museum of Modern Art, New York, USA" is not recognized. There are two reasons:
- This sample does not appear in the PKU corpus.
- Hidden Markov models can’t make use of part-of-speech features.
For the first reason, the only remedy is to annotate additional corpus; the second can be addressed by switching to a more powerful model.
8.3 Named entity recognition based on perceptron sequence labeling
We have introduced the perceptron model before, for details, see: 5. Perceptron classification and sequence labeling
Perceptron model named entity recognition code (automatically downloads the PKU corpus): perceptron_ner.py
Github.com/NLP-LOVE/In…
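Conceptually, switching to the perceptron mostly means swapping in a different recognizer. A minimal sketch, again assuming the HanLP 1.x API through pyhanlp (class paths, the `.getModel()` call and file paths are assumptions; perceptron_ner.py is authoritative):

```python
from pyhanlp import *

# Assumed HanLP 1.x class paths; see perceptron_ner.py for the exact imports.
NERTrainer = JClass('com.hankcs.hanlp.model.perceptron.NERTrainer')
PerceptronNERecognizer = JClass('com.hankcs.hanlp.model.perceptron.PerceptronNERecognizer')

trainer = NERTrainer()
# train(corpus_path, model_save_path), then wrap the learned linear model in a recognizer
recognizer = PerceptronNERecognizer(trainer.train('pku98/199801-train.txt', 'pku98/ner.bin').getModel())
```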
It will run somewhat slowly with the following results:
North China Electric Power Company/nt Chairman/n Xuguang Tan/nr and/c Secretary/n Huarui Hu/nr came to/v [New York, USA/ns Modern/ntc Art/n Museum/n]/ns visit/v
Compared with the hidden Markov model, the compound place name is now recognized correctly.
8.4 Named entity recognition based on conditional random field sequence labeling
We introduced the conditional random field model earlier; for details, see: 6. Conditional random field and sequence labeling
Conditional random field model named entity recognition code (automatically downloads the PKU corpus): crf_ner.py
Github.com/NLP-LOVE/In…
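The shape of the script is analogous to the perceptron case. A brief sketch under the same assumptions (class path and file paths assumed; crf_ner.py is authoritative):

```python
from pyhanlp import *

# Assumed HanLP 1.x class path; see crf_ner.py for the exact imports.
CRFNERecognizer = JClass('com.hankcs.hanlp.model.crf.CRFNERecognizer')

recognizer = CRFNERecognizer()
recognizer.train('pku98/199801-train.txt', 'pku98/ner.bin')  # CRF training typically takes longer than the perceptron
```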
The running time will be longer and the result is as follows:
North China Electric Power Company/nt Chairman/n Xuguang Tan/nr and/c Secretary/n Huarui Hu/nr came to/v [New York, USA/ns Modern/ntc Art/n Museum/n]/ns visit/v
The result is the same as the perceptron's.
8.5 Standardized evaluation of named entity recognition
The accuracy of each named entity recognition module cannot be judged subjectively from just a few sentences. Any supervised learning task has a standardized evaluation scheme; for named entity recognition, precision (P), recall (R) and F1 are used by convention.
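For reference, the three metrics are computed over predicted versus gold entities (a minimal sketch of the usual entity-level convention; the function name and the exact-span matching rule are illustrative, not HanLP's evaluator):

```python
def prf(gold, pred):
    """gold / pred: sets of (start, end, category) triples, one per entity.
    An entity counts as correct only if both its boundary and category match."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# toy example: two gold entities, two predictions, one exactly correct
print(prf({(0, 2, "nt"), (5, 9, "ns")}, {(0, 2, "nt"), (5, 8, "ns")}))  # (0.5, 0.5, 0.5)
```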
The standardized evaluation results on the January 1998 People's Daily corpus are as follows:
model | P | R | F1 |
---|---|---|---|
Hidden Markov model | 79.01 | 30.14 | 43.64 |
Perceptron | 87.33 | 78.98 | 82.94 |
Conditional random field | 87.93 | 73.75 | 80.22 |
It is worth mentioning that accuracy is closely related to the evaluation strategy, the feature template and the corpus size. Generally speaking, when the corpus is small, simple feature templates should be used to prevent overfitting; when the corpus is large, using more features is recommended for higher accuracy. With the feature template fixed, the larger the corpus, the higher the accuracy.
8.6 User-defined domain named entity recognition
So far we have only dealt with general-domain corpora, in which the named entities are limited to person names, place names and organization names. Suppose we want to recognize named entities in a specialized domain; in that case, we need a custom domain corpus.
- Annotating a domain named entity recognition corpus
First, we need to collect some text as the raw material for annotation; this is called the raw corpus. Since our goal is to recognize fighter-jet names or models in text, the corpus should come from reports on military websites. In a real project, if the request comes from a customer, the customer should provide the raw corpus. The larger the corpus the better; usually at least several thousand sentences are needed.
Once the raw corpus is ready, annotation can begin. For a named entity recognition corpus whose features are words and parts of speech, word segmentation boundaries and parts of speech must also be annotated. However, we do not have to start from scratch: we can correct HanLP's automatic annotation, which is much less work.
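For instance, one annotated sentence in such a corpus might look roughly like the line below, following the PKU convention of word/pos pairs with compound entities in brackets (this particular line is made up for illustration, shown here with translated tokens in the same style as the outputs above):

```
Mikoyan/nrf design/v [MiG/nr -/w 17/m PF/nx]/np fighter/n ./w
```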
After a few thousand sentences are annotated, the raw corpus becomes an annotated corpus. The following code downloads this corpus automatically.
- Training the domain model
Select the perceptron as the training algorithm (automatically downloads the fighter-jet corpus): plane_ner.py
Github.com/NLP-LOVE/In…
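The key difference from the general-domain case is that the recognizer is trained on the custom corpus and restricted to its entity label. A rough sketch under the same HanLP 1.x assumptions (class paths, the tag-set adjustment and file paths are assumptions; plane_ner.py is the authoritative version):

```python
from pyhanlp import *

# Assumed HanLP 1.x class paths; see plane_ner.py for the exact imports.
NERTrainer = JClass('com.hankcs.hanlp.model.perceptron.NERTrainer')
PerceptronNERecognizer = JClass('com.hankcs.hanlp.model.perceptron.PerceptronNERecognizer')

trainer = NERTrainer()
trainer.tagSet.nerLabels.clear()    # drop the default nr/ns/nt labels (exact API assumed)
trainer.tagSet.nerLabels.add("np")  # the fighter-jet corpus tags its entities as np
recognizer = PerceptronNERecognizer(trainer.train('plane-re/train.txt', 'plane-re/ner.bin').getModel())
```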
The running results are as follows:
Downloading http://file.hankcs.com/corpus/plane-re.zip to /usr/local/lib/python3.7/site-packages/pyhanlp/static/data/test/plane-re.zip
100.00%, 0 MB, 552 KB/s, 0 min 0 sec
Mikoyan/nrf design/v [MiG/nr -/w 17/m PF/nx]/np :/w [MiG/nr -/w 17/m]/np PF/n type/k fighter/n than/p [MiG/nr -/w 17/m P/nx]/np performance/n better/l ./w [MiG/nr -/w Apache/nrf -/w 666/m S/q]/np born/l ./w
This sentence appears in the corpus, so it is not surprising that it is recognized correctly. We also deliberately made up a "MiG-Apache-666S" to test the model's generalization ability, and it is still recognized correctly.
8.7 GitHub
Notes on HanLP author He Han's "Introduction to Natural Language Processing"
Github.com/NLP-LOVE/In…
The project is continuously updated……
Contents
chapter |
---|
Chapter 1: Novice on the road |
Chapter 2: Dictionary segmentation |
Chapter 3: Binary grammar and Chinese word segmentation |
Chapter 4: Hidden Markov model and sequence labeling |
Chapter 5: Perceptron classification and sequence labeling |
Chapter 6: Conditional random field and sequence labeling |
Chapter 7: Part-of-speech tagging |
Chapter 8: Named entity recognition |
Chapter 9: Information extraction |
Chapter 10: Text clustering |
Chapter 11: Text classification |
Chapter 12: Dependency parsing |
Chapter 13: Deep learning and Natural language processing |