Notes reprinted on GitHub project: github.com/NLP-LOVE/In…

8. Named entity recognition

8.1 Overview

  1. Named entities

    Text contains phrases that refer to real-world entities, such as people’s names, place names, organization names, stock and fund names, medical terms, and so on; these are called named entities. Named entities share the following characteristics:

    • Unbounded in number. New named entities keep being coined, such as names for newly discovered stars and newborn babies.
    • Flexible word formation. The Industrial and Commercial Bank of China, for example, is also referred to by shortened forms such as ICBC.
    • Ambiguous categories. Some place names are institutions in their own right, such as the National Museum.
  2. Named entity recognition

    The task of identifying the boundaries and categories of named entities in a sentence is called named entity recognition (NER). Because of the difficulties above, NER is chiefly a statistical task, supplemented by rules.

    Named entities with strong regularity, such as URLs, e-mail addresses, ISBNs, and product numbers, can be handled entirely by regular expressions, leaving only the unmatched fragments to a statistical model.

    Named entity recognition can also be cast as a sequence labeling problem. Each word that makes up a named entity receives a {B,M,E,S} tag with the entity category attached, such as B-ns/M-ns/E-ns for place names, while words outside the boundaries of any named entity are uniformly tagged O (Outside). In practice, HanLP makes a simplification: every named entity that is not a compound word is tagged simply as S, with no category attached. This shrinks the tag set and, with it, the model.
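As noted above, strongly regular entities such as e-mail addresses and ISBNs can be filtered out by regular expressions before any statistical model runs. A minimal sketch in Python follows; the patterns and function name are illustrative placeholders, not what HanLP actually uses:

```python
import re

# Hypothetical patterns for a few strongly regular entity types;
# real systems would use more careful expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "URL": re.compile(r"https?://\S+"),
    "ISBN": re.compile(r"ISBN(?:-1[03])?:?\s*[\d-]{10,17}X?"),
}

def match_regular_entities(text):
    """Return (start, end, label) spans caught by the regexes; the
    fragments not matched here are left to the statistical model."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

print(match_regular_entities("Contact admin@example.com, ISBN 978-7-115-51976-7"))
```

A real pipeline would mask the matched spans and run the sequence labeler only on what remains.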

In fact, named entity recognition can be regarded as an integration of word segmentation and part-of-speech tagging: the {B,M,E,S} tags determine a named entity’s boundary, while the attached category (as in B-nt) determines its type.
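The conversion from entity spans to category-augmented {B,M,E,S} tags can be sketched as follows; the function name and the span representation are my own, chosen for illustration:

```python
def bmes_tags(words, entities):
    """Convert entity spans over a word sequence into {B,M,E,S}-category
    tags; words outside any entity are tagged 'O' (Outside).
    `entities` maps (start, end) word-index spans to a category, e.g. 'nt'."""
    tags = ["O"] * len(words)
    for (start, end), category in entities.items():
        if end - start == 1:            # single-word entity
            tags[start] = "S-" + category
        else:                           # multi-word entity: B ... M ... E
            tags[start] = "B-" + category
            for i in range(start + 1, end - 1):
                tags[i] = "M-" + category
            tags[end - 1] = "E-" + category
    return tags

# "North China Electric Power Company" as three words forming one
# organization name (nt), followed by a one-word person name (nr).
words = ["华北", "电力", "公司", "董事长", "谭旭光"]
entities = {(0, 3): "nt", (4, 5): "nr"}
print(bmes_tags(words, entities))
# → ['B-nt', 'M-nt', 'E-nt', 'O', 'S-nr']
```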

HanLP performs the corpus conversion internally, so users need not care about it; they only need to pass in the path of a corpus in PKU format.

8.2 Named entity recognition based on hidden Markov model sequence labeling

We introduced the hidden Markov model earlier; for details, see Chapter 4: Hidden Markov models and sequence labeling.

Hidden Markov model named entity recognition code (automatically downloads the PKU corpus): hmm_ner.py

Github.com/NLP-LOVE/In…

Running the code results in the following:

North China Electric Power Company/nt Chairman/n Xuguang Tan/nr and/c Secretary/n Huarui Hu/nr came to/v New York, USA/ns Modern/ntc Art/n Museum/n to visit/v

Here, the organization name “North China Electric Power Company” and the personal names “Tan Xuguang” and “Hu Huarui” were all identified correctly. However, the place name “Museum of Modern Art, New York, USA” was not recognized. There are two reasons:

  • This sample does not appear in the PKU corpus.
  • Hidden Markov models can’t make use of part-of-speech features.

The first problem can only be addressed by annotating additional corpus; the second, by switching to a more powerful model.

8.3 Named entity recognition based on perceptron sequence labeling

We introduced the perceptron model earlier; for details, see Chapter 5: Perceptron classification and sequence labeling.

Perceptron named entity recognition code (automatically downloads the PKU corpus): perceptron_ner.py

Github.com/NLP-LOVE/In…

It will run somewhat slowly with the following results:

North China Electric Power Company/nt Chairman/n Xuguang Tan/nr and/c Secretary/n Huarui Hu/nr came to/v [New York, USA/ns Modern/ntc Art/n Museum/n]/ns to visit/v

Compared with the hidden Markov model, the perceptron correctly identifies the compound place name.

8.4 Named entity recognition based on conditional random field sequence labeling

We introduced the conditional random field model earlier; for details, see Chapter 6: Conditional random fields and sequence labeling.

Conditional random field named entity recognition code (automatically downloads the PKU corpus): crf_ner.py

Github.com/NLP-LOVE/In…

The running time will be longer and the result is as follows:

North China Electric Power Company/nt Chairman/n Xuguang Tan/nr and/c Secretary/n Huarui Hu/nr came to/v [New York, USA/ns Modern/ntc Art/n Museum/n]/ns to visit/v

The result is the same as the perceptron’s.

8.5 Standardized evaluation of named entity recognition

The accuracy of a named entity recognition module cannot be judged subjectively from a handful of sentences. Every supervised learning task has a standardized evaluation scheme; for named entity recognition, the conventional metrics are precision (P), recall (R), and F1.

The standardized evaluation results on the January 1998 People’s Daily corpus are as follows:

model                      P        R        F1
Hidden Markov model        79.01    30.14    43.64
Perceptron                 87.33    78.98    82.94
Conditional random field   87.93    73.75    80.22

It is worth mentioning that accuracy is closely tied to the evaluation strategy, the feature template, and the corpus size. Generally speaking, when the corpus is small, use simple feature templates to prevent the model from overfitting; when the corpus is large, more features are advisable for higher accuracy. With the feature template fixed, the larger the corpus, the higher the accuracy.
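The span-level P/R/F1 computation conventionally used for NER can be sketched as follows. An entity counts as correct only if both its boundaries and its category match the gold annotation exactly; the sample data below is made up for illustration:

```python
def prf1(gold, pred):
    """Span-level precision, recall, and F1 for NER.
    `gold` and `pred` are sets of (start, end, category) triples;
    an entity is correct only on an exact boundary-and-category match."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {(0, 3, "nt"), (4, 5, "nr"), (7, 11, "ns")}
pred = {(0, 3, "nt"), (4, 5, "nr"), (7, 9, "ns")}  # wrong boundary on the last one
p, r, f1 = prf1(gold, pred)
print(round(p, 2), round(r, 2), round(f1, 2))
# → 0.67 0.67 0.67
```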

8.6 Custom-domain named entity recognition

So far we have dealt only with general-domain corpora, in which the named entities are limited to people’s names, place names, and organization names. Suppose we want to identify named entities in a specialized domain; in that case we need a custom domain corpus.

  1. Annotating a domain named entity recognition corpus

    First, we need to collect some text as the raw material for annotation; this is called the raw corpus. Since our goal is to identify fighter-jet names and models in text, the corpus should be sourced from reports on military websites. In a practical project, if the request comes from a customer, the customer should provide the raw corpus. The larger the corpus, the better; usually at least several thousand sentences are needed.

    Once the raw corpus is ready, annotation can begin. For a named entity recognition corpus that uses words and parts of speech as features, the word segmentation boundaries and parts of speech must be annotated as well. We do not have to start from scratch, however: correcting HanLP’s automatic annotation is far less work.

    After thousands of samples have been labeled, the raw corpus becomes an annotated corpus. The following code downloads this corpus automatically.

  2. Training domain model

    Select the perceptron as the training algorithm (automatically downloads the fighter-jet corpus): plane_ner.py

    Github.com/NLP-LOVE/In…

    The running results are as follows:

    Download http://file.hankcs.com/corpus/plane-re.zip to /usr/local/lib/python3.7/site-packages/pyhanlp/static/data/test/plane-re.zip
    100.00%, 0 MB, 552 KB/s, 0 min 0 sec
    Mikoyan/nrf designed/v [MiG/nr -/w 17/m type/nx]/np ./w [MiG/nr -/w 17/m]/np type/k fighter/n than/p [MiG/nr -/w 17/m type/nx]/np performance/n better/l ./w [MiG/nr -/w Apache/nrf -/w 666/m type/q]/np born/v ./w

    This sentence appears in the corpus, so it is no surprise that it is recognized correctly. We can also fabricate a “MiG-Apache-666” to test the model’s generalization ability, and it is still identified correctly.
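Annotated corpora like the one above bracket compound named entities. Assuming the simplified `[word/pos word/pos]/tag` convention shown in the output (real corpus formats vary in detail, and the function name here is my own), a small parser might look like this:

```python
import re

def parse_compound(token_string):
    """Parse a bracketed compound annotation such as
    '[华北/ns 电力/n 公司/n]/nt' into (words, pos_tags, entity_category).
    Assumes the simplified '[...]/tag' convention; raises on other input."""
    m = re.fullmatch(r"\[(.+)\]/(\w+)", token_string)
    if not m:
        raise ValueError("not a bracketed compound: " + token_string)
    inner, category = m.group(1), m.group(2)
    # Each inner token is word/pos; split on the last slash only,
    # so words containing '/' would still parse.
    words, tags = zip(*(t.rsplit("/", 1) for t in inner.split()))
    return list(words), list(tags), category

print(parse_compound("[华北/ns 电力/n 公司/n]/nt"))
# → (['华北', '电力', '公司'], ['ns', 'n', 'n'], 'nt')
```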

8.7 GitHub

Notes on HanLP author He Han’s “Introduction to Natural Language Processing”

Github.com/NLP-LOVE/In…

The project continues to be updated…

Contents

Chapter 1: Getting started
Chapter 2: Dictionary-based word segmentation
Chapter 3: Bigram language models and Chinese word segmentation
Chapter 4: Hidden Markov models and sequence labeling
Chapter 5: Perceptron classification and sequence labeling
Chapter 6: Conditional random fields and sequence labeling
Chapter 7: Part-of-speech tagging
Chapter 8: Named entity recognition
Chapter 9: Information extraction
Chapter 10: Text clustering
Chapter 11: Text classification
Chapter 12: Dependency parsing
Chapter 13: Deep learning and natural language processing