A brief introduction to sequence labeling
Sequence labeling (also called sequence tagging) is one of the most basic tasks in NLP and is widely used. Word segmentation, part-of-speech (POS) tagging, named entity recognition (NER), keyword extraction, semantic role labeling, and slot filling are all, in essence, sequence labeling tasks.
Brief introduction to named entity recognition
Named entity recognition (NER), also known as "proper name recognition", refers to recognizing entities with specific meanings in text, including person names, place names, organization names, and other proper nouns.
The role of named entity recognition
Named entity recognition is an important basic tool for information extraction, question answering systems, syntactic analysis, machine translation, metadata annotation for the Semantic Web, and other application areas, and it plays an important role in bringing natural language processing technology into practical use. Generally speaking, the task is to identify the named entities in the text to be processed, which fall into three major categories (entity, time, and number) and seven sub-categories (person name, organization name, place name, time, date, currency, and percentage).
The steps of named entity recognition
Named entity recognition usually consists of two parts:
- Entity boundary recognition;
- Entity category determination (person, place, organization, or other).
Description of Label Types
In named entity recognition, a label is assigned to each token: usually a single character in Chinese and a single word in English (English words are separated by spaces). The label types are as follows:
Label | Meaning |
---|---|
B | Begin: the beginning of the entity fragment |
I | Inside (Intermediate): the middle of the entity fragment |
M | Middle: the middle of the entity fragment |
E | End: the end of the entity fragment |
S | Single: the entity fragment is a single token |
O | Other: the token does not belong to any entity |
Three sequence labeling schemes in brief
Three common sequence labeling schemes for entity recognition are as follows:
- BIO: marks the beginning, the inside, and the non-entity parts of the text
- BMES: marks the beginning, middle, and end of a fragment, and adds the S label for a fragment that is a single token
- BIOES: extends BIO with the E label for the end of an entity and the S label for a single-token entity
BIO: three-tag sequence labeling (B-Begin, I-Inside, O-Outside)
- B-X marks the beginning of an entity of type X
- I-X marks the inside (continuation) of an entity of type X
- O marks a token that does not belong to any entity
Sample (a Chinese sentence labeled character by character, shown here as an English gloss):
I am O Li B-per fruit I-per frozen I-per , O I love O Chinese B-org , O I come from O four B-loc chuan I-loc . O
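To make the BIO rules concrete, here is a minimal sketch of how entity spans could be turned into BIO tags. The function name `spans_to_bio`, the `(start, end, label)` span format with an exclusive `end`, and the English example sentence are all assumptions made for illustration; they do not come from any particular library or dataset.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into BIO tags (illustrative sketch).

    tokens: list of tokens (characters for Chinese, words for English).
    spans:  list of (start, end, label) tuples; `end` is exclusive.
            This span format is an assumption made for this example.
    """
    tags = ["O"] * len(tokens)          # every token starts as non-entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return tags


# Hypothetical example sentence, tokenized by word.
tokens = ["Alice", "Smith", "visited", "Paris", "."]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
bio_tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```

Note that under BIO a single-token entity such as "Paris" simply gets a B- tag with no I- tag after it; the scheme has no dedicated label for that case.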
BMES: four-tag sequence labeling (B-Begin, M-Middle, E-End, S-Single)
- B marks the first position of a word
- M marks a middle position of a word
- E marks the last position of a word
- S marks a word that consists of a single character
Sample:
I am S, M, E
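As a rough illustration of the BMES rules, the sketch below derives one tag per character from a sentence that has already been split into words. The helper name `words_to_bmes` and the example segmentation are hypothetical and chosen only so that all four tags appear.

```python
def words_to_bmes(words):
    """Assign one BMES tag per character, given pre-segmented words (sketch)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                        # single-character word
        else:
            middle = ["M"] * (len(word) - 2)        # zero or more middle positions
            tags.extend(["B"] + middle + ["E"])     # first, middle(s), last
    return tags


# Hypothetical segmentation of a short Chinese sentence.
words = ["我", "来自", "四川省"]
chars = [ch for word in words for ch in word]
bmes_tags = words_to_bmes(words)

for ch, tag in zip(chars, bmes_tags):
    print(ch, tag)
```

A two-character word produces only B and E, so M shows up only for words of three or more characters.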
BIOES: five-tag sequence labeling (B-Begin, I-Inside, O-Outside, E-End, S-Single)
- B marks the beginning of an entity
- I marks the inside of an entity
- O marks a non-entity token
- E marks the end of an entity
- S marks a token that is an entity by itself
Sample (a Chinese sentence labeled character by character, shown here as an English gloss):
I am O Li B-per fruit I-per frozen E-per , O I love O Zhong B-loc guo E-loc , O I come from O four B-loc Chuan E-loc . O
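Since BIOES only refines how the last token and single-token entities are marked, BIO tags can be converted to BIOES mechanically. The following is a minimal sketch of that conversion; the function name `bio_to_bioes` and the input tags are made up for illustration, and the code assumes the input is well-formed BIO (every I-X follows a B-X or I-X of the same type).

```python
def bio_to_bioes(tags):
    """Convert well-formed BIO tags to BIOES tags (illustrative sketch)."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append("O")
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same type.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes


# Hypothetical BIO input covering a multi-token, a single-token,
# and a sentence-final entity.
bio = ["O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "O", "B-LOC", "I-LOC"]
bioes_tags = bio_to_bioes(bio)
print(bioes_tags)
```

The explicit end and single labels give a model a clearer signal about entity boundaries, at the cost of a larger tag set.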
Conclusion
This article has briefly described three labeling schemes for entity recognition. As the examples above show, the schemes are broadly similar and differ mainly in whether the end of an entity and single-token entities get their own labels.