What is information extraction

Definition: A task that automatically extracts structured information from unstructured or semi-structured text

Classification of information extraction

1. Depending on whether the extracted result appears verbatim in the original text, information extraction can be classified as extractive or generative. An extractive model is relatively rigid and returns the information exactly as it appears in the original text, while a generative model is more flexible and normalizes the target information into the desired form. Examples are as follows:

Extractive:

Fasting blood glucose was controlled at 7mmol/L -> fasting: 7mmol/L

Generative:

The tumor was raised and ulcerated, located in the gastric antrum -> Borrmann classification: polyp type
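One rough way to tell the two types apart is to check whether the extracted value occurs verbatim in the source text. The sketch below does this with a plain substring test; the function name `is_extractive` is made up for illustration, and a real system would likely normalize both strings before comparing.

```python
def is_extractive(text: str, value: str) -> bool:
    """Return True when the extracted value is a verbatim span of the text."""
    return value in text


examples = [
    ("Fasting blood glucose was controlled at 7mmol/L", "7mmol/L"),
    ("The tumor was raised and ulcerated, located in the gastric antrum", "polyp type"),
]

for text, value in examples:
    kind = "extractive" if is_extractive(text, value) else "generative"
    print(f"{value!r}: {kind}")
```

The first value can be copied out of the text, so an extractive model can produce it; the second has to be inferred, which is what a generative model is for.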

2. According to what is extracted, information extraction can be divided into entity extraction, relation extraction, and event extraction. Here are some examples.

Entity extraction:

Disease: interstitial pneumonia

Relation extraction:

Pain -- site of disease --> both knees

Event extraction:

Pathological event: Description: round-like; Lesion size: 27*28; Lesion site: dorsal segment of the left lower lung

General evaluation metrics

  • Precision P: how accurate the model's predictions are, i.e. the number of correct predictions / the number of predicted results
  • Recall R: how much of the annotated information the model finds rather than misses, i.e. the number of correct predictions / the number of manual annotations
  • F1: a combined metric, the harmonic mean of P and R, 2*P*R/(P+R)
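The three metrics can be computed directly from the set of predicted items and the set of manually annotated items. The following is a minimal sketch; the function name `precision_recall_f1` and the sample data are assumptions made for illustration, and an item counts as correct only on an exact match.

```python
def precision_recall_f1(predicted, gold):
    """Compute P, R and F1 over sets of extracted items (exact match)."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical manual annotations and model output: (category, text) pairs.
gold = {
    ("disease", "interstitial pneumonia"),
    ("site", "both knees"),
    ("symptom", "pain"),
}
predicted = {
    ("disease", "interstitial pneumonia"),
    ("symptom", "pain"),
    ("symptom", "fever"),
    ("site", "left lung"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Here 2 of the 4 predictions are correct and 2 of the 3 annotations are found, so P = 1/2, R = 2/3 and F1 = 4/7.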

Application of information extraction

  • Medicine: extracting names of diseases, medicines, symptoms, etc. from medical texts
  • Finance: extracting risk-warning information from financial documents, etc.
  • Law: extracting key information from previous cases to assist judgment, or constructing legal knowledge graphs
  • Building knowledge graphs

Key points of the information extraction model

Decoding design is the core of an information extraction model: it defines how the labels predicted by the model are transformed into structured information. There are three common decoding designs:

  • Sequence labeling: each token receives a single-category or multi-category label, commonly decoded with BIO or BMEO tags
  • Pointer: marks the start and end positions of each extraction result
  • Token Pair: labels the category of the relationship between two tokens in a sentence
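To make the first two designs concrete, here is a minimal sketch that decodes the same sentence once from BIO tags and once from start/end pointers. The function names, the tag sets, and the rule of pairing each start with the nearest following end are assumptions made for illustration; real models differ in how they resolve ambiguous labels.

```python
def decode_bio(tokens, tags):
    """Sequence labeling: merge B-/I- tags into (category, text) entities."""
    entities, current = [], None  # current = [category, [tokens]]
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = [tag[2:], [token]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            # "O" (or an I- tag that does not continue the open entity) closes it.
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = None
    if current:
        entities.append((current[0], " ".join(current[1])))
    return entities


def decode_pointer(tokens, start_flags, end_flags, category):
    """Pointer: pair each start flag with the nearest end flag at or after it."""
    entities = []
    for i, is_start in enumerate(start_flags):
        if not is_start:
            continue
        for j in range(i, len(tokens)):
            if end_flags[j]:
                entities.append((category, " ".join(tokens[i:j + 1])))
                break
    return entities


tokens = ["persistent", "pain", "in", "left", "chest", "and", "back"]

bio_tags = ["B-symptom", "I-symptom", "O", "B-site", "I-site", "I-site", "I-site"]
bio_entities = decode_bio(tokens, bio_tags)

# The pointer design keeps one start vector and one end vector per category.
pointer_entities = (
    decode_pointer(tokens, [1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], "symptom")
    + decode_pointer(tokens, [0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1], "site")
)

print(bio_entities)
print(pointer_entities)
```

Both decoders recover the same two entities here. The difference is that BIO gives every token exactly one tag, while the pointer design keeps separate vectors per category, so spans of different categories may overlap.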

Entity extraction

1. Definition: Extract spans from the text and classify them into predefined categories. Examples are as follows:

Text: The patient developed persistent pain in the left chest and back. A plain chest CT scan was performed at the Hospital of Integrated Traditional Chinese and Western Medicine.
Results: Symptoms -> persistent pain in the left chest and back

2. BIO tagging is commonly used as the decoding design

3. There are two difficulties in real projects:

  • Overlapping or nested entities: several entities in the text share a segment. For example, in “patient has persistent pain in left chest and back”, “left chest and back” is a site entity and “persistent pain” is a symptom entity, so it can be difficult to also recognize the complete entity “persistent pain in left chest and back”, which contains both
  • Discontinuous entities: an entity consists of several non-adjacent segments. For example, when a single phrase such as “anterior thoracic and dorsal mass” describes the two entities “anterior thoracic mass” and “dorsal mass”, the first one is not a contiguous span, so both cannot be labeled with conventional BIO
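The nesting problem follows from the fact that BIO assigns exactly one tag to each token. The sketch below converts spans to BIO tags and fails as soon as two spans claim the same token; the function name `spans_to_bio` and the token indices are assumptions made for illustration.

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end_inclusive, category) spans to one BIO tag per token.

    Raises ValueError when two spans share a token, because a single tag
    per token cannot represent both.
    """
    tags = ["O"] * n_tokens
    for start, end, category in spans:
        if any(tag != "O" for tag in tags[start:end + 1]):
            raise ValueError(f"span ({start}, {end}, {category}) overlaps another span")
        tags[start] = "B-" + category
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + category
    return tags


tokens = ["persistent", "pain", "in", "left", "chest", "and", "back"]

# Two flat entities: this works.
flat = [(0, 1, "symptom"), (3, 6, "site")]
print(spans_to_bio(len(tokens), flat))

# Adding the complete symptom, which contains both, cannot be encoded.
nested = flat + [(0, 6, "symptom")]
try:
    spans_to_bio(len(tokens), nested)
except ValueError as error:
    print("cannot encode:", error)
```

A discontinuous entity fails for a related reason: a BIO entity is by construction one unbroken run of B- and I- tags.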

Relation extraction

1. Definition: extract a pair of entities and a predefined relation type from the text to obtain an entity-relation triple that carries semantic information. By default the relation between the two entities is directed, and they are called the head entity and the tail entity. For example:

Results: (Right lower lung, site disease, inflammation)

2. The solution is usually split into two steps. The first step is basic entity extraction; the second step takes a pair of entities together with the sentence and decides which relation holds between them. The second step is also called relation classification: given a pair of entities and the text, determine the type of relation between them. A scheme that extracts with several models in sequence is usually called Pipeline, and a scheme that extracts with a single model is called Joint. This article focuses on Joint.
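For contrast with Joint, the two-step Pipeline flow can be sketched as follows. Step one is assumed to have already produced the entities; step two enumerates every ordered (head, tail) pair and asks a relation classifier about it. The function `classify_relation` is a hypothetical rule-based stand-in for a trained model, and the entity categories are assumptions made for illustration.

```python
from itertools import permutations


def classify_relation(sentence, head, tail):
    """Hypothetical stand-in for a trained relation classifier.

    Toy rule: a site entity as head and a disease entity as tail
    are linked by "site disease"; every other pair has no relation.
    """
    if head[0] == "site" and tail[0] == "disease":
        return "site disease"
    return None


def pipeline_relations(sentence, entities):
    """Step two of Pipeline: classify every ordered entity pair."""
    triples = []
    for head, tail in permutations(entities, 2):
        relation = classify_relation(sentence, head, tail)
        if relation is not None:
            triples.append((head[1], relation, tail[1]))
    return triples


sentence = "a little inflammation in the right lower lung, see nodules"
# Assumed output of step one: (category, text) entities.
entities = [
    ("site", "right lower lung"),
    ("disease", "inflammation"),
    ("disease", "nodules"),
]

triples = pipeline_relations(sentence, entities)
print(triples)
```

Because the pairs are ordered, the direction of each relation is preserved. The cost is that errors from step one carry over into step two, and the number of candidate pairs grows quadratically with the number of entities.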

3. Common decoding methods include BIO, Pointer, and TPLinker. Each is more complex than the one before it, but also more powerful.

4. There are two common difficulties:

  • Overlap: one entity takes part in several relations. In “a little inflammation in the right lower lung, see nodules”, “right lower lung” is related to both “inflammation” and “nodules”.

  • Entity pair combination: entity pairs can be combined in several ways. For example, “a small nodule can be seen in the left lung and the right lung” and “a small nodule and ground glass shadow can be seen in the left lung and the right lung” pair their entities differently, so a more complex model is needed to cover the different combination patterns.
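Token-pair decoding copes with the overlap case because every (token, token) cell carries its own label, so one entity can appear in any number of relation cells. The sketch below uses a simplified scheme invented for illustration: entity cells link a span's first and last token, and relation cells link the first token of the head entity to the first token of the tail entity. It is not the TPLinker algorithm itself, and all names and indices are assumptions.

```python
def decode_token_pairs(tokens, entity_cells, relation_cells):
    """Decode triples from labeled token pairs.

    entity_cells:   {(start, end_inclusive): category}
    relation_cells: {(head_start, tail_start): relation}
    """
    # Index entity text by its first token. Simplification: two entities
    # starting at the same token would collide in this lookup.
    text_by_start = {
        start: " ".join(tokens[start:end + 1]) for (start, end) in entity_cells
    }
    triples = []
    for (head_start, tail_start), relation in sorted(relation_cells.items()):
        if head_start in text_by_start and tail_start in text_by_start:
            triples.append(
                (text_by_start[head_start], relation, text_by_start[tail_start])
            )
    return triples


tokens = "a little inflammation in the right lower lung , see nodules".split()

entity_cells = {(2, 2): "disease", (5, 7): "site", (10, 10): "disease"}
# Token 5 ("right") is the head of two different relation cells.
relation_cells = {(5, 2): "site disease", (5, 10): "site disease"}

triples = decode_token_pairs(tokens, entity_cells, relation_cells)
print(triples)
```

The same grid also expresses different entity pair combinations, since each pairing is simply a different set of labeled cells.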

5. Derivative issues are as follows:

  • Distant supervision noise: because labeled training data is usually scarce, distant supervision can be used to construct training data, but the automatically generated labels are noisy

  • Document-level extraction: a document contains many entities, and problems such as coreference ambiguity make extraction difficult

  • Open-domain relation extraction: relations in the open domain are not predefined, so the model has to identify subject-verb-object triples automatically. This is often used when constructing knowledge graphs.

Event extraction

1. Definition: Extract predefined event trigger words and event elements (arguments) from a text and combine them into the corresponding structured information. Beyond events, the results of information extraction in practical applications may be even more complex, but they can still be decomposed into relation extraction.

2. A common solution is to convert event extraction into a relation extraction problem
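One way to picture this conversion is to treat the trigger word as the head entity and each event element as a tail entity, with the element's role as the relation type. The following is a minimal sketch under that assumption; the function names are made up, and the trigger word "lesion" is hypothetical because the pathological event example in this article does not name a trigger.

```python
from collections import defaultdict


def event_to_relations(trigger, arguments):
    """Flatten one event into (trigger, role, value) triples."""
    return [(trigger, role, value) for role, value in arguments.items()]


def relations_to_events(triples):
    """Regroup triples by trigger to rebuild the event structure."""
    events = defaultdict(dict)
    for trigger, role, value in triples:
        events[trigger][role] = value
    return dict(events)


# "lesion" is an assumed trigger word; the roles follow the article's example.
arguments = {
    "description": "round-like",
    "lesion size": "27*28",
    "lesion site": "dorsal segment of the left lower lung",
}

triples = event_to_relations("lesion", arguments)
print(triples)
print(relations_to_events(triples))
```

A relation extraction model can then be trained on the triples, and its output regrouped into events afterwards. Note that this regrouping merges two events that share the same trigger text, so a real system would need to tell trigger mentions apart, for example by position.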

Conclusion

The main knowledge points involved in the article are:

Entity extraction: overlapping and nested entities, discontinuous entities, BIO
Relation extraction: overlapping relations, entity pair combination, BIO, Pointer, TPLinker
Event extraction: complex structure extraction

Models built on BIO, Pointer, and TPLinker decoding can handle essentially all information extraction tasks, because the field has converged on unified formulations in which a single, sufficiently expressive model can solve many kinds of problems.

Python Project Recommendations

  • Entity extraction:

    Github.com/buppt/Chine… Implementation)

  • Relation extraction:

    Github.com/buppt/Chine… Implementation)
    github.com/lvjianxin/R… Implementation)
    github.com/Jacen789/re… Bert)

  • Event extraction:

    Github.com/benkang-che…