NLP system architecture and main flow

Preface

This post reviews the architecture and main flow of an NLP system.

NLP architecture

This figure is from [Legislator Popular Science: Brief Introduction of Natural Language System Architecture].

Main process steps

Word segmentation (Tokenization)
Part-of-speech tagging (POS Tagging)
Semantic chunking (Chunking)
Named entity annotation (Named Entity Tagging)

The steps above are mainly shallow NLP analysis tasks, which are typically framed as sequence labeling tasks.

Syntactic analysis
Text/semantic analysis

Chinese word segmentation

Unlike English, Chinese does not use spaces to separate words, so a string of characters has to be broken into appropriate words before the text can be analyzed.

Segmentation (sentence to words) is the main part.

Dictionary-based word segmentation methods include the maximum matching method, the shortest path method, and the maximum probability method. The ones more often used in practice are:

An open-source Chinese word segmentation system based on Conditional Random Fields (CRF).
An open-source Chinese word segmentation algorithm based on Zhang Huaping’s NShort (the core algorithm of Jieba word segmentation).
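As an illustration of the dictionary-based family mentioned above, here is a minimal sketch of forward maximum matching. The function name `forward_max_match`, the `max_word_len` parameter, and the toy dictionary are assumptions made for this example; they do not come from any particular segmentation library.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation: at each position, take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"自然", "语言", "自然语言", "处理", "系统"}
print(forward_max_match("自然语言处理系统", toy_dictionary))
# ['自然语言', '处理', '系统']
```

Because the match is greedy, "自然语言" wins over the shorter "自然" even though both are in the dictionary. That greediness is also the method's weakness on ambiguous strings, which is one reason statistical methods are more common in practice.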

Word formation (characters to words) mainly uses character-level sequence labeling.
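To make the sequence labeling view concrete, the sketch below assumes the B/M/E/S tagging scheme (begin, middle, end, single), which is one common choice but is not named in the text above. It only shows how words are rebuilt from per-character tags; producing the tags is the job of a trained model such as a CRF. The function name `bmes_to_words` is hypothetical.

```python
def bmes_to_words(chars, tags):
    """Rebuild words from per-character B/M/E/S tags."""
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:           # flush an unfinished word, if any
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # start of a multi-character word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # middle of a word
            current += ch
        else:                     # "E": end of a word
            words.append(current + ch)
            current = ""
    if current:                   # flush a word that never received its "E"
        words.append(current)
    return words


print(bmes_to_words("我爱北京", ["S", "S", "B", "E"]))
# ['我', '爱', '北京']
```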

Part-of-speech tagging (`POS Tagging`)

Part of speech is a grammatical attribute of a word and the bridge that connects words to syntax. Part-of-speech tagging (POS tagging) determines the grammatical role that each word plays in a sentence.

Most techniques use an HMM (hidden Markov model) with the Viterbi algorithm, or a maximum entropy model. Currently, there are two popular Chinese part-of-speech tag sets: the PKU tag set and the Penn tag set.
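The HMM plus Viterbi combination can be sketched in a few lines. The probabilities below are made-up toy values with only two tags (`N`, `V`), chosen so the result is easy to check by hand; a real tagger would estimate them from an annotated corpus and use a full tag set.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations under an HMM."""
    if not observations:
        return []
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model with invented probabilities, for illustration only.
states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.6}}

print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))
# ['N', 'V']
```

For longer sentences, a practical implementation would work with log probabilities to avoid numerical underflow; plain products are used here to keep the sketch short.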

Modern Chinese words can be divided into two categories covering 12 parts of speech. Content words: nouns, verbs, adjectives, numerals, quantifiers and pronouns. Function words: adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia.

Semantic chunking (`Chunking`)

Based on the POS-tagged sentence, words are grouped into chunks such as subject, predicate and object according to the syntactic structure of the sentence.

The most common method for semantic chunking is Conditional Random Fields (CRF).
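A chunker such as a CRF usually outputs one tag per token, and the chunks are then read off those tags. The sketch below assumes the B-/I-/O tag convention and phrase labels like `NP` and `VP`; both are assumptions for illustration and are not specified in the text above. The tags in the example are written by hand, not produced by a model.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into (label, [tokens]) chunks from B-X / I-X / O tags."""
    chunks = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        # An I- tag whose label differs from the open chunk is treated as a new chunk.
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if tag == "O" or starts_new:
            if words:                      # close the chunk that was open
                chunks.append((label, words))
            label, words = (None, []) if tag == "O" else (tag[2:], [token])
        else:                              # I- tag continuing the open chunk
            words.append(token)
    if words:
        chunks.append((label, words))
    return chunks


tokens = ["The", "cat", "sat", "on", "the", "mat"]
tags = ["B-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
print(bio_to_chunks(tokens, tags))
# [('NP', ['The', 'cat']), ('VP', ['sat']), ('PP', ['on']), ('NP', ['the', 'mat'])]
```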

Named entity annotation (`Named Entity Tagging`)

Named entity recognition identifies entities with specific meanings in text, including person names, place names, organization names and other proper nouns. The task is usually defined over three categories (entity, time and number) and seven sub-categories (person name, organization name, place name, time, date, currency and percentage).

The techniques used are a standard HMM and the Viterbi algorithm.
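The three categories and seven sub-categories listed above can be written down as a small lookup table. The grouping of sub-categories under each category is inferred from the order in which they are listed, and the names `NE_TAXONOMY` and `category_of` are hypothetical.

```python
# Three top-level categories and their seven sub-categories.
NE_TAXONOMY = {
    "entity": ["person", "organization", "location"],
    "time": ["time", "date"],
    "number": ["currency", "percentage"],
}


def category_of(subtype):
    """Return the top-level category that a sub-category belongs to."""
    for category, subtypes in NE_TAXONOMY.items():
        if subtype in subtypes:
            return category
    raise KeyError(subtype)


print(category_of("date"))       # time
print(category_of("currency"))   # number
```

A tagger's label set is then typically derived from the sub-categories, so the lookup is useful when results need to be reported at the coarser category level.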

Syntactic analysis

Syntactic analysis automatically derives the grammatical structure of a sentence according to a given grammar: it identifies the grammatical units in the sentence and the relationships between them, and converts the sentence into a structured syntax tree.

The main approaches to syntactic analysis are:

Phrase structure grammar parsing
Dependency parsing
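To show what the output of dependency parsing looks like as data, the sketch below stores a parse as a list of head indices, one per token, with index 0 reserved for an artificial root. This representation is a common convention and is assumed here; the parse itself is written by hand, not produced by a parser.

```python
def children_by_head(heads):
    """Map each head index to the list of its dependents (tokens are numbered from 1)."""
    tree = {}
    for dependent, head in enumerate(heads, start=1):
        tree.setdefault(head, []).append(dependent)
    return tree


tokens = ["She", "reads", "books"]
heads = [2, 0, 2]   # "She" -> "reads", "reads" -> ROOT, "books" -> "reads"

tree = children_by_head(heads)
root_index = tree[0][0]
print(tokens[root_index - 1])                       # reads
print([tokens[i - 1] for i in tree[root_index]])    # ['She', 'books']
```

A phrase structure parse, by contrast, would nest the same words inside labeled constituents rather than link them word to word.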

Text/semantic analysis

It mainly includes text similarity analysis, keyword extraction, text classification, summarization and sentiment analysis. Semantic analysis also involves techniques such as anaphora resolution. Text classification can use the naive Bayes algorithm.
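As a rough illustration of naive Bayes for text classification, here is a minimal multinomial variant with add-one smoothing. The function names and the four training documents are invented for this example, and the input is assumed to be already segmented into tokens.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs is a list of (tokens, label) pairs; returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)           # log prior
        for tok in tokens:
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy training set, for illustration only.
toy_docs = [
    (["great", "movie"], "pos"),
    (["good", "plot"], "pos"),
    (["bad", "movie"], "neg"),
    (["boring", "plot"], "neg"),
]
model = train_naive_bayes(toy_docs)
print(predict(model, ["great", "plot"]))     # pos
print(predict(model, ["boring", "movie"]))   # neg
```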

Summary

This post reviews the architecture and main flow of an NLP system as a basis for further targeted study.

doc

Zhaohua Bit by Bit: The Evolution of Millions Architecture Slide
Legislator Popular Science: Brief Introduction of Natural Language System Architecture
What is the difference between POS Tagging and Chunking/Shallow Parsing?
Basic technology of Baidu language processing
NLTK Reading Notes – Information Extraction (2)
What is the relationship between parsing and semantic analysis in NLP?
Principle and practice of NLP Chinese Natural Language Processing
