Preface
This article reviews the architecture and processing flow of an NLP system.
NLP architecture
This figure is from [Legislator Popular Science: Brief Introduction of Natural Language System Architecture].
Main process steps
- Word segmentation (Tokenization)
- Part-of-speech tagging (POS Tagging)
- Semantic chunking (Chunking)
- Named entity recognition (Named Entity Tagging)
The steps above are shallow NLP analysis tasks, that is, sequence labeling tasks. The remaining steps go deeper:
- Syntactic analysis
- Text/semantic analysis
Chinese word segmentation
Unlike English, Chinese does not use spaces to separate words, so a string of characters has to be split into appropriate words before the text can be analyzed.
Segmentation (sentence to words) is the main technique.
Dictionary-based word segmentation methods include the maximum matching method, the shortest path method, and the maximum probability method. The following are used more often in practice:
- An open source Chinese word segmentation system based on conditional random fields (CRF).
- An open source Chinese word segmentation algorithm based on Zhang Huaping's NShort (the core algorithm of Jieba word segmentation).
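As an illustration of the dictionary-based maximum matching method mentioned above, here is a minimal sketch of forward maximum matching. The function name `forward_max_match` and the toy dictionary are hypothetical and chosen only for this example; real systems use much larger dictionaries and usually combine several strategies.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching over a set of known words."""
    if max_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "的", "起源"}
print(forward_max_match("研究生命的起源", toy_dict))
# → ['研究生', '命', '的', '起源']
```

The output shows the known weakness of greedy matching: the intended reading is 研究 / 生命 / 的 / 起源, but the longest-first rule consumes 研究生 and leaves 命 stranded. This kind of ambiguity is one reason statistical methods are preferred in practice.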
Character-based word formation (characters to words) mainly uses character sequence labeling.
Part-of-speech tagging (POS Tagging)
A part of speech, also called a word class, is a grammatical attribute of a word and the bridge that connects words to syntax. Part-of-speech tagging (POS tagging) means determining the grammatical role that each word plays in a sentence.
Most taggers use an HMM (hidden Markov model) with the Viterbi algorithm, or a maximum entropy model. Two Chinese part-of-speech tag sets are currently popular: the PKU tag set and the Penn (Chinese Treebank) tag set.
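To make the HMM plus Viterbi combination concrete, the following is a minimal sketch of Viterbi decoding over a toy two-tag model. The tags, words, and probabilities are invented for illustration and are not taken from any real tag set or corpus; the function assumes a non-empty observation list.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    def logp(p):
        # Work in log space; zero probability maps to -inf.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: logp(start_p[s]) + logp(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev, score = max(
                ((p, best[t - 1][p] + logp(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + logp(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model with made-up probabilities (N = noun, V = verb).
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {
    "N": {"dogs": 0.6, "bark": 0.1, "fish": 0.3},
    "V": {"dogs": 0.05, "bark": 0.7, "fish": 0.25},
}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))
# → ['N', 'V']
```

In a real tagger the three probability tables would be estimated from a tagged corpus rather than written by hand.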
Modern Chinese words fall into two categories covering 12 parts of speech. Content words are nouns, verbs, adjectives, numerals, quantifiers, and pronouns. Function words are adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia.
Semantic chunking (Chunking)
Starting from a sentence that has already been POS-tagged, words are grouped according to the syntactic structure of the sentence into units such as subject, predicate, and object.
The most common method for semantic chunking is conditional random fields (CRF).
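A sequence labeler such as a CRF emits one label per token, and those labels then have to be grouped into chunks. The sketch below shows only that grouping step, assuming the widely used BIO labeling scheme (B- begins a chunk, I- continues it, O is outside any chunk). The source does not name a labeling scheme, so BIO is an assumption here, and the labels are written by hand rather than predicted by a model.

```python
def bio_to_chunks(tokens, labels):
    """Group tokens into (chunk_type, [tokens]) spans from BIO labels."""
    chunks = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        starts_chunk = label.startswith("B-") or (
            # A stray I- tag with a different type is treated as a chunk start.
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_chunk:
            if current_tokens:
                chunks.append((current_type, current_tokens))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O": close any open chunk and stay outside.
            if current_tokens:
                chunks.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_tokens:
        chunks.append((current_type, current_tokens))
    return chunks


# Hand-written labels standing in for the output of a trained chunker.
tokens = ["The", "little", "dog", "barked", "at", "the", "cat"]
labels = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
for chunk_type, words in bio_to_chunks(tokens, labels):
    print(chunk_type, " ".join(words))
```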
Named entity recognition (Named Entity Tagging)
Named entity recognition identifies entities with specific meanings in text, including person names, place names, organization names, and other proper nouns. The task is usually defined over three categories (entity, time, and number) and seven sub-categories (person name, organization name, place name, time, date, currency, and percentage).
The techniques used are a standard HMM and the Viterbi algorithm.
Syntactic analysis
Syntactic analysis automatically derives the grammatical structure of a sentence according to a given grammar: it identifies the grammatical units in the sentence and the relationships between them, and converts the sentence into a structured syntax tree.
The two main approaches to syntactic analysis are:
- Phrase structure grammar analysis (constituency parsing)
- Dependency parsing
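To show what converting a sentence into a syntax tree under a given grammar can look like, here is a minimal sketch of phrase structure parsing with the CYK algorithm. CYK is one classic choice that is swapped in for illustration; the source does not name a specific parsing algorithm. The toy grammar and lexicon are hypothetical and must be in Chomsky normal form (binary rules plus word-level rules).

```python
def cyk_parse(words, lexicon, rules, start="S"):
    """Return one parse tree (nested tuples) or None, using CYK.

    lexicon: dict mapping a word to the set of symbols that can produce it
    rules:   dict mapping (B, C) to the set of parents A for rules A -> B C
    """
    n = len(words)
    if n == 0:
        return None
    # table[i][j]: dict symbol -> tree covering words[i..j] inclusive
    table = [[{} for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        for symbol in lexicon.get(word, ()):
            table[i][i][symbol] = (symbol, word)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):  # split point between left and right spans
                for b, left in table[i][k].items():
                    for c, right in table[k + 1][j].items():
                        for a in rules.get((b, c), ()):
                            # Keep the first tree found for each symbol.
                            table[i][j].setdefault(a, (a, left, right))
    return table[0][n - 1].get(start)


# Hypothetical toy grammar: S -> NP VP, NP -> Det N, VP -> V NP
rules = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
lexicon = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
tree = cyk_parse("the dog chased the cat".split(), lexicon, rules)
print(tree)
```

Dependency parsing would instead produce head-to-dependent arcs between words rather than nested phrases.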
Text/semantic analysis
This stage mainly covers text similarity analysis, keyword extraction, text classification, summarization, and sentiment analysis. Semantic analysis also involves techniques such as anaphora resolution. Text classification can be done with the naive Bayes algorithm.
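As a sketch of how naive Bayes can be applied to text classification, the following implements a small multinomial naive Bayes classifier with add-one (Laplace) smoothing. The class name `NaiveBayesClassifier` and the tiny training set are invented for illustration; the documents are assumed to be already tokenized.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, for illustration only.
train_docs = [
    ["great", "movie", "loved", "it"],
    ["wonderful", "acting", "great", "plot"],
    ["terrible", "movie", "boring", "plot"],
    ["awful", "acting", "hated", "it"],
]
train_labels = ["pos", "pos", "neg", "neg"]
classifier = NaiveBayesClassifier().fit(train_docs, train_labels)
print(classifier.predict(["great", "acting"]))      # → pos
print(classifier.predict(["boring", "terrible"]))   # → neg
```

For Chinese text, the tokens would come from the word segmentation step described earlier.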
Summary
This article has reviewed the architecture and main processing flow of an NLP system, as a basis for more targeted study of each step.
References
- Zhaohua Bit by Bit: The Evolution of Millions Architecture Slide
- Legislator Popular Science: Brief Introduction of Natural Language System Architecture
- What is the difference between POS Tagging and Chunking/Shallow Parsing?
- Basic technology of Baidu language processing
- NLTK Reading Notes – Information Extraction (2)
- What is the relationship between parsing and semantic analysis in NLP?
- Principle and practice of NLP Chinese Natural Language Processing