Natural Language Processing (NLP) is a branch of artificial intelligence that specializes in analyzing human language. It is the bridge between machine language and human language, and its purpose is to make human-machine communication possible. Before artificial intelligence, machines could only work with structured data (like data in Excel). However, most of the data on the internet is unstructured: articles, pictures, audio, video… Among unstructured data, text is the most plentiful. Although it does not occupy as much space as pictures and videos, it carries the most information. To analyze and use that information, we need NLP technology so that machines can understand text and act on it.

The main contents of NLP are as follows:

Two core tasks:

  1. Natural language understanding – NLU
  2. Natural language generation – NLG

Five difficult points:

  1. Language has no fixed rules, or the rules are extremely complex.
  2. Language can be combined freely, and simple expressions can be combined into complex ones.
  3. Language is an open set, and we can invent new expressions at will.
  4. Language is tied to practical knowledge and depends on that knowledge.
  5. Language use depends on context and environment.

Four typical applications:

  1. Sentiment analysis
  2. Chatbot
  3. Speech recognition
  4. Machine translation

Six implementation steps:

  1. Word segmentation – tokenization
  2. Stem extraction – stemming
  3. Word form reduction – lemmatization
  4. Part-of-speech tagging – POS tagging
  5. Named entity recognition – NER
  6. Text chunking – chunking

Natural language understanding – NLU

1. What is natural language understanding (NLU)?

NLP (Natural Language Processing) is a technology for communicating with computers in natural language. Because the key to processing natural language is getting the computer to “understand” it, natural language processing is also referred to as NLU (Natural Language Understanding), and it is also known as computational linguistics. On the one hand it is a branch of language information processing; on the other it is one of the core topics of artificial intelligence (AI).

For example, the smart speakers we usually use:

“I want to listen to music”

“Put on a song”, “Play some music”…

There are many natural ways to say the same thing, and the expressions that carry the intent of “listening to music” can be combined endlessly. Understanding so many different expressions is a challenge for a machine. In the early days, machines could only process structured data (such as keywords), so users had to enter precise instructions before the machine could understand them. Keyword matching also misfires as soon as the wording changes, which made these systems look stupid. Natural language understanding lets the machine work out which of the many natural expressions belong to a given intent and which do not, without relying on rigid keywords. Another example, again with a smart speaker:

“Too loud”

Machine: “I have turned down the volume for you”

The user never mentions the volume, but the machine needs to infer the user’s intent: the volume is too high and should be turned down.

2. Applications of natural language understanding (NLU)

  • Machine translation (Youdao, Baidu Translation, etc.)
  • Automated customer service (customer service bots in various apps)
  • Smart speakers (Xiaoai Speaker, Tmall Genie, etc.)

3. Difficulties in natural language understanding (NLU)

Difficulty 1: Diversity of language. Natural language offers many ways to express the same thing, and they combine very flexibly. Different combinations can express different meanings, and exceptions can always be found.

Difficulty 2: Ambiguity of language. Without context and without the constraints of the surrounding environment, language is highly ambiguous.

Difficulty 3: Robustness. Natural language input, especially text produced by speech recognition, often contains extra words, missing words, wrong words, noise, and similar problems.

Difficulty 4: Knowledge dependence. Language is a symbolic description of the world, so it is naturally tied to world knowledge.

Difficulty 5: Context. Context comes in many kinds: dialogue context, device context, application context, and user profile.

4. Implementation of NLU

Natural language understanding has gone through three iterations:

  1. Rule-based approach
  2. Statistics-based approach
  3. Deep learning-based approach

The rule-based approach judges the intent of an utterance with hand-summarized rules; common methods include CFG (context-free grammar) and JSGF (JSpeech Grammar Format). Common statistics-based NLU methods include SVM and ME (maximum entropy). With the explosion of deep learning, CNN, RNN, and LSTM all became mainstream, and the Transformer is currently the most advanced method.
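To make the rule-based iteration concrete, here is a minimal sketch in Python. It is not CFG or JSGF; it swaps in plain regular expressions as the hand-written rules, and the intent names and patterns are hypothetical, chosen to match the smart speaker examples above.

```python
import re
from typing import Optional

# Hypothetical hand-written rules: each intent maps to a few regex patterns.
INTENT_RULES = {
    "play_music": [
        r"\blisten to\b.*\bmusic\b",
        r"\bput on\b.*\bsong\b",
        r"\bplay\b.*\b(music|song)\b",
    ],
    "volume_down": [
        r"\btoo loud\b",
        r"\bturn\b.*\bdown\b",
    ],
}

def match_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose rules match, or None if no rule fires."""
    text = utterance.lower()
    for intent, patterns in INTENT_RULES.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return intent
    return None

print(match_intent("I want to listen to music"))  # play_music
print(match_intent("Too loud"))                   # volume_down
print(match_intent("My ears hurt"))               # None: no rule covers it
```

The last call shows why rules alone fall short: every new way of phrasing an intent needs another hand-written pattern, which is the gap the statistical and deep learning approaches try to close.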

Natural language generation – NLG

NLG is designed to bridge the communication gap between humans and machines by converting data in non-language formats into language formats that humans can understand, such as articles and reports.

1. Two forms of NLG:

  1. Text-to-text: generating language from text
  2. Data-to-text: generating language from data

2. Three levels of NLG

Simple data merging: a simplified form of natural language processing that converts data to text (through Excel-like functions).

Templated NLG: this form of NLG uses a template-driven mode to display output. The data changes dynamically, and the text is generated by a predefined set of business rules, such as if/else statements and loops.

Advanced NLG: this form of natural language generation reads as if a human wrote it. It understands intent, adds intelligence, considers context, and presents the results as an insightful narrative that users can easily read and understand.
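The templated level is the easiest to picture in code. The sketch below is only an illustration: the function name `describe_sales`, the sales scenario, and the wording of the template are all made up for this example.

```python
def describe_sales(region: str, current: float, previous: float) -> str:
    """Fill a fixed sentence template; if/else business rules pick the wording."""
    if previous == 0:
        return f"Sales in {region} were {current:.0f}; there is no earlier figure to compare with."
    change = (current - previous) / previous * 100
    if change > 0:
        trend = f"rose {change:.1f}%"
    elif change < 0:
        trend = f"fell {abs(change):.1f}%"
    else:
        trend = "were unchanged"
    return f"Sales in {region} {trend} compared with the previous period."

print(describe_sales("East", 120, 100))
print(describe_sales("West", 90, 100))
```

The data can change on every call, but the sentence structure never does, which is what separates templated NLG from the advanced level.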

3. The six steps of NLG

Step 1: Content determination – As a first step, the NLG system needs to decide what information should and should not be included in the text being built. Often the data contains more information than is ultimately conveyed.

Step 2: Text structuring – After determining what information needs to be conveyed, the NLG system needs to organize the text in a reasonable order. For example, a report on a basketball match states “what time”, “where” and “which 2 teams” first, then “the overview of the game”, and finally “how the game ended”.

Step 3: Sentence aggregation – Not every piece of information needs its own sentence. Combining several pieces of information into a single sentence may be more fluent and easier to read.

Step 4: Lexicalisation – When the content of each sentence has been decided, the messages can be put into natural language. This step adds connectives between the messages so that each one reads like a complete sentence.

Step 5: Referring expression generation (REG) – This step is similar to lexicalisation: it also chooses words and phrases to form a complete sentence. The essential difference is that “REG needs to identify the domain of the content and then use the vocabulary of that domain (rather than other domains)”.

Step 6: Linguistic realisation – Finally, when all the relevant words and phrases have been identified, they need to be combined into a well-structured complete sentence.
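As a rough sketch of how these steps fit together, the Python below pushes a made-up basketball record through a tiny pipeline. Everything here is hypothetical (the field names, the teams, the sentence wording), and steps 3 to 6 are collapsed into one function, so treat it as an outline of the idea and not as a real NLG system.

```python
# Hypothetical match record: every field name and value is invented.
MATCH = {
    "date": "Saturday", "venue": "City Arena",
    "home": "Hawks", "away": "Lions",
    "home_score": 98, "away_score": 91,
    "attendance": 12000, "referee_id": "R-17",
}

def determine_content(data):
    """Step 1: keep only the fields worth reporting."""
    wanted = ("date", "venue", "home", "away", "home_score", "away_score")
    return {key: data[key] for key in wanted}

def structure_text(content):
    """Step 2: order the messages: when/where/who first, then the result."""
    return [("setup", content), ("result", content)]

def realise(plan):
    """Steps 3-6, heavily simplified: several facts are merged into one
    sentence per message, and fixed wording stands in for lexicalisation,
    referring expressions, and realisation."""
    sentences = []
    for kind, m in plan:
        if kind == "setup":
            sentences.append(
                f"On {m['date']}, the {m['home']} hosted the {m['away']} at {m['venue']}."
            )
        else:
            # Assumes the game did not end in a tie.
            if m["home_score"] > m["away_score"]:
                winner, loser = m["home"], m["away"]
            else:
                winner, loser = m["away"], m["home"]
            high = max(m["home_score"], m["away_score"])
            low = min(m["home_score"], m["away_score"])
            sentences.append(f"The {winner} beat the {loser} {high}-{low}.")
    return " ".join(sentences)

report = realise(structure_text(determine_content(MATCH)))
print(report)
```

Note that the attendance and referee fields never reach the text: that filtering is the content determination step at work.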

4. Three typical applications of NLG

  • Automatic writing (automatic writing of news, automatic writing of papers, etc.)
  • Chatbots (built-in chatbots developed for various mobile phones, smart speakers, shopping mall navigation robots, etc.)
  • BI interpretation and report generation (interpretation reports for all walks of life, such as physical examination reports)


Word segmentation – tokenization

Word segmentation decomposes long texts such as sentences, paragraphs, and articles into data structures whose unit is the word, which makes later processing and analysis easier. With deep learning, some tasks can also work at the character level instead of the word level. In practice we mostly deal with Chinese and English word segmentation, which differ from each other in three ways:

  • Difference 1: The segmentation method differs, and Chinese is harder

English has a natural separator, the space, but Chinese does not, so deciding where to split is a hard problem. In addition, a Chinese word often has several meanings, which easily leads to ambiguity.

  • Difference 2: English words have many forms

English words take many inflected forms. To cope with them, English NLP has some processing steps that Chinese does not need, namely lemmatization and stemming. For example, does, done, doing, and did need to be restored to do, and cities, children, and teeth need to be converted to city, child, and tooth.

  • Difference 3: Granularity should be considered in Chinese word segmentation

The coarser the granularity, the more precise the meaning, but the lower the recall.

Word segmentation methods can be roughly divided into three categories:

  1. Dictionary-based matching
  2. Statistics-based
  3. Deep learning-based

Dictionary-based matching. Advantages: fast and low cost. Disadvantages: weak adaptability, with results that differ widely across domains. My blog has a post on a dictionary-matching method: C# implementation of forward maximum matching with a dictionary tree (word segmentation, retrieval).
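For readers who have not seen forward maximum matching, here is a minimal sketch in Python. It is not the C# code from the post mentioned above: it uses a plain set as the dictionary where that post uses a dictionary tree, and the four-word dictionary is a toy chosen for illustration.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word that fits, falling back to a single character."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dictionary))
# ['研究生', '命', '起源']
```

The output also shows the weakness listed above: the greedy match grabs 研究生 ("graduate student") and breaks the intended reading 研究 / 生命 / 起源 ("study the origin of life"), because a dictionary alone cannot resolve the ambiguity.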

Statistics-based. Advantages: strong adaptability. Disadvantages: high cost and slow speed. Commonly used algorithms at present include HMM, CRF, SVM, and deep learning. For example, the Stanford and HanLP word segmentation tools are based on the CRF algorithm.

Deep learning-based. Advantages: high accuracy and strong adaptability. Disadvantages: high cost and slow speed. For example, some people use a bidirectional LSTM+CRF to implement word segmentation. This is sequence labeling in essence, so the model is general: it can also be used for named entity recognition. The reported character-level accuracy of word segmentation is as high as 97.5%.
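To show what "sequence labeling" means for segmentation, the sketch below converts between words and per-character tags. The B/M/E/S scheme (begin, middle, end, single) is an assumption on my part, since the text above does not name a tag set, and no model is trained here: the tags a BiLSTM+CRF would predict are simply written by hand.

```python
def words_to_tags(words):
    """Turn a segmented sentence into one tag per character (training labels).
    Assumes every word is non-empty."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Decode predicted tags back into words: a word closes on E or S."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words

tags = words_to_tags(["研究", "生命", "起源"])
print(tags)                                  # ['B', 'E', 'B', 'E', 'B', 'E']
print(tags_to_words("研究生命起源", tags))    # ['研究', '生命', '起源']
```

Named entity recognition fits the same mold with a different tag set, which is why one labeling model can serve both tasks.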

Chinese word segmentation tools (ranked by GitHub stars)

  1. HanLP
  2. Stanford Word Segmenter
  3. Ansj word segmentation
  4. LTP (Harbin Institute of Technology)
  5. KCWS word segmentation
  6. jieba
  7. IK
  8. THULAC (Tsinghua University)

English word segmentation tools

  1. Keras
  2. spaCy
  3. Gensim
  4. NLTK

Stem extraction (stemming) and word form reduction (lemmatisation)

Stemming and lemmatization are important steps in preprocessing an English corpus. English words take many forms, so they need lemmatization and stemming; Chinese does not! Stemming is the process of removing prefixes and suffixes to get the root. Common affixes mark the plural noun, the progressive tense, the past participle, and so on. For example, [dogs] stems to [dog]. Stemming is more widely used in information retrieval, in systems such as Solr and Lucene, to broaden a search; its granularity is coarse.

Lemmatization is dictionary-based: it transforms an inflected form of a word into its most basic form. It does not simply strip a prefix or suffix but converts the word according to a dictionary. For example, [drove] converts to [drive]. Lemmatization is mainly used in text mining, natural language processing, and other text analysis that needs finer granularity and more accuracy.

The three main stemming algorithms are Porter, Snowball, and Lancaster.
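The difference between the two is easy to see with a toy version of each. The sketch below is not Porter, Snowball, or Lancaster: the suffix list and the small lemma dictionary are invented for illustration, and a real stemmer applies far more careful rules.

```python
# Toy suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ing", "ies", "es", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy dictionary mapping inflected forms to their base form.
LEMMA_DICT = {
    "drove": "drive", "does": "do", "done": "do", "doing": "do", "did": "do",
    "cities": "city", "children": "child", "teeth": "tooth",
}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMA_DICT.get(word, word)

for word in ("dogs", "cities", "drove"):
    print(word, toy_stem(word), toy_lemmatize(word))
```

Suffix stripping handles [dogs] but turns [cities] into the non-word "cit" and leaves [drove] untouched, while the dictionary lookup returns [city] and [drive]. That is the coarse versus fine granularity trade-off described above.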

Part-of-speech tagging – POS tagging

Part-of-speech tagging (POS tagging) is also called grammatical tagging or word-category disambiguation. In corpus linguistics, it is a text processing technique that marks the part of speech of each word in a corpus according to its meaning and context. In other words, part-of-speech tagging determines the grammatical category of each word in a given sentence and labels the word with it. The comparison table of Chinese part-of-speech tags is as follows:


Named Entity Recognition – NER

Named Entity Recognition (NER) is a very basic task in NLP and a fundamental tool for many other NLP tasks, such as information extraction, question answering, parsing, and machine translation. What is an entity? Put simply, an entity can be thought of as an instance of a certain concept.

For example, if “name” is a concept, or entity type, then “Sun Quan” is a “name” entity. If “time” is an entity type, then “National Day” is a “time” entity. Entity recognition is the process of picking out the entities of the types you want to capture from a sentence. Only with good entity recognition can other tasks, such as event extraction and relationship extraction, work well.
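As a minimal sketch of "picking out entities from a sentence", the Python below matches a sentence against a small lookup table (a gazetteer). The table, the type labels, and the test sentence are hypothetical, and real NER systems are usually statistical or neural sequence labelers, not plain lookups.

```python
# Hypothetical gazetteer: known entity strings and their entity types.
GAZETTEER = {"Sun Quan": "NAME", "National Day": "TIME"}

def find_entities(sentence, gazetteer=GAZETTEER):
    """Return (entity, type) pairs in the order they appear in the sentence."""
    found = []
    for surface, label in gazetteer.items():
        start = sentence.find(surface)
        while start != -1:
            found.append((start, surface, label))
            start = sentence.find(surface, start + len(surface))
    # Sort by position, then drop the position from the result.
    return [(surface, label) for _, surface, label in sorted(found)]

print(find_entities("Sun Quan will visit on National Day."))
# [('Sun Quan', 'NAME'), ('National Day', 'TIME')]
```

A lookup like this only finds entities it already knows, which is why learned models are needed for new names.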

Text chunking – chunking

Text chunking divides a large piece of text into several smaller pieces, for example to extract a small part of a text or to split it into segments with a fixed number of words. It is often used for very large texts. Note that text chunking is not the same as word segmentation: word segmentation divides a text into words, whereas text chunking divides a large text into smaller text segments. Phrase chunking: marking the phrase-level chunks in a sentence, such as noun phrases (NP) and verb phrases (VP).
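The fixed-number-of-words variant of text chunking can be sketched in a few lines of Python. The function name `chunk_by_words` is my own, and splitting on whitespace assumes English-style text. This covers only the text chunking sense; phrase chunking into NP and VP needs a tagger and is not shown.

```python
def chunk_by_words(text, size):
    """Split text into consecutive segments of at most `size` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

print(chunk_by_words("one two three four five", 2))
# ['one two', 'three four', 'five']
```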

Summary

NLP covers a great deal of work and technology; the above is only a brief introduction to its content, some of its concepts, and the existing methods. Each of these steps can be split out to serve different applications, or linked together to build a great product.

These are my notes and summary from starting to learn NLP; many explanations and sentences are copied from: AI-definiti… (Deleted).

Later, I will study more NLP topics, including learning and sharing [HanLP].