Preface

This article gives an overview of natural language processing (NLP), covering the original intention behind this column and the prospects of NLP, the concept of natural language processing, its applications, and its core technologies.

The original intention of the column and the prospects of NLP

1. Original intention of series design

The original purpose of writing this series:

  • Cultivating qualified NLP/AI talent: there are still plenty of gaps in the NLP talent market.
  • NLP is currently the hottest direction in AI. AI has been very popular since 2010, with deep learning driving its rapid development. NLP has become extremely popular in recent years; its explosion (around 2015) came later than CV's (around 2012), but despite the later start it is very strong and is expected to remain so for the next 2-3 years.
  • There is currently no particularly systematic, detailed NLP series covering:
    • Techniques from before the deep learning era
    • Methodology based on deep learning
  • NLP has developed very fast in recent years, and its body of knowledge is updated rapidly, e.g.:
    • BERT

At the same time, natural language processing is in a period of rapid development, almost exponential growth:

Artificial intelligence is commonly divided into three major areas: computer vision (CV), natural language processing (NLP), and speech recognition. Computer vision mainly processes and analyzes visual data such as pictures and videos; natural language processing analyzes text data. From this perspective, wherever there is text data there is a need for NLP technology. Even in the fintech sector there is substantial demand for text analysis, such as reading news and research reports to gauge market sentiment, or performing event analysis.

Over the past few years, a clear trend has emerged: text data is growing exponentially. This is inseparable from the data explosion brought by the mobile Internet. Imagine the amount of text data carried by the social apps we use every day, such as WeChat and Douyin. The rapid growth of text data is bound to be accompanied by rapid growth in the industry's demand for text analysis, and with it, the demand for NLP talent.

At the same time, for beginners NLP is a faster entry point into AI, with a lower barrier than CV.

2. NLP salary prospects

According to the current market, NLP engineer salaries are very attractive; salary data can be found on major Chinese recruitment sites such as BOSS Zhipin and Lagou.com. A fresh graduate working as an NLP engineer in a first-tier city generally starts at around 15K RMB per month (a conservative figure), and a strong candidate with a good background can reasonably aim for a monthly salary of 20K-25K or more.

Even in the United States, where AI technology is most developed, NLP talent is still well paid; you can check salaries through websites like Glassdoor. All in all, there is nothing wrong with entering NLP now: the talent gap will persist for at least the next 2-3 years, but the market will become more demanding, so the earlier you enter the industry, the greater your advantage.

3. How to learn NLP

As can be seen, machine learning is a necessary foundation for any AI direction;

Data structures and algorithms improve efficiency;

At the same time, you need to choose a specific direction, such as CV or NLP, for systematic study.

In addition, you must have a certain programming foundation;

After getting started in a direction, you need to go deep along a technical route or application scenario, into a specific field, until you master its details, striving to become a T-shaped talent;

You must also be able to read papers well, including the English literature.

The biggest cost of learning is time, and this series strives to be self-contained: taking this course alone should be enough to become a qualified NLP engineer. Other required skills, such as mathematics and data structures and algorithms, are also interspersed throughout the series, so it is recommended to review and summarize those points as you learn.

What else do you need to read? My view: not much. On the one hand, the contents of books easily go out of date; on the other hand, books often contain too much material, so you can spend a lot of time without a proportional harvest. Studying around this series, supplemented with online blogs and papers that match its contents, is what I have found most effective.

If I had to recommend books, I would recommend these two; both are classics, although not directly about NLP. One is Machine Learning: A Probabilistic Perspective by Kevin Murphy, which focuses on machine learning. Machine learning is the most important foundation for both NLP and CV, and your grasp of it determines how far you can go in AI. The other is Convex Optimization by Boyd. Optimization theory is at the core of AI: training a model can be understood as optimizing it to find the best parameters, and behind the scenes are various optimization algorithms. Putting these two books on your shelf will also make you look professional!

What is natural language processing

1. What is natural language processing

Three concepts of natural language processing:

  • Natural Language Processing (NLP)
  • Natural Language Understanding (NLU) — Understanding meaning in text
  • Natural Language Generation (NLG) — Generating text from meaning

Human beings communicate in three ways: voice, images, and text. The main task of NLP is to understand and generate text, hence the formula:


NLP = NLU + NLG

The diagram from NLU to NLG is as follows:

Generally speaking, NLP studies, on the one hand, how to better understand the meaning behind text (NLU), and on the other, how to generate text (NLG) that expresses an intended meaning.

2. Why is natural language processing difficult

Why is text harder to understand than images?

Text carries meaning behind it; understanding requires figuring out the deeper implied meaning.

Images are generally what-you-see-is-what-you-get, without much hidden meaning behind them.

Text is harder to understand than images because of the following characteristics:

  • One meaning, multiple expressions

    We just launched a new product

    A new product was just launched by our company

    The sentences above express the same meaning.

  • Polysemy

    I visited Apple today

    It’s apple season

    A word expresses different meanings depending on its context; the meaning cannot be determined from the word alone, without looking at its context. The BERT model captures the different meanings a word takes on in context.
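The contrast can be illustrated with a deliberately crude sketch. This is not BERT; the vectors and the context-averaging rule below are made-up assumptions purely for illustration of why a static embedding collapses the two senses of "apple" while a context-dependent representation keeps them apart.

```python
# Toy illustration of polysemy (NOT real BERT): a static embedding gives
# "apple" the same vector everywhere, while a context-dependent one does not.
# All vectors are made-up 2-d values chosen only for demonstration.
STATIC = {
    "apple":   (0.5, 0.5),   # forced to blend the company and fruit senses
    "visited": (0.9, 0.1),   # a "company-like" context word
    "season":  (0.1, 0.9),   # a "fruit-like" context word
}

def static_vector(word, context):
    """A static embedding ignores the context entirely."""
    return STATIC[word]

def contextual_vector(word, context):
    """A crude 'contextual' embedding: average the word's own vector
    with the vectors of the context words we know about."""
    vecs = [STATIC[word]] + [STATIC[w] for w in context if w != word and w in STATIC]
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))

s1 = ["i", "visited", "apple", "today"]   # Apple, the company
s2 = ["it", "is", "apple", "season"]      # apple, the fruit

# Static vectors are identical in both sentences -> the two senses collapse.
assert static_vector("apple", s1) == static_vector("apple", s2)

# Contextual vectors differ -> each occurrence keeps its own sense.
print(contextual_vector("apple", s1))   # pulled toward "visited"
print(contextual_vector("apple", s2))   # pulled toward "season"
```

Real contextual models like BERT learn this context mixing from data rather than averaging fixed vectors, but the effect sketched here is the same in spirit.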

Beyond the technology itself, an industrial natural language processing system usually involves many modules. To solve an NLP problem, a series of operations is run in sequence: text cleaning, word segmentation, spelling correction, feature engineering, named entity recognition, classification, and so on. Each stage introduces errors that accumulate and ultimately affect the performance of the whole system. So when designing an NLP system, every link matters; none can be neglected.
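The compounding effect can be made concrete with a back-of-the-envelope calculation (the module accuracies below are invented for illustration; real figures vary by task and language):

```python
# Hypothetical illustration of error accumulation in a serial NLP pipeline.
pipeline = {
    "text cleaning":            0.99,
    "word segmentation":        0.97,
    "spelling correction":      0.98,
    "named entity recognition": 0.95,
    "classification":           0.96,
}

overall = 1.0
for module, accuracy in pipeline.items():
    overall *= accuracy          # errors compound multiplicatively
    print(f"after {module:<26} cumulative accuracy ~ {overall:.3f}")

# Even though every module is at least 95% accurate, the whole
# pipeline ends up noticeably worse -- roughly 86% here.
```

This is why improving any single weak link can matter more than polishing an already-strong one.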

3. Start with a simple machine translation

Machine translation is an important application scenario of natural language processing. Machine translation systems are now relatively mature, producing well-formed grammar, wording, and paragraph organization.

In the early stage, translation was rule-based: each word was mapped to its corresponding word in the target language, while taking grammar into account.

Later came methods based on probability and statistics, both traditional and deep-learning-based, such as machine translation systems built on generative methods. Corpora have also greatly promoted the development of machine translation: more accurate models can be trained on huge corpora. Since the reliability of AI models largely comes from the accuracy, quality, and volume of data, an effective model can be trained from a large amount of data.

A simple task is as follows:

Based on simple statistics, probability and elimination, the translation results are as follows:

The implementation above is based on probability and statistics: during translation, the system counts the probabilities with which words occur, summarizes a set of rules, and uses those rules to predict new data. Modern AI technology is mainly built on this methodology; almost all the models and methods you will use derive from the ideas of probability and statistics.

The simple machine translation example above performs a word-by-word translation, relying on a one-to-one correspondence between words in the two languages but ignoring syntax.
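A word-by-word scheme of this kind can be sketched in a few lines. The tiny lexicon and the example sentence are my own illustration, not the author's original example:

```python
# Word-by-word translation sketch: each token is looked up independently,
# so word order and syntax are ignored. The toy lexicon is hypothetical.
LEXICON = {
    "我": "I",
    "昨天": "yesterday",
    "去了": "went to",
    "北京": "Beijing",
}

def translate_word_by_word(tokens):
    """Map each source token to its dictionary entry, keeping source order."""
    return " ".join(LEXICON.get(tok, f"<{tok}>") for tok in tokens)

# Source word order survives, producing unnatural English:
print(translate_word_by_word(["我", "昨天", "去了", "北京"]))
# -> "I yesterday went to Beijing"
#    (a fluent system would reorder: "I went to Beijing yesterday")
```

The broken word order in the output is exactly the syntax problem the paragraph above describes.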

Examples are as follows:

The language model plays an important role in correcting grammar, and is central to machine translation and text generation.
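One classical way a language model injects grammatical preference is by scoring candidate outputs and favoring the more fluent word order. A minimal bigram sketch with add-one smoothing (the three-sentence corpus is invented for illustration):

```python
# Toy bigram language model: among candidate translations, prefer the
# word order the model assigns higher probability. The corpus is made up.
from collections import Counter

corpus = [
    "i went to beijing yesterday",
    "i went to school yesterday",
    "she went to beijing today",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def score(sentence):
    """Product of add-one-smoothed bigram probabilities."""
    words = ["<s>"] + sentence.split()
    vocab = len(unigrams)
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)
    return p

fluent = "i went to beijing yesterday"
literal = "i yesterday went to beijing"   # word-by-word order
assert score(fluent) > score(literal)     # the LM prefers the fluent order
```

Modern systems replace the bigram counts with neural language models, but the role of the score is the same.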

The application of natural language processing

Natural language processing has many application scenarios, including intelligent question answering system, text generation, machine translation and so on.

1. Intelligent question answering system

The applications are as follows:

A question and answer system usually involves retrieval, sorting, semantic understanding, semantic similarity matching and other modules.
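The retrieval and similarity-matching steps can be sketched with bag-of-words cosine similarity over a tiny FAQ (the questions and answers below are hypothetical examples):

```python
# Minimal retrieval-based QA sketch: match the user question against a
# small FAQ by bag-of-words cosine similarity. The FAQ entries are made up.
import math
from collections import Counter

faq = {
    "what is nlp": "NLP stands for natural language processing.",
    "what is machine translation": "Machine translation converts text between languages.",
    "how do chatbots work": "Chatbots combine retrieval, ranking and generation.",
}

def cosine(a, b):
    """Cosine similarity between two whitespace-tokenized strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer(question):
    """Retrieve: return the answer whose stored question is most similar."""
    best = max(faq, key=lambda q: cosine(question.lower(), q))
    return faq[best]

print(answer("What is NLP"))
# -> NLP stands for natural language processing.
```

A production system would add semantic matching (embeddings) and a ranking stage on top of this retrieval skeleton.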

2. Text generation

Including generating reports, marketing documents, generating machine translation results, generating text summaries, writing articles, writing poems and other specific application scenarios.

3. Machine translation

A machine translation system is a very classic application scenario, using generation, probability and statistics, semantic understanding, and other techniques. Current translation systems also perform reasonably well.

4. Sentiment analysis

Sentiment analysis is a classical application field of natural language processing with a long history. It can be treated as a binary or three-class classification problem, and the same framing extends to news topic classification, emotion classification, intent classification, and so on. In general, short texts are harder to classify than long texts, because long texts contain more information. Concrete applications include e-commerce customer satisfaction, intelligence analysis, prohibited-product detection, public opinion monitoring, quantitative investment, and other niche fields.
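As a minimal illustration of the classification framing, here is a lexicon-based sentiment sketch (the word lists are toy assumptions; production systems use trained models or much larger lexicons):

```python
# Minimal lexicon-based sentiment classifier sketch. Toy word lists only.
POSITIVE = {"good", "great", "excellent", "love", "satisfied"}
NEGATIVE = {"bad", "terrible", "poor", "hate", "disappointed"}

def sentiment(text):
    """Classify text as positive / negative / neutral by counting words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("great product I love it"))   # -> positive
print(sentiment("terrible and disappointed")) # -> negative
```

Note how the same three-way scheme carries over directly to topic or intent classification: only the labels and the features change.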

5. Chatbots

Chatbots can implement functions such as taking orders without human staff.

They can be divided into chitchat bots and task-oriented bots: chitchat bots use generative methods, including Seq2Seq, Transformer, and other models; task-oriented bots tend to use Slot Filling.

Chatbots can also be built in a way similar to question answering systems, that is, by retrieval.
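The Slot Filling approach for task-oriented bots can be sketched with simple regular expressions (the slot patterns and the ordering domain below are hypothetical illustrations; real systems use trained sequence-labeling models):

```python
# Minimal slot-filling sketch for a task-oriented ordering bot.
import re

SLOT_PATTERNS = {
    "quantity": r"\b(\d+)\b",
    "item":     r"\b(pizza|coffee|noodles)\b",
    "time":     r"\b(\d{1,2}:\d{2})\b",
}

def fill_slots(utterance):
    """Extract slot values from a user utterance via regex patterns."""
    slots = {}
    for slot, pattern in SLOT_PATTERNS.items():
        m = re.search(pattern, utterance.lower())
        if m:
            slots[slot] = m.group(1)
    return slots

print(fill_slots("Please deliver 2 pizza at 18:30"))
# -> {'quantity': '2', 'item': 'pizza', 'time': '18:30'}
```

Once the slots are filled, the dialogue manager only needs to ask for whichever slots are still missing.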

6. Fake news detection

In essence this is a binary classification problem, dividing news texts into real news and fake news, similar to sentiment analysis. Social network signals can also be incorporated to improve detection accuracy.

7. Text topic classification

News websites classify news articles by their text and then make personalized recommendations to users.

8. Information extraction

Given unstructured data (text, video, audio, and other data that cannot be stored in a traditional relational database), key information is extracted to form structured data and stored in a database. Each field can then serve as a feature, and as knowledge for an AI system to learn from.
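A minimal sketch of this idea, pulling structured fields out of a sentence with regular expressions (the sentence and the field patterns are invented for illustration; real extractors use NER models rather than hand-written regexes):

```python
# Small information-extraction sketch: unstructured text -> structured record.
import re

text = "Acme Corp appointed Jane Doe as CEO on 2021-03-15 in Shanghai."

record = {
    "company": re.search(r"^([A-Z]\w* [A-Z]\w*)", text).group(1),
    "person":  re.search(r"appointed ([A-Z][a-z]+ [A-Z][a-z]+)", text).group(1),
    "title":   re.search(r"as ([A-Z]+)", text).group(1),
    "date":    re.search(r"(\d{4}-\d{2}-\d{2})", text).group(1),
}
print(record)   # structured fields ready to store in a database
```

Each extracted field maps directly onto a database column, which is exactly the unstructured-to-structured step the paragraph describes.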

Beyond the applications above, there are many more scenarios, and even a single application can spawn many different tasks. Pick the subject that interests you most and dig into it yourself: you can systematically study the field's knowledge, or survey all the relevant articles. Hopefully, by the end of the series, you will have a deep understanding of some particular area, which is very necessary.

Core technology of natural language processing

1. Three dimensions of natural language processing technology

The three dimensions are:

  • Words: the meaning, part of speech, etc. of each word.
  • Sentence structure: syntactic analysis breaks a sentence into components according to the grammar of the language and produces a syntax tree, revealing the relationships between the different parts of the sentence.
  • Semantics: understanding the meaning behind a sentence.

2. Several key technologies of natural language processing

(1) Word Segmentation

Word segmentation divides a given sentence into a sequence of words.

Obviously, Chinese word segmentation is more complicated than English word segmentation. At the same time, word segmentation is among the simplest NLP tasks; it can be treated as a solved problem with high accuracy, providing a service to higher-level modules.
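A classic simple algorithm for Chinese word segmentation is forward maximum matching: at each position, greedily take the longest word found in the dictionary. A sketch with a toy dictionary (real segmenters use statistical or neural models on top of much larger lexicons):

```python
# Forward maximum-matching word segmentation sketch with a toy dictionary.
DICT = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
MAX_LEN = 4   # length in characters of the longest dictionary word

def forward_max_match(text):
    """Greedily match the longest dictionary word at each position,
    falling back to a single character when nothing matches."""
    result, i = [], 0
    while i < len(text):
        for j in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + j]
            if piece in DICT or j == 1:
                result.append(piece)
                i += j
                break
    return result

print(forward_max_match("南京市长江大桥"))
# -> ['南京市', '长江大桥']
```

The example sentence is the classic ambiguous case ("Nanjing Yangtze River Bridge" vs. "Nanjing mayor Jiang Daqiao"); greedy matching happens to resolve it correctly here, which is why more robust statistical methods are still needed in general.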

(2) Part-of-speech Tagging

Part-of-speech tagging is also a very basic and important task, and its results can serve as features for subsequent tasks.

(3) Semantic Understanding

The figure above shows the BERT model.

A better representation of each word leads to a better representation of each sentence; With a better representation of each sentence, there is a better representation of each text. This is a progressive process. Much of the value of NLP comes from semantic understanding.

(4) Named Entity Recognition

An entity is a real object that exists in real life, such as a person, place name, company name, organization name, time, position, product name, etc. Entities often carry important information that is important for semantic understanding. Entities need to be identified and annotated. It’s also the basis for chatbots, knowledge graphs, and more.

(5) Dependency Parsing

Dependency parsing is a very important part of grammatical analysis. It finds the dependency relationships between the parts of a sentence, which can uncover valuable information and plays a very important role in downstream tasks.

(6) Syntactic Parsing

Syntactic (constituency) parsing is also an important part of grammatical analysis: it analyzes the components and structure of a sentence to obtain a parse tree. It has few practical application scenarios, and its value is smaller than that of dependency parsing.

The techniques above can be regarded as relatively basic tasks, and their success or failure greatly affects the performance of subsequent tasks. For example, building a knowledge graph in the medical field requires strong named entity recognition; better analysis of short texts requires more powerful short-text semantic understanding modules; and so on.

3. An overview of natural language processing technologies

Terms related to NLP are as follows:

Conclusion

As a very hot field, artificial intelligence has been applied to many aspects of daily life. It divides into many subfields, among which natural language processing has developed rapidly and been widely applied in recent years. Technologies such as word segmentation, part-of-speech tagging, and semantic understanding have become quite mature, and intelligent question answering, machine translation, sentiment analysis, and so on are widely used. NLP therefore remains a promising direction for the future.

This article is reproduced from the Wenwoha community; original link: www.wenwoha.com/19/course_a… .