“This is the 7th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”
I. Overview of the field
Language is a phonetic and written system with shared encoding and decoding conventions, developed for everyday communication. Its charm and difficulty lie in the fact that different contexts give the same words different meanings, so understanding a dialogue requires matching the speaker's logic and experience. When two people lack similar thinking and background, communication becomes halting and ambiguous, which greatly raises its cost. With the development of transportation and information technology, cultural exchange between people is no longer the main challenge. Ever since computers began serving humans, people have tried to communicate with them in natural language: we hope to talk to machines the way we talk to each other, and making machines understand language has become a prerequisite for the development of artificial intelligence. Thus natural language processing was born. NLP (Natural Language Processing), as its name implies, studies how to process natural language. Simply put, a machine must learn what people express, how to understand it, and how to reply the way a person would. Natural language is the spoken and written system we use daily, and teaching machines to handle it takes a great deal of training.
Natural language processing is an important field of computer science and artificial intelligence. However, neither understanding nor producing natural language is as simple as one might imagine. We record everyday experiences and ideas in text, and once a document is formed we often need to do more than read it: translate it into another language, summarize its content, find the answer to a question within it, or extract the things it mentions and the relationships between them. Humans can handle all of these needs by reading, but the sheer volume of literature far exceeds human word-processing capacity. The invention of the computer in the 1940s made it possible to process text by machine rather than by hand, and as early as the 1950s natural language processing became a field of study within computer science. Until the 1980s, however, NLP systems were based on complex sets of hand-written rules that computers executed mechanically, or performed only simple calculations such as character matching and word frequency statistics. The rise of machine learning in the late 1980s brought new ideas to NLP: rigid hand-crafted rules were increasingly replaced by flexible, probability-based statistical models. In recent years, with the development of deep learning, multi-layer neural networks have also been introduced into NLP and become a mainstream problem-solving technique.
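Word frequency statistics, one of the simple calculations those early systems performed, can be sketched in a few lines of Python. This is an illustrative example, not code from any particular historical system; the crude punctuation stripping is an assumption made for brevity.

```python
from collections import Counter

def word_frequencies(text: str) -> Counter:
    """Count word occurrences after a crude lowercase/strip tokenization."""
    tokens = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    return Counter(t for t in tokens if t)

freqs = word_frequencies("The cat sat on the mat. The mat was flat.")
print(freqs.most_common(2))  # -> [('the', 3), ('mat', 2)]
```

Even this trivial statistic was useful for tasks such as indexing and keyword extraction, which is why it survived from the rule-based era into statistical NLP.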
Currently, NLP deals with a multitude of problems, each of which must be discussed in the context of a concrete scenario and specific requirements: how to translate between products and users' needs; how to analyze the question typed into a search box and match the most relevant results; how to understand a driver's request and plan a reasonable route; how to capture the contextual semantics of another language and produce a Chinese translation that matches how its speakers think; how a smart speaker can understand hundreds of millions of spoken instructions and quickly match each to the right action or dialogue; and how to integrate many kinds of information and requirements and reorganize them into a newly generated article. These problems nevertheless have much in common, which allows us to group them into task types such as text categorization, text matching, sequence labeling, and reading comprehension. For these tasks, NLP practitioners have explored a variety of methods corresponding to different technologies. Now that intelligent voice interaction has become a new habit of daily life in China, and language is less and less a barrier to communication, NLP has quietly woven itself into people's lives, shaping habits and promoting the development of artificial intelligence.
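To make one of these task types concrete, here is a minimal sketch of text categorization using a multinomial Naive Bayes classifier over bag-of-words counts. This is a textbook statistical method chosen for illustration, not a method the article prescribes; the class name, the toy sentiment data, and the whitespace tokenization are all assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over bag-of-words counts, Laplace-smoothed."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        words = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + sum of Laplace-smoothed log likelihoods
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy sentiment data, purely for illustration.
clf = NaiveBayesTextClassifier().fit(
    ["great movie loved it", "terrible plot awful acting",
     "loved the acting", "awful movie"],
    ["pos", "neg", "pos", "neg"],
)
print(clf.predict("loved the movie"))  # -> pos
```

Modern systems replace the bag-of-words counts with learned neural representations, but the task framing — map a text to one label from a fixed set — is the same.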
II. Main problems
1. Difficulty in understanding natural language
Natural language understanding is an important part of natural language processing, and its quality directly affects the correctness of all subsequent processing. At present, however, it still faces real difficulties. On the one hand, understanding a sentence requires more than its grammar and literal meaning; it often draws on a great deal of other knowledge, including everyday common sense, specialized domain knowledge, and knowledge updated in real time. All of this cannot be stored completely in a computer, nor can a computer flexibly and quickly call on the right knowledge to resolve the semantics. On the other hand, a natural language understanding system can usually analyze only isolated sentences over a limited range of vocabulary, sentence patterns, and topics. The influence of surrounding context, tone, and the speaker's personality on sentence meaning is mostly ignored, and natural language grammar is frequently ambiguous. The semantic problems caused by polysemy, ellipsis, and pronoun reference still lack systematic study, and inconsistent annotation standards across human-labeled sentiment datasets keep sentiment analysis results from reaching the desired level.
2. The dilemma of NLP under low-resource conditions
According to UNESCO, of the more than 7,000 recorded languages in the world, more than 400 are endangered and more than 200 are close to it. With economic and social development, many minority languages rich in linguistic features and cultural connotations are rapidly disappearing; preserving this original linguistic and cultural information, and with it a precious endangered heritage, is an urgent goal of our era. However, NLP research concentrates on the twenty or so popular world languages, such as English and Chinese, and rarely touches low-resource corpora such as minority languages and regional dialects. NLP training generally requires large annotated datasets, yet for these languages only a small amount of high-quality spoken corpus exists. Relying on such monolingual data alone, researchers cannot recover the semantics behind the text, and transferring NLP capabilities from high-resource to low-resource languages remains a major open problem. This makes it hard to carry out related learning and research efficiently, let alone sustain the survival of these minority languages and the vitality of the local cultures behind them.
3. The task of detecting and tracking false information is arduous
False information detection aims to verify news reports with artificial intelligence, identifying fraudulent rumors and fabricated stories; it is a hot topic in applied NLP. A typical detection model collects data from open online social media, analyzes the news content, user behavior attributes, and propagation patterns, and evaluates the credibility of a story from the reputational characteristics of its source; news published by government websites and recognized authoritative media is treated as true by default. In the early stage of detection, however, when a story has been released on news channels but has not yet spread on social media, propagation and user-behavior features cannot be extracted, because they take time to accumulate, so fresh false news cannot be detected promptly. Hand-constructed features, moreover, suffer from one-sided coverage and waste of manpower.
After false information emerges, tracking online public opinion becomes crucial to controlling it: only by following the evolution path of public opinion effectively can losses be stopped in time. NLP has made many attempts to study evolution paths based on topic models or social network analysis, but all have methodological shortcomings. First, many studies consider only a single type of text, which leads to incomplete path mapping. Second, topic changes between periods cannot be determined automatically, which makes dynamic topic tracking difficult. Third, data mining methods are not combined effectively with visual analysis, which limits the efficiency and flexibility of mapping. Tracking valuable information through the evolution of topics, emotions, and behavior remains challenging.
III. Application and prospects
NLP is an important link in the field of artificial intelligence, and its progress will drive the development of AI as a whole. Over the past two decades, NLP has made great strides in many areas by drawing on advances in machine learning and deep learning research. The next decade promises to be a golden age of NLP development.
In the future, text big data from various industries will be better collected, processed, and stored; demand for NLP from search engines, customer service, business intelligence, voice assistants, translation, education, law, finance, and other fields will grow significantly; and the quality demanded of NLP will rise accordingly. Multimodal fusion of text, voice, and image data will gradually become a requirement for future robots.
Therefore, NLP research will be tilted towards the following aspects:
- Bringing knowledge and common sense into current data-driven learning systems.
- Learning methods for low-resource NLP tasks.
- Context modeling and multi-turn semantic understanding.
- Interpretable NLP based on semantic analysis, knowledge, and common sense.
An ideal future NLP system might adopt the following generic framework for natural language processing:
(1) First, basic processing is carried out on the given natural language input, including word segmentation, part-of-speech tagging, dependency parsing, named entity recognition, intent/relation classification, and so on.
(2) Second, an encoder converts the input into a corresponding semantic representation. In this process, pre-trained word embeddings and entity embeddings expand the information carried by the words and entity names in the input, while pre-trained multi-task encoders can encode the input sentences and fuse the different encodings through transfer learning.
(3) Next, based on the semantic representation output by the encoder, task-specific decoders generate the corresponding output. Multi-task learning can also be introduced, bringing related tasks into the training of the main task as auxiliary tasks. If multi-turn modeling is required, important information from the current turn's output should be recorded in a database and used in subsequent understanding and reasoning.
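The three stages above can be sketched schematically in Python. This is a toy illustration of the framework's shape only: the hash-derived "embedding" is a stand-in for real pre-trained embeddings, the mean-pooling encoder for a real neural encoder, and the untrained linear scorer for a real task decoder; all function names are assumptions.

```python
import hashlib

def basic_processing(text):
    """Stage 1: crude tokenization stands in for segmentation/tagging/NER."""
    return text.lower().split()

def embed(token, dim=8):
    """Deterministic pseudo-embedding derived from a hash of the token
    (a placeholder for a pre-trained embedding lookup)."""
    digest = hashlib.md5(token.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def encode(tokens, dim=8):
    """Stage 2: mean-pool token embeddings into one semantic vector."""
    vectors = [embed(t, dim) for t in tokens] or [[0.0] * dim]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def decode(representation, weights, labels):
    """Stage 3: a linear task decoder scoring each candidate label."""
    scores = [sum(w * x for w, x in zip(row, representation))
              for row in weights]
    return labels[scores.index(max(scores))]

# The pipeline: raw text -> tokens -> semantic vector -> task output.
rep = encode(basic_processing("Named entity recognition example"))
print(len(rep))  # -> 8
```

The point of the sketch is the separation of concerns: as long as the encoder's output representation has a fixed interface, decoders for different tasks (and auxiliary multi-task heads) can be swapped in without touching the earlier stages.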
Obviously, a lot of work needs to be done to implement this ideal NLP framework:
(1) Build large-scale common sense databases and clearly promote relevant research through meaningful evaluation;
(2) Study more effective encodings of words, phrases, and sentences, and build more powerful pre-trained neural network models;
(3) Advance unsupervised and semi-supervised learning, considering the use of small amounts of human knowledge to enhance learning ability, and build new methods of cross-lingual embedding;
(4) Demonstrate the effectiveness of multi-task learning and transfer learning in NLP tasks more convincingly, and strengthen the role of reinforcement learning, for example in multi-turn dialogue for automated customer service;
(5) Develop effective discourse-level modeling, multi-turn session modeling, and multi-turn semantic analysis;
(6) Take user factors into account in system design to achieve user modeling and personalized output;
(7) Build a new generation of expert systems based on domain knowledge and common knowledge, combining inference systems, task-solving systems, and dialogue systems;
(8) Use semantic analysis and knowledge systems to improve the interpretability of NLP systems.
In the future, NLP, together with other artificial intelligence technologies, will profoundly change human life. Of course, the road to a bright future is always winding: to realize it we need bold innovation, rigor, and steady, solid progress as we move together into the next brilliant decade of NLP.