Natural Language Processing (NLP) is an interdisciplinary subject that integrates computer science, artificial intelligence and linguistics; the relationship among the three is shown in Figure 1-1. It studies how, through techniques such as machine learning, computers can learn to process human language, and ultimately to understand human language or achieve artificial intelligence.
In fact, there is no widely accepted definition of the term natural language processing. Scholars who care about structure prefer the term Computational Linguistics (CL), while those who care about the end goal prefer Natural Language Understanding (NLU). Because NLP sounds more engineering-oriented, this book sticks with that term rather than delving into the similarities and differences among them.
Natural language processing has always been a difficult subject. Although language is only one part of artificial intelligence (which also includes computer vision, for example), it is very unique. There are many creatures on this planet with visual systems that surpass ours, but only humans have such advanced language. The goal of natural language processing is to let computers process or “understand” natural language in order to perform meaningful tasks, such as booking airline tickets, shopping or simultaneous interpretation. Fully understanding and expressing language is extremely difficult; perfect language understanding is equivalent to achieving artificial intelligence.
In this chapter, we will take a bird's-eye view of natural language processing and introduce some basic concepts.
① The Turing Test is a famous test to determine whether a machine has artificial intelligence based on whether it can understand language as well as humans.
② Smith N. A. Linguistic Prediction [J]. Synthesis Lectures on Human Language Technologies, 2011, 4(2): 1-274.
1.1 Natural languages and programming languages
Natural language, the object we deal with, is highly flexible. We are so familiar with our own language that it is hard to appreciate its complexity, just as water is transparent to fish. Let us compare natural language with artificial language and see how difficult it is for computers to understand our language.
1.1.1 Vocabulary
Words in natural languages are far richer than keywords in programming languages. In the programming languages we are familiar with, the number of keywords is limited and fixed: for example, the C language has 32 keywords and the Java language has 50. Although we are free to name variables, functions and classes, these names are merely distinguishing symbols in the eyes of the compiler; they carry no semantic information and do not affect the result of the program. In natural language, however, the vocabulary we can use is endless, and there are hardly any two words with exactly the same meaning. Take Chinese as an example: the List of Common Words in Modern Chinese (Draft) published by the National Language Commission contains 56,008 entries in total. In addition, we can create all kinds of new words at any time, not just nouns.
1.1.2 Structure
Natural languages are unstructured, while programming languages are structured. Being structured means that information has clear structural relationships, such as classes and members in programming languages, or tables and fields in databases, which can be read and written through well-defined mechanisms. As an example, let's look at two ways of expressing the same fact. In an object-oriented programming language, we can write:
class Company(object):
    def __init__(self, founder, logo) -> None:
        self.founder = founder
        self.logo = logo

apple = Company(founder='Steve Jobs', logo='apple')
Thus, programmers can get the founder and logo of Apple through apple.founder and apple.logo. In this way, a programming language provides a template for hierarchically structured information through the class Company, whereas natural language has no such explicit structure. Human language is a linear string. Given the sentence “Apple was founded by Steve Jobs and its logo is an apple”, the computer needs to analyze it and reach the following conclusions:
● The first “Apple” refers to Apple Inc., while the second “apple” refers to the bitten-apple logo;
● “Jobs” is a personal name;
● “It” refers to Apple;
● The relationship between Apple and Steve Jobs is founder, and the relationship between Apple and the bitten-apple logo is logo.
These conclusions involve Chinese word segmentation, named entity recognition, coreference resolution and relation extraction, respectively. None of these tasks are currently as accurate as humans. It can be seen that a simple sentence for humans is not easy for computers to understand.
1.1.3 Ambiguity
Natural language contains a great deal of ambiguity, which is resolved into a specific meaning only in a particular context. For example, the sense of a polysemous Chinese word can only be determined in context, and people even deliberately exploit ambiguity to create humorous effects. In addition to the two meanings of “apple” mentioned above, the Chinese word “意思” (yisi, roughly “meaning”) also has many senses. Take this classic joke as an example.
He said, “She is really interesting (funny).” She said, “He is quite interesting (funny) too.” So people thought there was something (wish) between them and asked him to show a little token of his feelings (express). He got angry: “That's not what I meant (thought) at all!” She got angry too: “What do you mean (intention) by saying that?” Afterwards, some said it was interesting (funny); others said it was meaningless (nonsense). (See the sixth edition of Life Daily, November 13, 1994) [Wu Weitian, 1999] ①
In this example, the different senses of “意思” are annotated in English, which also shows that in this respect Chinese is harder to process than English.
In programming languages, however, there is no ambiguity. If a programmer inadvertently writes ambiguous code, such as two functions having the same signature, a compilation error can be triggered.
1.1.4 Fault tolerance
Even the language in books and periodicals, after repeated proofreading by editors, is not completely free of mistakes. Text on the Internet is far more casual: typos, broken sentences and irregular punctuation can be seen everywhere. Yet people can still guess the meaning of a sentence even when it is riddled with errors. In a programming language, by contrast, the programmer must ensure that spelling and syntax are absolutely correct, otherwise the compiler will either issue relentless errors or the program will hide potential bugs.
In fact, how to deal with the irregular text of social media has become a new research topic, distinct from the carefully edited texts of journalism.
① Excerpted from "Statistical Natural Language Processing" by Zong Chengqing.
② Programming languages are deliberately designed as unambiguous deterministic context-free grammars that can be parsed in O(n) time, where n is the length of the text.
1.1.5 Variability
Every language is constantly evolving, but programming languages change slowly and gently, while natural languages change relatively quickly and noisily.
A programming language is invented and maintained by a specific person or organization. Take C++: it was invented by Bjarne Stroustrup and is now maintained by the C++ standards committee. From C++98 to C++03 to C++11 and C++14, the language standard changes on a timescale of years, and each new version is largely backward compatible with the old one, with only a few deprecated features.
Natural languages are not invented or standardized by any one person or organization. In other words, any natural language is determined by all human beings. Although there are norms such as Mandarin and simplified Chinese characters, each of us is free to create and spread new words and uses, and we are constantly giving new meanings to old words, resulting in a huge gap between ancient Chinese and modern Chinese. In addition, Chinese is constantly absorbing vocabulary from foreign languages such as English and Japanese, and exporting Chinglish such as Niubility. These changes are continuous and take place all the time, presenting no small challenge to natural language processing. This is why natural language is called “natural” even though it is a human invention.
1.1.6 Conciseness
Because the speed of speaking and listening, writing and reading is limited, human language tends to be concise, omitting a great deal of background knowledge or common sense. For example, we say “see you at the old place” to friends without specifying where the “old place” is. For organization names we often use abbreviations, such as “Industrial Bank” and “Local Taxation Bureau”, assuming the other party is familiar with them. If something has been mentioned as the topic, pronouns are used in the text that follows. In consecutive news reports, or across the pages of a book, previously stated facts are not repeated, on the assumption that the reader already knows them. Such omission of common sense, which both parties share but the computer may not possess, also poses a barrier to natural language processing.
1.2 Levels of natural language processing
According to the granularity of the objects to be processed, natural language processing can be roughly divided into several levels as shown in Figure 1-2.
This section provides an overview of the definitions of these natural language processing tasks.
1.2.1 Speech, Image and Text
A natural language processing system has three sources of input: speech, image and text. Although speech and images are attracting more and more attention, due to storage capacity and transmission speed they still carry less information than text. Moreover, these two forms are generally converted into text by recognition before further processing; the respective techniques are called Speech Recognition and Optical Character Recognition. Once converted to text, the subsequent NLP tasks can proceed. Text processing is therefore the top priority.
1.2.2 Chinese word segmentation, part-of-speech tagging and named entity recognition
These three tasks all analyze words, so they are collectively called lexical analysis. The main job of lexical analysis is to split the text into meaningful words (Chinese word segmentation), determine the category of each word and perform shallow disambiguation (part-of-speech tagging), and identify longer proper nouns (named entity recognition). For Chinese, lexical analysis is usually the basis of subsequent higher-level tasks: in pipeline-style systems①, errors in lexical analysis propagate to the tasks that follow. Fortunately, Chinese lexical analysis is relatively mature and has basically reached the level of industrial use.
① That is, the output of the previous system is the input of the next, and earlier systems do not depend on later ones.
As a basic task with abundant resources, lexical analysis will be elaborated in later chapters of this book. In addition, since this is the first NLP task the reader encounters, it will lead to many interesting models, algorithms and ideas. Lexical analysis is therefore not only the foundational task of natural language processing; its chapters will also become the foundation of the reader's knowledge system.
1.2.3 Information extraction
After lexical analysis, the text begins to take on some structure: at the very least, the computer no longer sees one long string but a list of meaningful words, each with its own part of speech and other labels.
Based on these words and labels, we can extract useful information, from simple high-frequency words to keywords extracted by more advanced algorithms, and from company names to professional terms; a great deal of word-level information can be obtained. We can also extract key phrases and even key sentences based on statistical information between words, providing information at a coarser granularity.
It is worth mentioning that some statistics used by information extraction algorithms can be reused for other tasks, which will be described in detail in the corresponding sections.
1.2.4 Text classification and text clustering
After breaking the text down into a series of words, we can also do a series of analyses at the article level.
Sometimes we want to know whether a passage is positive or negative, determine whether an email is spam, or sort many documents by topic; this NLP task is called text classification.
At other times, we just want to file similar texts together or exclude duplicate documents without caring about specific categories, a task called text clustering.
These two tasks look similar, but are actually two distinct schools of algorithms, which we’ll cover in separate sections.
1.2.5 Parsing
Lexical analysis can only obtain piecemeal lexical information; the computer still does not know the relationships between the words. In some question answering systems, it is necessary to obtain the subject-verb-object structure of the sentence. For example, in the sentence “query the internal medicine patients treated by Dr. Liu”, what the user really wants to query is not “Dr. Liu”, nor “internal medicine”, but “patients”. Although all three are nouns, and “Dr. Liu” is even closest to the verb of intent, “query”, only “patients” is the object of “query”. The syntactic information shown in Figure 1-3 can be obtained through syntactic analysis.
We can see in Figure 1-3 that there is indeed a long arrow linking “query” with “patient” and indicating the verb-object relationship between them. The tree structure above and the implementation of the parser are described in more detail in the following sections.
Syntactic analysis is useful not only for question answering systems and search engines; it is also often used in phrase-based machine translation to reorder the words of the translated text. For example, the Chinese sentence for “I eat apples” is translated into Japanese as “私はりんごを食べる”, whose word order is different but whose syntactic structure is the same.
1.2.6 Semantic analysis and discourse analysis
Compared with syntactic analysis, semantic analysis focuses on semantics rather than grammar. It includes word sense disambiguation (determining the meaning of a word in context, rather than a simple part of speech), semantic role labeling (labeling the relationship between predicates and other components in a sentence) and semantic dependency analysis (analyzing the semantic relationships between words in a sentence).
The difficulty of these tasks increases in the order listed, and they belong to more advanced topics; even the most cutting-edge research has not yet reached a practical level of precision. In addition, the relevant research resources are scarce and hard for the public to obtain, so this book does not cover them.
1.2.7 Other Advanced Tasks
In addition to the above “tool-like” tasks, there are a number of comprehensive tasks that are more closely related to end-application level products. Such as:
● Question answering: directly answering a question based on information in a knowledge base or in text, as in Microsoft's Cortana and Apple's Siri;
● Automatic summarization: generating a short summary for a long document;
● Machine translation: translating a sentence from one language into another.
Note that Information Retrieval (IR) is generally considered a discipline separate from natural language processing. Although the two are closely related, the goal of IR is to find information, while the goal of NLP is to understand language. Moreover, the object of IR is not necessarily language: it can also be the search of pictures, music, products, or indeed any kind of information. There are also plenty of real-world scenarios where retrieval can be performed without language understanding, such as LIKE in SQL.
As a primer, this book does not discuss these high-level tasks, but understanding the big picture of natural language processing can help us broaden our horizons and find our orientation.
1.3 Schools of natural language processing
The previous section compared the similarities and differences between natural and artificial languages, showed the difficulties of natural language processing, and introduced some common NLP tasks. This section briefly introduces several different approaches to natural language processing.
1.3.1 Rule-based expert system
A rule is a deterministic procedure crafted by hand by experts. From the regular expressions that programmers use every day to the autopilot of an airplane①, these are all fixed rule systems.
In natural language processing, a relatively successful case is the Porter Stemming Algorithm, proposed by Martin Porter in 1980 and widely used for English stemming. The algorithm consists of several rules, each of which is a series of fixed if-then branches: when a word satisfies a condition, a fixed procedure is executed and a fixed result is output. Some of these rules are summarized in Table 1-1.
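To make the idea of fixed if-then branches concrete, here is a toy sketch loosely modeled on a few Porter-style suffix rules; it is only an illustration under simplifying assumptions, not the actual Porter algorithm, whose rules carry additional conditions on the stem.

# A toy sketch of fixed if-then stemming rules (not the real Porter algorithm).
def toy_stem(word):
    if word.endswith('sses'):                              # caresses -> caress
        return word[:-2]
    if word.endswith('ies'):                               # ponies -> poni
        return word[:-2]
    if word.endswith('ing') and any(v in word[:-3] for v in 'aeiou'):
        return word[:-3]                                   # hopping -> hopp (Porter applies further steps)
    return word                                            # no rule matched: return the word unchanged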
Expert systems require their designers to have a deep understanding of the problem at hand and to enumerate, as far as humanly possible, every scenario that may arise. Their biggest weakness is that they are hard to extend: conflicts tend to occur when the number of rules grows or when multiple experts maintain the same system. For example, even in the simple three-rule system of Table 1-1, rule 1 and rule 2 conflict: a word like feed satisfies the conditions of both. In such cases, expert systems usually rely on rule priorities, for example making rule 1 take precedence over rule 2 so that other rules are ignored once rule 1's condition is met. A few dozen rules are manageable, but as the number of rules and of team members grows, the compatibility issues to be considered become more and more complex and maintenance costs rise; the system cannot scale.
Most linguistic phenomena are far more complex than English stems, as we have already seen. They do not necessarily follow rules, and they change over time, which makes rule systems appear rigid, brittle and unstable.
① Unlike driverless-car technology for automobiles, an aircraft's autopilot can only handle predetermined situations; in abnormal circumstances it raises an alarm or switches to manual control.
② In the examples below, feed is a special case rather than a past tense, so no substitution should be performed; bled is the past tense of bleed and should not have "ed" removed; sing is not a present participle and "ing" should not be removed.
1.3.2 Statistics-based learning method
To reduce reliance on experts and adapt to flexible language problems, people use statistical methods to make computers learn languages automatically. The so-called “statistics” refers to the statistics carried out on the corpus. A corpus is a structured text that is manually annotated, as we will discuss in the next section.
Because natural language is so flexible, even linguists cannot write down a complete set of rules, and even if a perfect rule set existed, it would struggle to keep up with the evolution of language over time. Since natural language cannot be described in a programming language, clever people decided to let machines learn the rules from examples, and then apply those rules to new, unknown examples. In natural language processing, “giving examples” means “annotating a corpus”.
Statistical learning is another name for machine learning, the mainstream approach of contemporary artificial intelligence. Machine learning is so important to natural language processing that, to some extent, natural language processing can be viewed as an application of machine learning. Here we simply understand it as “learning from examples”; the following chapters will study it systematically and in depth.
1.3.3 History
Since natural language processing is an application layer of machine learning, its history, like that of artificial intelligence, has evolved from logical rules to statistical models. Figure 1-4 lists several important periods in this history.
Artificial intelligence and natural language processing were both in their infancy in the 1950s, and there was a great deal of ground-breaking work. The most prominent examples are mathematician Alan Turing's paper “Computing Machinery and Intelligence”, which proposed a sufficient condition for artificial intelligence, the Turing Test, and linguist Chomsky's Syntactic Structures, which argued that sentences are generated by context-independent, universal grammatical rules. Interestingly, the pioneers' early estimates and theories were overly optimistic: Turing's prediction that by 2014 a computer with 1 GB of memory would have a 70% chance of remaining undetected in a five-minute test has not come true, and Chomsky's “universal grammar” has been criticized for ignoring semantics and has been revised in later theories. Both artificial intelligence and natural language processing still have a long way to go.
Until the 1980s, the dominant approach was the rule system, in which domain-specific rule sets were hand-written by experts. Computers and programming languages had just been invented, and programming was the preserve of elite academics, who were ambitious enough to believe that computers could be made intelligent simply by writing programs. Representative efforts include BASEBALL from MIT's AI Lab and LUNAR from BBN, which answered questions about North American baseball games and about rock samples from the Apollo moon missions, respectively. There were many similar question answering systems in this period, all expert systems that relied heavily on hand-written rules. Take BASEBALL as an example. Its part-of-speech tagging module judged the part of speech of score as follows: “if the sentence contains no other verb, score is a verb; otherwise it is a noun.” The system then relied on part-of-speech rules to combine noun phrases, prepositional phrases and adverbial phrases, and its grammar module recognized passive sentences, subjects and predicates with rules such as “if the last verb is the main verb and follows to be …”. Next, dictionary rules converted this information into “attribute = value” or “attribute = ?” pairs used to represent documents and questions in the knowledge base. Finally, rules such as “if all attributes other than the one with the question mark match, output the attribute requested by the question from the document” matched questions to answers. Such rigid rules meant the system could only handle fixed question patterns; it could not handle logic such as “and” and “or”, nor comparatives or time ranges. Rule systems of this kind are therefore regarded as “toys”. To express such rule logic, the Prolog (Programming in Logic) language was invented in 1972 specifically for building knowledge bases and expert systems.
After the 1980s, statistical models revolutionized artificial intelligence and natural language processing, and people began to annotate corpora for developing and testing NLP modules: in 1988 hidden Markov models were applied to part-of-speech tagging, in 1990 IBM published the first statistical machine translation system, and in 1995 the first robust (statistics-based) syntactic parser appeared. In pursuit of higher accuracy, people annotated ever larger corpora (the TREC question answering corpus, the CoNLL corpora for named entity recognition, semantic role labeling and dependency parsing). Larger corpora and better hardware in turn attracted people to more complex models. By 2000, a number of machine learning models such as the perceptron and conditional random fields were in widespread use. Instead of relying on rigid rule systems, people expected machines to learn the rules of language automatically; to improve a system's accuracy, one either switched to a more advanced model or annotated more data. NLP systems could now be scaled up robustly instead of depending on expert-written rules. But experts still had work to do: drawing on linguistic knowledge to design feature templates for statistical models (that is, representing the corpus in a form computers can easily digest) became the fashion of the day, a practice known as “feature engineering”. In 2010, the SVM-based TurboParser achieved 92.3% accuracy① in dependency parsing on the Penn Treebank, the state of the art at that time. This book focuses on practical statistical models and implementations that are not out of reach: they can be implemented and run on ordinary hardware.
After 2010, corpus sizes and hardware computing power both improved dramatically, creating the conditions for the revival of neural networks. As annotated data grew, the gains of traditional models became less and less pronounced and more complex models were needed, so deep neural networks returned to researchers' field of vision. Neural networks are still a type of statistical model, and their theory was laid down around the 1950s: in 1951 Marvin Lee Minsky built the first machine to simulate a neural network; in 1958 Rosenblatt proposed the famous perceptron, a neural network model that simulates human perception; and in 1989 Yann LeCun at Bell Labs trained a deep convolutional neural network to recognize handwritten digits using a U.S. Postal Service data set. Limited by computing power and data, neural networks were not widely adopted until around 2010, under the new name “deep learning”, which distinguishes them from the earlier shallow models. The charm of deep learning is that it no longer relies on experts to craft feature templates but automatically learns abstract representations of raw data, so it is mainly used for representation learning. As an introductory book, we only introduce some of its concepts and applications in the last chapter, as a bridge between traditional methods and deep learning.
① To be exact, it is the Unlabeled Attachment Score (UAS) ignoring punctuation under the Stanford standard, which will be described in detail in Chapter 12.
1.3.4 Rules and Statistics
Pure rule systems have become obsolete, and outside of a few simple tasks expert systems have fallen out of fashion. In the 1970s, Frederick Jelinek, later a member of the US National Academy of Engineering, was developing a speech recognition system at IBM when he remarked, “Every time I fire a linguist, my speech recognition system gets a little more accurate.”① The remark is a bit harsh, but it is fair to say that as machine learning matures, the role of domain experts grows smaller and smaller.
In practical engineering, linguistic knowledge plays a role in two ways: it helps us design more concise and efficient feature templates, and it guides corpus construction. In fact, production systems still use some hand-written rules for pre-processing and post-processing, and there are special cases that are simply easier to handle with rules.
This book respects engineering practice and introduces the construction of practical NLP systems in a way that is primarily statistical and supplemented by rules.
1.3.5 Traditional methods and deep learning
Although deep learning has made great strides in computer vision, it has not pulled far ahead on the basic tasks of natural language processing. This conclusion may come as a surprise, but for a data science practitioner, data is the best way to illustrate the point. Table 1-2 shows the state-of-the-art accuracies of part-of-speech tagging on the Wall Street Journal corpus.
"Every time I fire a linguist, the performance of the speech recognizer goes up". ② "author's last name (year)" is a common paper citation format, which can be searched by this information (including subject keywords if necessary).Copy the code
Up to 2015, all systems other than bi-LSTM-CRF were traditional models, with a top accuracy of 97.36%, while the bi-LSTM-CRF deep learning model reached 97.55%, only 0.19 percentage points higher. In 2016, the traditional NLP4J system achieved 97.64% by using additional data and a dynamic feature induction algorithm.
A similar situation repeats itself in syntactic parsing, as shown in Table 1-3, using accuracy on the Penn Treebank under the Stanford standard.
In 2014, the first neural-network-driven syntactic parser was less accurate than TurboParser; after several years of development its accuracy finally reached 95.7%, 3.4 percentage points higher than the traditional algorithm. This is a very significant achievement in academic circles, but less so in practical use.
On the other hand, deep learning involves a great deal of matrix computation and requires special hardware (GPU, TPU, etc.) for acceleration. At present an entry-level tower server costs around 3,000 yuan and a virtual server only about 50 yuan a month, but an entry-level compute graphics card alone costs 5,000 yuan. In terms of cost-effectiveness, traditional machine learning methods are more suitable for small and medium-sized enterprises.
In addition, the migration from traditional methods to deep learning cannot happen overnight. The two are related as basics and advanced topics: many fundamental concepts are easier to grasp through traditional methods, and they are reused in deep learning (for example, the combination of CRFs with neural networks). Both traditional models and neural networks belong to machine learning. Mastering traditional methods not only solves engineering problems when computing resources are limited, but also lays a solid foundation for tackling deep learning later.
1.4 Machine learning
In the previous sections we encountered some machine learning terminology. Following this book's recursive learning route, let us now take a recursive look at the basic concepts of machine learning.
While this book focuses on natural language processing and does not have a detailed section on machine learning, it will introduce machine learning algorithms under the hood when appropriate. Machine learning is the cornerstone of natural language processing, and some basic concepts still need to be mastered in advance. Mastering these terms also helps us communicate fluently with others.
1.4.1 What is Machine learning
In 1959, Arthur Samuel, a pioneer of artificial intelligence, defined machine learning as the study of giving computers the ability to learn without being explicitly programmed.
Smart readers have probably wondered whether computers can only execute steps designed by humans. Machine learning gives a positive answer to this question: machines can learn to improve their capabilities without programmers hard-coding those capabilities. Tom Mitchell, a member of the US National Academy of Engineering, gives a more precise definition: machine learning is the study of algorithms that allow a computer program to improve its performance on a task through experience (data).
In short, machine learning is the algorithm that lets machines learn algorithms. That sounds circular, so consider the familiar database analogy: metadata is data that describes data (table names, fields and so on), while a row in a table is plain data. By analogy, a machine learning algorithm can be called a “meta-algorithm”: it instructs the machine to automatically learn another algorithm, which is then used to solve the actual problem. To avoid confusion, the learned algorithm is usually called a model.
1.4.2 Model
A model is a mathematical abstraction of a real-world problem, consisting of a hypothesis function and a set of parameters. As a simple example, suppose we want to predict the gender of Chinese names. Assume the gender is determined by the sign of the output of a function f(x), with a negative value indicating female and a non-negative value indicating male.
The definition of f(x) we choose is as follows:

f(x) = w · x + b        (1.1)
Here w and b are the parameters of the function, and x is its independent variable. The model refers to the function as a whole, including its parameters, but not a specific value of x, because the independent variable is supplied by the user. The independent variable x is a feature vector, used to represent the features of an object.
Readers may think of Eq. (1.1) as the linear equation from junior high school, the plane equation from senior high school, or a hyperplane equation in a higher-dimensional space. In any case, do not worry about its abstractness; we will implement this case completely in code in Chapter 5.
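As a preview, here is a minimal sketch of prediction with Eq. (1.1), assuming the parameters w and b have somehow already been obtained (Chapter 5 shows how to learn them from data):

import numpy as np

def predict_gender(x, w, b):
    # f(x) = w . x + b; a non-negative value means male, a negative value means female
    return 'male' if np.dot(w, x) + b >= 0 else 'female'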
1.4.3 Features
A feature is a characteristic of a thing converted into a value. For example, a cow's features are four legs and zero wings, while a bird's features are two legs and one pair of wings. So what features of a Chinese name are useful for gender identification?
First of all, for a Chinese name, the surname has nothing to do with gender; what really matters is the given name. Moreover, the computer does not know which part is the surname and which is the given name: the surname is a useless feature and should not be extracted. In addition, some characters (zhuang, yan, jian, qiang) are used mostly in male names, some (li, yan, bing, xue) mostly in female names, and some (wen, hai, bao, yu) in both. To express names in a form the computer can understand, the most obvious features are whether a name contains each of these characters. In an expert system, we would program explicitly:
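The original snippet is not reproduced here; a minimal sketch of what such hand-written rules might look like is shown below (the character sets merely mirror the examples above, and the concrete Chinese characters chosen for them are our assumption):

# A hand-written rule sketch (illustrative only, not the book's original code).
MALE_CHARS = {'壮', '雁', '健', '强'}      # zhuang, yan, jian, qiang (assumed)
FEMALE_CHARS = {'丽', '艳', '冰', '雪'}    # li, yan, bing, xue (assumed)

def rule_based_gender(given_name):
    if any(c in MALE_CHARS for c in given_name):    # a "male" character wins first
        return 'male'
    if any(c in FEMALE_CHARS for c in given_name):
        return 'female'
    return 'unknown'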
But what about someone named “Shen Yanbing”①? “Yan” sounds male, while “bing” sounds female, yet it is in fact a male name. It seems that each character is not equally associated with the two genders: “yan” appears more strongly tied to males than “bing” is to females. Such conflicts could be resolved by “priorities”, but let the machine do that work. In machine learning, a “priority” can be viewed as a feature weight or model parameter; we only need to define a set of features and let the algorithm determine their weights from data. To make them easy for computers to process, features are represented numerically, a process called feature extraction. Take the feature extraction of “Shen Yanbing” as an example, as shown in Table 1-4.
The number of features depends on the question. Obviously, two features are not enough to infer the gender of the name. We can increase it to four, as shown in Table 1-5.
① The writer Mao Dun, whose original name was Shen Dehong and courtesy name Yanbing, is also known as “Shen Yanbing”.
Sometimes we can add positional information to a feature, such as “does the name end with xue (snow)?”. We can also combine two features into a new one, such as “does the name end with xue and is the second-to-last character chui (blow)?”, so that the special name “Ximen Chuixue” receives special treatment and is not confused with “Xiao Xue” or “Lu Xueqi”.
In engineering, we do not write features out one by one; instead we define a set of templates to extract them. For example, if the given name is name, we define templates such as name[1], name[2] and name[1]+name[2]. As long as we traverse enough names, the features these templates can generate are basically covered. Such a template for automatic feature extraction is called a feature template.
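As a rough sketch of what such templates produce (the template names and output format below are illustrative, not HanLP's):

def extract_features(name):
    # generate features for a given name from simple character templates
    feats = ['char[%d]=%s' % (i, c) for i, c in enumerate(name)]    # single-character templates
    feats.append('last=%s' % name[-1])                              # positional template
    if len(name) >= 2:
        feats.append('bigram=%s%s' % (name[-2], name[-1]))          # combined template
    return feats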
How to select features and how to design feature templates is called feature engineering. The more features, the more parameters; the more parameters, the more complex the model. The complexity of the model should match the data set, which is introduced in the next section along our recursive learning route.
1.4.4 Data sets
How do we get the machine to learn automatically, in order to obtain the parameters of the model? First we need a problem set. Many problems cannot be solved directly by hand-written algorithms (rules); for gender identification of names, for instance, we cannot state explicitly what kind of name is male. So we prepare a large number of examples (names x and their corresponding genders y) as a problem set, hoping the machine will automatically learn the regularities of Chinese names from it. Such an “example” is usually called a sample.
This problem set, called a data set in machine learning and a corpus in natural language processing, will be covered in more detail in Section 1.5. The types of data sets are very large and vary from task to task. Table 1-6 contains some commonly used data sets.
When working with a data set, we must consider not only its size and annotation quality but also its license. Most data sets are not commercially usable, and data sets in many niche fields are scarce; in such cases we can consider annotating our own.
1.4.5 Supervised learning
If the problem set comes with standard answers y, the learning algorithm is called supervised learning. A supervised learning algorithm lets the machine work through the problems, compares its answers with the standard ones, and corrects the model where it errs. In most cases a single pass does not reduce the error enough, so learning and adjustment must be repeated; the algorithm is then an iterative algorithm, and each pass of learning is called an iteration. Supervised learning is called “教師あり学習” in Japanese, meaning “learning with a teacher”: by providing standard answers, humans point out the model's errors and act as the teacher.
This process of iterative learning on a labeled data set is called training, and the data set used for training is called the training set. The result of training is a set of parameters (feature weights), that is, a model. Using the model, we can compute a value for any name: if it is non-negative the name is male, otherwise female. This process is called prediction.
To sum up, the supervised learning process is shown in Figure 1-5.
In the case of gender identification:
● Unstructured data are many names like “Shen Yanbing” and “Ding Ling”;
● After manual labeling, a labeled data set containing many samples similar to “Shen Yanbing = male” and “Ding Ling = female” was obtained;
● Then a model is obtained through training algorithm;
● Finally, using this model, we can predict the gender of any name, such as “Lu Xueqi”.
The names to be predicted may not appear in the data set, but as long as the sample size is sufficient and balanced between male and female, the feature templates are properly designed, and the algorithm is implemented correctly, we can still expect a fairly high accuracy.
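As a concrete, if simplified, illustration of this work-compare-correct loop, here is a perceptron-style training sketch; the perceptron is introduced properly in later chapters, and this sketch assumes the names have already been converted into feature vectors x with gold labels y (+1 for male, -1 for female):

import numpy as np

def train(samples, n_features, epochs=10):
    w, b = np.zeros(n_features), 0.0
    for _ in range(epochs):                    # each pass over the data is one iteration
        for x, y in samples:                   # y is the gold answer: +1 (male) or -1 (female)
            if y * (np.dot(w, x) + b) <= 0:    # the model disagrees with the gold answer
                w += y * x                     # correct the model on its error
                b += y
    return w, b                                # the learned parameters, i.e. the model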
In addition, the annotated data in Figure 1-5 is in fact structured data. Because it costs human labor to produce, it is sometimes called “gold data”: unlike the model's predictions, which contain some error, it is regarded as correct.
Starting with Chapter 3, this book will detail some useful supervised learning methods in NLP.
1.4.6 Unsupervised learning
Can a machine still learn if we only give it the questions and don’t give it the reference answers?
Yes. This kind of learning is called unsupervised learning, and a problem set without standard answers is called an unlabeled data set. Unsupervised learning is called “教師なし学習” in Japanese, meaning learning without a teacher. Without a teacher's guidance, the machine can only discover associations between samples; it cannot learn the association between samples and answers.
Unsupervised learning is generally used for clustering and dimensionality reduction, neither of which requires annotation data.
Clustering was covered in Section 1.2 and will not be covered again. In the case of gender identification, if we choose to cluster a series of names into two clusters, “Zhou Shuren” and “Zhou Liren” are likely to be in one cluster, and “Lu Xueqi” and “Cao Xueqin” are likely to be in the other cluster. This is determined by the similarity between the samples and the granularity of the clusters, but we don’t know which clusters represent males and which clusters represent females, and they are not necessarily distinguishable by the naked eye.
Dimensionality reduction refers to transforming sample points from a high-dimensional space into a lower-dimensional one. High-dimensional data is everywhere in machine learning; in the gender recognition example, the number of features easily exceeds 2,000 if commonly used Chinese characters are taken as features. If a sample has n features, it corresponds to a point in (n+1)-dimensional space, the extra dimension being the dependent variable of the hypothesis function. If we want to visualize these sample points, we must reduce the dimensionality to two or three. The central idea of many dimensionality reduction algorithms is to minimize the information lost in the reduction, or equivalently to make the variance of the samples in each remaining dimension as large as possible. Consider an extreme case: steel pipes of equal length are planted vertically in flat ground; reducing the dimension of the pipe tops onto the two-dimensional ground gives the holes left when the pipes are removed. The pipes' lengths in the vertical dimension are all the same, carrying no useful information, so that dimension is discarded.
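The idea of keeping the directions with the largest variance can be sketched with a PCA-style projection; PCA is named here only as one example of a dimensionality reduction method and is not covered in this book:

import numpy as np

def reduce_dim(X, k=2):
    # project samples (rows of X) onto the k directions of largest variance
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)   # rows of Vt: directions ordered by variance
    return X_centered @ Vt[:k].T                                # keep only the top-k dimensions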
Some unsupervised methods can also be used to drive Chinese word segmentation, part-of-speech tagging, and syntactic analysis. Because of the wealth of unstructured data stored on the Internet, unsupervised learning is tempting. However, in the case of unsupervised learning, there is no information exchange between the model and the user. As a result, the model cannot capture the user’s standard due to the lack of supervised signals, and the final prediction result is often far from the ideal answer in the user’s mind. At present, the accuracy of unsupervised learning NLP tasks is always several to dozens of percentage points lower than that of supervised learning, which cannot meet the production requirements.
This book will introduce the principle and implementation of clustering algorithm in detail in Chapter 10.
1.4.7 Other types of machine learning algorithms
If we train multiple models and have them predict the same instance, we get multiple results. If most of these results agree, we can take the instance together with the agreed result as a new training sample to expand the training set. Such algorithms are called semi-supervised learning①. Because it makes combined use of labeled data and abundant unlabeled data, semi-supervised learning is becoming a hot research topic.
Things in the real world tend to have long causal chains: we have to execute a series of interrelated decisions correctly to get to the end result. Such problems often require both prediction and planning for the next decision based on feedback from the environment. This type of algorithm is called reinforcement learning. Reinforcement learning has been successful in problems involving human-computer interaction, such as autonomous driving, e-sports, and question-and-answer systems.
As a primer, this book does not delve into these frontiers. But knowing that these branches exist helps to build a complete body of knowledge.
① It is called heuristic semi-supervised learning, which is the easiest to understand of all semi-supervised learning methods.
1.5 Corpora
As a data set in the field of natural language processing, corpus is an indispensable problem set for teaching machines to understand language. In this section, we will take a look at the common corpora in Chinese processing and the topic of corpus construction.
1.5.1 Chinese word segmentation corpus
A Chinese word segmentation corpus refers to a set of sentences correctly segmented by humans.
Take the famous 1998 People's Daily corpus as an example. It was annotated by the Institute of Computational Linguistics of Peking University together with Fujitsu Research and Development Center Co., Ltd., with the permission of the News and Information Center of People's Daily, from April 1999 to April 2002. The corpus contains 26 million Chinese characters in total; the commercially available portion covers the first half of 1998 (about 13 million characters, roughly 7.3 million words).
In the second International Chinese Word Segmentation Bakeoff in 2005, about one month's worth of this corpus was released. Here is an example:
先 有 通货膨胀 干扰 , 后 有 通货 紧缩 叫板 。 (There was inflation first, and deflation later.)
Even without any linguistic knowledge, one may ask of this simple annotated corpus: why is “inflation” (通货膨胀) treated as one word, while “deflation” (通货 紧缩) is two? This touches on the standards of corpus annotation and the internal consistency of annotators. We will cover these topics in later chapters; for now, just keep in mind that annotation standards are hard to formulate and hard to enforce.
In fact, although the total amount of Chinese word segmentation corpus is not large, there are many factions. We’ll learn about licensing, downloading, and using these corpora in Chapter 3.
1.5.2 Part-of-speech tagging corpus
A part-of-speech tagging corpus is one that has been segmented and in which every word is assigned a part of speech. In short, we must show the machine examples of what we want to teach it. Taking the People's Daily corpus as an example, the 1998 People's Daily data uses 43 parts of speech in total, a set called the part-of-speech tagset. A sample sentence from the corpus reads:
迈向/v 充满/v 希望/n 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n (/w 附/v 图片/n 1/m 张/q )/w
(Towards a new century full of hope: the 1998 New Year speech, with 1 picture attached)
Here, each word is followed by a slash and its part-of-speech tag; the meaning of each tag will be described in detail in Chapter 7. It is worth noting that in this sentence the part of speech of “希望” (hope) is noun (n), whereas in other sentences “hope” can also be a verb.
1.5.3 Named entity Recognition Corpus
A named entity recognition corpus is one in which the entity nouns and entity categories that its creators care about have been manually annotated. The People's Daily corpus, for example, annotates three kinds of named entities: person names, place names and organization names:
Sahaf/nr said/v ,/w Iraq/ns will/d with/p [UN/nt destroy/v Iraq/ns WMD/n weapons/n special/a commission/n]/nt continue/v maintain/v cooperation/v ./w
The words in bold in this sentence are a person name, a place name and an organization name, and compound entities are wrapped in square brackets. Notice that an organization name sometimes nests a place name to form a longer organization name; such nesting increases the difficulty of named entity recognition.
What a named entity type is depends on what the corpus producer cares about. In chapter 8 of this book, we will demonstrate how to annotate a corpus for identifying fighter names.
1.5.4 Syntactic analysis corpus
The syntactic corpus most commonly used for Chinese is the CTB (Chinese Treebank). Its construction began in 1998, and improved versions have been released continuously with contributions from the University of Pennsylvania, the University of Colorado and Brandeis University. CTB 8.0, for example, contains 3,007 articles from newswire, broadcast and the Internet, totaling 71,369 sentences, 1,620,561 words and 2,589,848 characters. Every sentence carries word segmentation, part-of-speech and syntactic annotation. A visualization of one sentence is shown in Figure 1-6.
In Figure 1-6, the English tag above each Chinese word indicates its part of speech, and an arrow indicates a grammatical relation between two words, with the specific relation given by the label on the arrow. The visualization and use of syntactic corpora will be introduced in Chapter 12.
1.5.5 Text categorization corpus
A text classification corpus is a collection of articles manually labeled with the category each belongs to. Compared with the four corpora introduced above, text classification corpora are obviously much larger. The famous Sogou text classification corpus, for example, contains 10 categories, including automobile, finance, IT, health, sports, tourism, education, recruitment, culture and military; each category contains 8,000 news articles of a few hundred words each.
In addition, the columns on some news websites have been manually sorted by editors, which are highly differentiated from each other and can also be used as a corpus of text classification. Sentiment categorization corpus is a subset of text categorization corpus, which is limited to “positive” and “negative” categories.
If the categories and scales in these corpora do not meet the actual needs, we can also mark them as needed. The process of tagging is essentially organizing many documents into different folders.
1.5.6 Corpus construction
Corpus construction refers to the process of constructing a corpus, which is divided into three stages: specification formulation, personnel training and manual annotation.
Specification formulation refers to linguistic experts analyzing the task and developing a set of annotation specifications, including tagset definitions, samples and implementation methods. In the field of Chinese word segmentation and part-of-speech tagging, the well-known norms include the “Norms for Modern Chinese Corpus Processing — Word Segmentation and Part-of-Speech Tagging” issued by the Institute of Computational Linguistics, Peking University, and the “Norms for Modern Chinese Word Class Tagging for Information Processing” issued by the Standardization Administration of China.
Personnel training refers to training the annotators. Owing to human resource constraints, the people who draft the specification are often not the ones who carry out the annotation. A large corpus usually requires collaborative annotation by many people, and their understanding of the specification must be consistent, otherwise internal conflicts between annotators will arise and hurt the quality of the corpus.
Many annotation tools have been developed for different types of tasks; one of the more mature ones is brat (brat rapid annotation tool)①, which supports part-of-speech tagging, named entity recognition, syntactic parsing and other tasks. brat is a typical B/S architecture: the server is written in Python and the client runs in a browser. Compared with other annotation software, brat's biggest highlight is multi-user collaborative annotation; its drag-and-drop interaction also adds to its appeal.
① See http://brat.nlplab.org/ for details.
1.6 Open Source Tools
At present the open source community has contributed many excellent NLP tools, giving us a variety of choices, such as NLTK (Natural Language Toolkit), CoreNLP developed by Stanford University, LTP (Language Technology Platform) developed by the Harbin Institute of Technology, and HanLP (Han Language Processing) developed by me.
1.6.1 Comparison of mainstream NLP Tools
In choosing a toolkit, we need to consider: functionality, precision, operational efficiency, memory efficiency, scalability, commercial licensing, and community activity. Table 1-7 compares the four major open source NLP toolkits.
As for the development speed of these open source tools, according to the trend of the number of stars on GitHub, HanLP is the fastest growing, as shown in Figure 1-7.
① For a comparison of the performance of HanLP and LTP, see the third-party open source evaluation by @zongwu233: https://github.com/zongwu233/HanLPvsLTP. For a performance comparison of HanLP with jieba, IK, Stanford, Ansj, word and other Java open source segmenters, see the third-party open source evaluation by Alibaba architect Yang Shangchuan: https://github.com/ysc/cws_evaluation. I do not guarantee the accuracy or fairness of third-party open source evaluations, and I do not trust any closed-source evaluation. This book details how to evaluate the accuracy of common NLP tasks in a standardized manner in the relevant chapters.
② The number of stars on GitHub as of August 2019.
In addition, I have studied the principles of other open source projects and borrowed good designs from them. But the code I wrote myself is, after all, what I understand most clearly, so, taking all of the above into consideration, HanLP was finally chosen as the implementation for this book.
1.6.2 Python interface
Thanks to Python’s compact design, calling HanLP from this dynamic language saves a lot of time. Regardless of whether you use Python regularly, it’s recommended to try it out.
HanLP’s Python interface is provided by the PyHanLP package and can be installed with a single command:
$ pip install pyhanlp
This package relies on Java and JPype. If a Windows user encounters the following error:
building '_jpype' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual
C++ Build Tools": http://visualstudio.microsoft.com/visual-cpp-build-tools/Copy the code
You can install either Visual C++ as prompted or the more lightweight Miniconda. Miniconda is an open source distribution of the Python language that provides easier package management. During installation, select the two check boxes shown in Figure 1-8.
Then run the following command:
$ conda install -c conda-forge jpype1
$ pip install pyhanlp
If you encounter Java-related problems:
jpype._jvmfinder.JVMNotFoundException: No JVM shared library file (jvm.dll) found. Try setting up the JAVA_HOME environment variable properly.
Install the Java runtime environment①. The HanLP main project is developed in Java, so a JDK or JRE is required. If other errors occur, please report the problem in the project's discussion area②.
If all goes well, you can verify the installation by running the hanlp command from the command line.
Linux users who run into permission issues may need to execute sudo hanlp, because on its first run pyhanlp automatically downloads HanLP's jar package (containing the algorithms) and data package (containing the models) into pyhanlp's system path.
From the command line, we can easily invoke common functions provided by HanLP without writing code.
① Download it from the official website (http://www.oracle.com/technetwork/java/javase/downloads/index.html); JDK 8 or above is recommended.
② See Github.com/hankcs/HanL...
Run the hanlp segment command to enter the interactive word segmentation mode. If you type a sentence and press Enter, HanLP prints the segmentation result:
$ hanlp segment
Commodity and service
commodity/n and/cc service/vn
When it rains the ground standing water is extremely serious
when/p rainy day/n ground/n standing water/n extremely/d serious/a
Wang Zong and Xiao Li got married
Wang Zong/nr and/cc Xiao Li/nr got married/vi le/ule
Under Linux it is also possible to redirect a string as input:
$ hanlp segment <<< '欢迎新老师生前来就餐'
welcome/v new/a old/a teachers-and-students/n come/vi dine/vi
Note that Windows does not support <<< string redirection; you can only type the string manually.
Part-of-speech tagging is enabled by default; we can disable it:
$ hanlp segment --no-tag <<< '欢迎新老师生前来就餐'
Any platform supports redirecting file input/output, such as storing a novel as input.txt:
$ head input.txt
Chapter 1 Hidden worry Zhang Xiaofan looked at the front of the middle-aged scholar, that is, the right way today's heart of the great evil "ghost king", in the mind a confusion. These days, he had little doubts about his old beliefs from time to time in his heart, which were actually rooted in a conversation at a tea stall at the foot of kongsang Mountain. Now, seeing the old man again, this mood is really complicated, almost let him forget for a moment here and now. But if he did, the guy next to him didn't. Small week stretched out his hand to wipe the blood around his mouth, reluctantly stood up, low voice to Zhang Xiaofan, Tian Linger two people: "this person way too high, can not force enemy, I will hold him, you two people go quickly!" With that, he put out his hand and stuck it upside down in the rock wall. The "Seven Star Sword", which still vibrates slightly till now, broke through the wall with a sound of "zheng" and flew back to his hand. The ghost King looked at Xiao Zhou and nodded with a smile on his face, saying, "With your way of doing things, it seems that among the younger disciples of Qingyun Gate, you will be the first. Can't think of qingyun door in addition to this Zhang Xiaofan, actually have you such a talent, good, good!" Zhang Xiaofan was startled, but found that the teacher sister Tian Linger and the eyes of the small week are aimed over, some heat on the face for a while, but do not know what to say.
With redirection, the novel can be segmented with a single command:
$ hanlp segment < input.txt > output.txt -a crf --no-tag
The -a option specifies the segmentation algorithm, in this case CRF. We will cover this algorithm in more detail in Chapter 6; for now, let's get an intuitive feel for the output of the CRF segmenter:
Chapter one hidden worry Zhang Xiaofan looked at the front of the middle-aged scholar, that is, the right way of today's bosom big trouble "ghost king", a confusion in the mind. These days, he had little doubts about his old beliefs from time to time in his heart, which were actually rooted in a conversation at a tea stall at the foot of Kongsang Mountain. Now, seeing the old man again, this feeling is really complicated, almost let him forget the here and now situation. But if he did, the guy next to him didn't. Xiao Zhou stretched out his hand to wipe the blood around his mouth, reluctantly stood up, low voice to Zhang Xiaofan, Tian Linger two people: "this person way too high, can not force enemy, I will hold him, you two people go quickly!" With that, he put out his hand and stuck it upside down in the rock wall. The seven-star sword, which still vibrates slightly, broke through the wall with a sound of "zheng" and flew back to his hand. The ghost King looked at Xiao Zhou and nodded with a smile on his face, saying, "With your way of doing things, it seems that among the younger disciples of Qingyun Gate, you will be the first. Can't think of Qingyun Gate in addition to this Zhang Xiaofan, actually have you such a talent, good, good!" Zhang Xiaofan was startled, but found that the teacher sister Tian Linger and the eyes of Xiao Zhou are aimed over, some heat on the face for a while, but do not know what to say.
The result seems acceptable: words such as "Ghost King", "Kongsang Mountain", "Seven Star Sword" and "Qingyun Gate" are correctly cut out. But there is still something to be desired. For example, why are "here and now" and "startled" each treated as a single word? Are these criteria set by the author of the segmenter? We will discuss such questions in the following chapters.
The same goes for dependency parsing, which is likewise just one command:
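A sketch of what this looks like, assuming pyhanlp's parse sub-command (it prints a CoNLL-style dependency tree; output omitted here):
$ hanlp parse <<< '徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标。'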
These commands also support a number of other parameters, which can be listed with the --help option:
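For example (output omitted):
$ hanlp segment --help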
After this initial experience with HanLP, let's look at how to call HanLP's common interfaces from Python. Here is a broad but not exhaustive example:
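The following is a minimal sketch along those lines; it assumes only the HanLP utility class that pyhanlp exposes and a few of its documented static methods, with illustrative sample sentences:

from pyhanlp import *

# Word segmentation via the HanLP utility class (the same class used from Java)
print(HanLP.segment('你好,欢迎在Python中调用HanLP的API'))

# Keyword extraction and automatic summarization on a short document
document = "自然语言处理是人工智能与语言学的交叉学科,研究如何让计算机处理并理解人类语言。"
print(HanLP.extractKeyword(document, 2))   # top 2 keywords
print(HanLP.extractSummary(document, 1))   # 1-sentence summary

# Dependency parsing (prints a CoNLL-style tree)
print(HanLP.parseDependency('徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标。'))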
HanLP's common functions can all be invoked through the utility class HanLP without creating an instance. For a more comprehensive introduction to the other features, refer to the demos directory on GitHub: https://github.com/hankcs/pyhanlp/tree/master/tests/demos.
1.6.3 Java interfaces
Java users can easily pull in the HanLP library via Maven by adding the following dependency to the project's pom.xml:
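A typical dependency block is sketched below; the coordinates are the ones the project publishes on Maven Central, and the version shown simply matches the 1.7.5 release referred to later in this section, so check the releases page for the current number:

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.7.5</version>
</dependency>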
In addition, you can visit https://github.com/hankcs/HanLP/releases for the latest version.
A single line of code is then enough to call HanLP:
System.out.println(HanLP.segment("Hello, welcome to HanLP Chinese processing package!"));
The commonly used APIs are likewise wrapped in the HanLP utility class; you can learn about their usage at https://github.com/hankcs/HanLP, or simply get familiar with these features as this book explains them.
HanLP's data is separated from the program. To reduce the size of the JAR package, the portable version contains only a small amount of data. For some advanced functions (CRF word segmentation, dependency parsing, and so on), additional data packages need to be downloaded, and HanLP must be told where they are through a configuration file.
If you have installed pyhanlp, the data package and configuration file are already in place, and we can obtain their paths with the following command:
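One way to locate them (a sketch that relies only on the package's install location rather than any particular pyhanlp command) is to print the static directory, which holds both the data and hanlp.properties:
$ python -c "import os, pyhanlp; static = os.path.join(os.path.dirname(pyhanlp.__file__), 'static'); print(static); print(os.path.join(static, 'hanlp.properties'))"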
The last line, hanlp.properties, is the required configuration file; simply copy it into the project's resources directory, src/main/resources (create the directory manually if it does not exist). HanLP will then load its data from /usr/local/lib/python3.6/site-packages/pyhanlp/static, that is, it shares the same data package with pyhanlp.
If you have not installed pyhanlp, or want to use a standalone copy of the data, that is not difficult either. Just visit the project homepage https://github.com/hankcs/HanLP, download data.zip and unzip it to a directory such as D:/hanlp. Then download and unzip hanlp-1.7.5-release.zip and set root, the first line of hanlp.properties, to the parent directory of the data folder:
root=D:/hanlp
Note that Windows users should also use the forward slash (/) as the path separator. The Windows default "\" conflicts with the escape character of most programming languages; for example, the "\n" in "D:\nlp" is interpreted as a newline by Java and Python, causing problems.
Finally, move hanlp.properties into your project's resources directory.
Since this book will delve into the internal implementation of HanLP, it is also recommended that you fork and clone a copy of the source code on GitHub. The file structure in the repository is as follows:
Because of file-size limits, the repository does not contain the full model folder, so users still need to download the data package and configuration file. This can be done automatically or manually: the book's companion Java code downloads and unpacks them automatically when run, but you can also do it by hand. As mentioned earlier, create the resources directory and put hanlp.properties into it, then place the downloaded data/model under the corresponding directory of the repository. The resulting layout is as follows:
Next, we can run the book's companion code (located under src/test/java/com/hankcs/book). Let's start with a HelloWord (see ch01/HelloWord.java):
HanLP.Config.enableDebug();                           // print intermediate results to the console
System.out.println(HanLP.segment("王国维和服务员"));    // "Wang Guowei and the waiter"
Run it and you get output similar to the following:
There are two differences from the previous example.
● We turned on debug mode, which prints the intermediate results of the run to the console.
● We are running the GitHub repository edition, in which dictionaries and models are in text form. Dictionaries in HanLP generally come in both textual and binary forms, and their relationship is similar to that of source code and compiled program. When the binary does not exist, HanLP loads the text dictionary and automatically caches a binary with the same name. Binaries load much faster than text, usually about five times faster. In the example above, for instance, it took 341 ms to load the text, but only 64 ms to load the corresponding binary when run again. Through this caching mechanism and an internal rewrite of the IO interface, HanLP keeps the system's cold start within a few hundred milliseconds, which is a great convenience when debugging repeatedly.
Looking at the debug output, the run consists of two stages: coarse segmentation and fine segmentation. The coarse-segmentation result is [kingdom/n, peacekeeping/vn, waiter/nnt], which is obviously unreasonable①; the sentence should not be read that way. So in the fine-segmentation stage, the algorithm performs person-name recognition and recalls the word "Wang Guowei". The algorithm then finds [Wang Guowei/nr, and/cc, waiter/nnt] much more fluent and takes it as the final result.
There is a lot of detail inside these algorithms, but for now we have a weapon in hand. The basic structure of each specific weapon, how it is forged, and the scenarios in which to use it will be explained step by step, each chapter building on the last.
1.7 Summary
This chapter presented a bird's-eye view and development timeline of artificial intelligence, machine learning and natural language processing. Machine learning is a subset of artificial intelligence, while natural language processing is the intersection of artificial intelligence, linguistics and computer science. Although this intersection is small, it is very difficult. To realize the ambitious goal of understanding natural language, people first tried rule systems and eventually developed statistical learning systems based on large-scale corpora.
In the following chapters, following this same progression from easy to hard, we will tackle the first NLP problem: Chinese word segmentation. We will start with rule systems, introduce fast but less accurate algorithms, and then move on to more accurate statistical models.
① The reason why it’s “unreasonable” rather than “incorrect” is that we can’t rule out the existence of a maverick kingdom in some fantasy world with a peacekeeping force staffed not by soldiers but by waiters. But the likelihood of that happening is so low that it’s almost impossible.
This article is from Introduction to Natural Language Processing