1 Introduction

This chapter is a preliminary introduction to machine learning, including an overview of some of the main learning tasks and applications, basic definitions and terms, and a discussion of some general scenarios.

1.1 What is machine learning?

Machine learning can be broadly defined as computational methods that use experience to improve performance or to make accurate predictions. Here, experience refers to the past information available to the learner, which typically takes the form of electronic data collected and made available for analysis. This data could be digitized, human-labeled training sets, or other types of information obtained through interaction with the environment. In all cases, its quality and size are crucial to the success of the predictions made by the learner.

An example of a learning problem is how to use a finite sample of randomly selected documents, each labeled with a topic, to accurately predict the topic of unseen documents. Clearly, the larger the sample, the easier the task. But the difficulty of the task also depends on the quality of the labels assigned to the documents in the sample, since they may not all be correct, and on the number of possible topics.

Machine learning consists of designing efficient and accurate prediction algorithms. As in other areas of computer science, key measures of the quality of these algorithms are their time and space complexity. But in machine learning, we additionally need the notion of sample complexity to evaluate the sample size required for the algorithm to learn a family of concepts. More generally, theoretical learning guarantees for an algorithm depend on the complexity of the concept classes considered and the size of the training sample.

Since the success of a learning algorithm depends on the data used, machine learning is inherently related to data analysis and statistics. More generally, learning techniques are data-driven methods combining fundamental concepts in computer science with ideas from statistics, probability, and optimization.

1.2 What kind of problems can be solved with machine learning?

Predicting the topic of a document, a task also known as document classification, is by no means the only learning task. Machine learning has a wide range of practical applications, including:

  • Text or document classification. This includes problems such as assigning a topic to a text or document, or automatically determining whether the content of a web page is inappropriate or too explicit; it also includes spam detection.
  • Natural Language Processing (NLP). Most tasks in this field, including part-of-speech tagging, named-entity recognition, context-free parsing, or dependency parsing, are cast as learning problems. In these problems, the predictions admit some structure. For example, in part-of-speech tagging, the prediction for a sentence is the sequence of tags assigned to its words; in context-free parsing, the prediction is a tree. These are examples of richer learning problems known as structured prediction problems.
  • Speech processing applications. This includes speech recognition, speech synthesis, speaker verification, speaker identification, as well as subproblems such as language modeling and acoustic modeling.
  • Computer vision applications. These include object recognition, object verification, face detection, optical character recognition (OCR), content-based image retrieval or pose estimation.
  • Computational biology applications. This includes prediction of protein function, identification of key sites, or analysis of gene and protein networks.
  • Many other problems, such as fraud detection for credit card, telephone, or insurance companies, network intrusion detection, learning to play games such as chess, backgammon, or Go, unassisted control of vehicles such as robots or cars, medical diagnosis, and the design of recommendation systems, search engines, or information extraction systems, are tackled using machine learning techniques.

This list is by no means comprehensive. Most prediction problems found in practice can be cast as learning problems, and the range of practical applications of machine learning keeps expanding. The algorithms and techniques discussed in this book can be used to derive solutions for all of these problems, although we will not discuss these applications in detail.

1.3 Some standard learning tasks

Here are some standard machine learning tasks that have been studied extensively:

  • Classification: This is the problem of assigning a category to each item. For example, document classification consists of assigning a category such as politics, business, sports, or weather to each document, while image classification consists of assigning a category such as car, train, or plane to each image. The number of categories in such tasks is often less than a few hundred, but it can be much larger in some difficult tasks and even unbounded, as in optical character recognition, text classification, or speech recognition.
  • Regression: This is the problem of predicting a real value for each item. Examples of regression include predicting the value of stocks or the variation of economic variables. In regression, the penalty for an incorrect prediction depends on the magnitude of the difference between the true and predicted values, in contrast with classification problems, where there is typically no notion of closeness between different categories. (A short code sketch contrasting classification and regression follows this list.)
  • Ranking: This is the problem of learning to order items according to some criterion. Web search, such as returning web pages relevant to a search query, is a canonical ranking example. Many similar ranking problems arise within the design of information extraction or natural language processing systems.
  • Clustering: This is the problem of partitioning a set of items into homogeneous subsets. Clustering is often used to analyze very large data sets. For example, in the context of social network analysis, clustering algorithms attempt to identify natural communities within large groups of people.
  • Dimension reduction or manifold learning: This involves transforming the initial representation of an item into a lower-dimensional representation while preserving some properties of the original representation. A common example is preprocessing digital images in computer vision tasks.
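To make the distinction between classification and regression concrete, here is a minimal sketch on synthetic data; the data, the scikit-learn models used (LogisticRegression, LinearRegression), and all numbers are illustrative assumptions rather than part of the text.

```python
# Classification assigns each item a discrete category; regression assigns a real value.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 2)                                   # synthetic feature vectors

# Classification: each item receives one of two categories.
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print("predicted categories:", clf.predict(X[:3]))      # discrete labels (0 or 1)

# Regression: each item receives a real value; an incorrect prediction is
# penalized according to how far it is from the true value.
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)
reg = LinearRegression().fit(X, y_reg)
print("predicted real values:", reg.predict(X[:3]).round(2))
```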

The main practical goal of machine learning is to generate accurate predictions for items that have never been seen before, and to design efficient and robust algorithms to generate these predictions, even for large-scale problems. To this end, many algorithmic and theoretical problems arise. Some basic questions include: Which families of concepts can actually be learned, and under what conditions? To what extent can these concepts be learned computationally?

1.4 Learning stages

Here, we will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice, including their different stages. Spam detection is the problem of learning to automatically classify email messages as spam or non-spam. The following is a list of definitions and terminology commonly used in machine learning:

  • Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.
  • Features: The set of attributes, often represented as a vector, associated with an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.
  • Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, such as the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.
  • Hyperparameter: A free parameter that is not determined by the learning algorithm, but is specified as an input to the learning algorithm.
  • Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.5.
  • Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data. The validation sample is used to select appropriate values for the learning algorithm's free parameters (hyperparameters).
  • Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available during the learning stage. In the spam problem, the test sample consists of a collection of email examples for which the learning algorithm must predict labels based on their features. These predictions are then compared with the true labels to measure the performance of the algorithm.
  • Loss function: A function that measures the difference, or loss, between a predicted label and a true label. Denoting the set of all labels by $\mathcal{Y}$ and the set of possible predictions by $\mathcal{Y}'$, a loss function $L$ is a mapping $L\colon \mathcal{Y} \times \mathcal{Y}' \to \mathbb{R}_+$. In most cases, $\mathcal{Y}' = \mathcal{Y}$ and the loss function is bounded, but these conditions do not always hold. Common examples of loss functions include the zero-one (or misclassification) loss defined on $\{-1, +1\} \times \{-1, +1\}$ by $L(y, y') = 1_{y' \neq y}$, and the squared loss defined on $\mathcal{J} \times \mathcal{J}$ by $L(y, y') = (y' - y)^2$, where $\mathcal{J} \subseteq \mathbb{R}$ is typically a bounded interval. (A small code sketch of these two losses follows this list.)
  • Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels $\mathcal{Y}$. In our example, these may be functions mapping email features to $\mathcal{Y} = \{\text{spam}, \text{non-spam}\}$. More generally, hypotheses may be functions mapping features to a different set $\mathcal{Y}'$. They could be functions mapping email feature vectors to real numbers interpreted as scores ($\mathcal{Y}' = \mathbb{R}$), with higher score values being more indicative of spam than lower ones.
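As a small illustration of the two losses named above, here is a sketch with plain Python functions for the zero-one loss and the squared loss; the example values are illustrative.

```python
# Zero-one (misclassification) loss and squared loss, as defined above.
def zero_one_loss(y, y_pred):
    """Zero-one loss on {-1, +1} x {-1, +1}: L(y, y') = 1 if y' != y, else 0."""
    return 1.0 if y_pred != y else 0.0

def squared_loss(y, y_pred):
    """Squared loss on J x J, with J a bounded interval of R: L(y, y') = (y' - y)^2."""
    return (y_pred - y) ** 2

print(zero_one_loss(+1, -1))    # 1.0: the prediction is wrong
print(zero_one_loss(+1, +1))    # 0.0: the prediction is correct
print(squared_loss(0.5, 0.8))   # (0.8 - 0.5)^2, approximately 0.09
```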

We now define the learning stages of our spam problem (see Figure 1.1). We start with a given collection of labeled examples. We first randomly partition the data into a training sample, a validation sample, and a test sample. The size of each of these samples depends on a number of different considerations. For example, the amount of data reserved for validation depends on the number of hyperparameters of the algorithm, which we denote by a vector $\pmb{\Theta}$. Also, when the number of labeled examples is relatively small, the amount of training data is often chosen to be larger than the amount of test data, since the learning performance directly depends on the training sample.

Figure 1.1: Diagram of typical stages of the learning process.

Next, we associate relevant features with the examples. This is a critical step in the design of machine learning solutions. Useful features can effectively guide the learning algorithm, while poor or uninformative ones can be misleading. Although it is critical, the choice of features is largely left to the user. This choice reflects the user's prior knowledge about the learning task, which in practice can have a dramatic effect on the performance results.

Now, we use the features selected to train our learning algorithm $\mathcal{A}$ by tuning the values of its free parameters $\pmb{\Theta}$ (also called hyperparameters). For each value of these parameters, the algorithm selects a different hypothesis out of the hypothesis set. We choose among them the hypothesis resulting in the best performance on the validation sample, with the corresponding hyperparameter values denoted by $\pmb{\Theta}_0$. Finally, using this hypothesis, we predict the labels of the examples in the test sample. The performance of the algorithm is evaluated by using the loss function associated with the task, for example the zero-one loss in our spam detection task, to compare the predicted and true labels. Thus, the performance of an algorithm is of course evaluated based on its test error and not its error on the training sample.
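The following is a minimal sketch of the workflow just described: randomly partitioning the data into training, validation, and test samples, selecting a hyperparameter value on the validation sample, and reporting the test error. It assumes scikit-learn, placeholder synthetic data, and a logistic regression model whose regularization strength C plays the role of the hyperparameter; all of these are illustrative choices rather than part of the text.

```python
# Train/validation/test workflow with hyperparameter selection on the validation sample.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)                                       # placeholder features
y = (X[:, 0] + 0.5 * rng.randn(1000) > 0).astype(int)         # placeholder labels

# Randomly partition the data into training, validation, and test samples.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune the free parameter (here, the regularization strength C) on the
# validation sample and keep the hypothesis with the smallest validation loss.
best_loss, best_model = float("inf"), None
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C).fit(X_train, y_train)
    val_loss = zero_one_loss(y_val, model.predict(X_val))
    if val_loss < best_loss:
        best_loss, best_model = val_loss, model

# The reported performance is the test error, not the training error.
print("test zero-one loss:", zero_one_loss(y_test, best_model.predict(X_test)))
```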

1.5 Learning scenarios

Let’s briefly describe some common machine learning scenarios. These scenarios differ in the type of training data available to the learner, the order and method in which the training data is received, and the test data used to evaluate the learning algorithm.

  • Supervised learning: The learner receives a set of labeled examples as training data and makes predictions for all unseen points. This is the most common scenario, associated with classification, regression, and ranking problems. The spam detection problem discussed in the previous section is an instance of supervised learning.
  • Unsupervised learning: The learner exclusively receives unlabeled training data and makes predictions for all unseen points. Since in general no labeled example is available in this setting, it can be difficult to quantitatively evaluate the performance of the learner. Clustering and dimensionality reduction are examples of unsupervised learning problems.
  • Semi-supervised learning: The learner receives a training sample consisting of both labeled and unlabeled data and makes predictions for all unseen points. Semi-supervised learning is common in settings where unlabeled data is easy to obtain but labels are expensive to acquire. Various types of problems arising in applications, including classification, regression, or ranking tasks, can be framed as instances of semi-supervised learning. The hope is that the distribution of unlabeled data accessible to the learner can help achieve better performance than in the supervised setting. Analyzing the conditions under which this can indeed be achieved is the topic of much modern theoretical and applied machine learning research.
  • Transductive inference: As in the semi-supervised scenario, the learner receives a labeled training sample together with a set of unlabeled test points. However, the objective of transductive inference is to predict labels only for these particular test points. Transductive inference appears to be an easier task and matches the scenario encountered in a variety of modern applications. However, as in the semi-supervised setting, the assumptions under which better performance can be achieved in this setting are research questions that have not been fully resolved.
  • Online learning: In contrast with the previous scenarios, the online scenario involves multiple rounds in which training and testing phases are intermixed. At each round, the learner receives an unlabeled training point, makes a prediction, then receives the true label and incurs a loss. The objective in the online setting is to minimize the cumulative loss over all rounds, or to minimize the regret, that is, the difference between the cumulative loss incurred and that of the best expert in hindsight. Unlike the settings discussed earlier, no distributional assumption is made in online learning; in fact, instances and their labels may be chosen adversarially within this scenario. (A small code sketch of this protocol appears after this list.)
  • Reinforcement learning: The training and testing phases are also intermixed in reinforcement learning. To collect information, the learner actively interacts with the environment, in some cases affecting it, and receives an immediate reward for each action. The objective of the learner is to maximize his reward over a course of actions and iterations with the environment. However, no long-term reward feedback is provided by the environment, and the learner is faced with the exploration versus exploitation dilemma, since he must choose between exploring unknown actions to gain more information and exploiting the information already collected.
  • Active learning: The learner adaptively or interactively collects training examples, typically by querying an oracle to request labels for new points. The goal in active learning is to achieve a performance comparable to that of the standard supervised learning scenario (or passive learning scenario), but with fewer labeled examples. Active learning is often used in applications where labels are expensive to obtain, for example in computational biology applications.
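The sketch below illustrates the online learning protocol described in the list above: at each round the learner predicts, observes the true label, and incurs a zero-one loss, and regret is measured against the best of a fixed set of experts in hindsight. The exponential-weights update, the experts, and the data used here are illustrative assumptions, not a prescribed algorithm from the text.

```python
# Online learning with expert advice: predict, observe the label, incur a loss,
# and track regret against the best expert in hindsight.
import numpy as np

rng = np.random.RandomState(0)
T = 1000                                       # number of rounds
experts = [lambda x: 1, lambda x: int(x > 0.5), lambda x: int(x > 0.2)]
weights = np.ones(len(experts))
eta = 0.5                                      # learning rate of the exponential-weights update

learner_loss, expert_losses = 0.0, np.zeros(len(experts))
for t in range(T):
    x = rng.rand()                             # unlabeled point revealed to the learner
    advice = np.array([e(x) for e in experts])
    p = weights / weights.sum()
    prediction = int(p @ advice >= 0.5)        # weighted-majority prediction
    y = int(x > 0.3)                           # true label revealed (could be chosen adversarially)
    losses = (advice != y).astype(float)       # zero-one loss of each expert
    learner_loss += float(prediction != y)
    expert_losses += losses
    weights *= np.exp(-eta * losses)           # down-weight experts that erred

regret = learner_loss - expert_losses.min()
print(f"cumulative loss: {learner_loss:.0f}, regret vs best expert: {regret:.0f}")
```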

In practice, you may encounter many other intermediate and slightly more complex learning scenarios.

1.6 Generalization

Machine learning is fundamentally about generalization. For example, the standard supervised learning scenario consists of using a finite sample of labeled examples to make accurate predictions about unseen examples. The problem is typically formulated as that of selecting a function out of a hypothesis set, that is, a subset of the family of all functions. The function selected is subsequently used to label all instances, including unseen examples.

How should we choose a hypothesis set? With a rich or complex hypothesis set, the learner can choose a function or predictor that is consistent with the training sample, that is, one that commits no error on the training sample. With a less complex family, incurring some errors on the training sample may be unavoidable. But which choice leads to better generalization? And how should we define the complexity of a hypothesis set?

Figure 1.2 illustrates these two types of solutions: one is a zigzag line that perfectly separates the two groups of blue and red points and is selected from a complex family; the other is a smoother line, chosen from a simpler family, that only partially distinguishes the two groups. As we will see, in general, the best predictor on the training sample may not be the best overall. A predictor chosen from a very complex family can essentially memorize the data, but generalization is distinct from memorization of the training labels.

Figure 1.2: The zigzag line in the left panel is consistent with the blue and red training sample, but it is a complex separation surface that is unlikely to generalize well to unseen data. In contrast, the decision surface in the right panel is simpler; although it misclassifies a few points of the training sample, it is likely to generalize better.

We will see that the trade-off between sample size and complexity plays a key role in generalization. When the sample size is relatively small, choosing from a too complex family may lead to poor generalization, which is also known as overfitting. On the other hand, with a too simple family, it may not be possible to achieve sufficient accuracy, which is known as underfitting. In the coming chapters, we will examine the question of generalization in more detail and seek theoretical guarantees for learning. These guarantees will depend on different notions of complexity that we will discuss thoroughly.
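As a small numerical illustration of this trade-off, the sketch below fits polynomials of increasing degree to a small noisy sample; the data-generating function and the degrees chosen are illustrative assumptions.

```python
# Underfitting vs. overfitting: compare training and test error as model complexity grows.
import numpy as np

rng = np.random.RandomState(0)
x_train = np.sort(rng.rand(15))
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.randn(15)    # small, noisy training sample
x_test = np.sort(rng.rand(200))
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.randn(200)

for degree in [1, 4, 12]:              # too simple, reasonable, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # A degree-1 fit underfits (high error everywhere); a degree-12 fit can nearly
    # memorize the 15 training points yet generalize poorly (low train, high test error).
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```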

2 The PAC learning framework

When designing and analyzing algorithms that learn from examples, several fundamental questions arise: What can be learned efficiently? What is inherently hard to learn? How many examples are needed to learn successfully? Is there a general model of learning? In this chapter, we begin to formalize and address these questions by introducing the Probably Approximately Correct (PAC) learning framework. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution (the sample complexity) and the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts. We first describe and illustrate the PAC framework, and then present some general learning guarantees within this framework when the hypothesis set used is finite, both for the consistent case, where the hypothesis set contains the concept to be learned, and for the opposite, inconsistent case.

2.1 The PAC learning model

We first introduce several definitions and the notation needed to present the PAC model, which will also be used throughout much of this book.

We denote by $\mathcal{X}$ the set of all possible examples or instances. $\mathcal{X}$ is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by $\mathcal{Y}$. For the purpose of this introductory chapter, we limit ourselves to the case where $\mathcal{Y}$ is reduced to two labels, $\mathcal{Y} = \{0, 1\}$, which corresponds to the so-called binary classification. Later chapters extend these results to more general settings.

A concept $c\colon \mathcal{X} \to \mathcal{Y}$ is a mapping from $\mathcal{X}$ to $\mathcal{Y}$. Since $\mathcal{Y} = \{0, 1\}$, we can identify $c$ with the subset of $\mathcal{X}$ over which it takes the value 1. Thus, in the following, we equivalently refer to a concept to learn as a mapping from $\mathcal{X}$ to $\{0, 1\}$, or as a subset of $\mathcal{X}$. For example, a concept may be the set of points inside a triangle or the indicator function of these points; in such cases, we will say in short that the concept to learn is a triangle. A concept class is a set of concepts we may wish to learn and is denoted by $\mathcal{C}$. This could, for example, be the set of all triangles in the plane.

We assume that examples are independently and identically distributed (i.i.d.) according to some fixed but unknown distribution $\mathcal{D}$. The learner considers a fixed set of possible concepts $\mathcal{H}$, called a hypothesis set, which might not necessarily coincide with $\mathcal{C}$. It receives a sample $S = (x_1, \ldots, x_m)$ drawn i.i.d. according to $\mathcal{D}$ as well as the labels $(c(x_1), \ldots, c(x_m))$, which are based on a specific target concept $c \in \mathcal{C}$ to learn. The task is then to use the labeled sample $S$ to select a hypothesis $h_S \in \mathcal{H}$ that has a small generalization error with respect to the concept $c$. The generalization error of a hypothesis $h \in \mathcal{H}$, also referred to as the risk or true error of $h$ (or simply error), is denoted by $R(h)$ and defined as follows.


$$R(h) = \Pr_{x \sim \mathcal{D}}\big[h(x) \neq c(x)\big] = \mathop{\mathbb{E}}_{x \sim \mathcal{D}}\big[1_{h(x) \neq c(x)}\big],$$

where $1_\omega$ denotes the indicator function of the event $\omega$.
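As a small numerical illustration of this definition, the sketch below fixes a simple target concept and hypothesis on $\mathcal{X} = [0, 1]$ with $\mathcal{D}$ uniform and estimates $R(h)$ by Monte Carlo sampling; the concept, hypothesis, and distribution used are illustrative assumptions.

```python
# Estimate the generalization error R(h) = P_{x~D}[h(x) != c(x)] by sampling from D.
import numpy as np

def c(x):   # target concept: indicator of the interval [0.3, 0.7]
    return ((x >= 0.3) & (x <= 0.7)).astype(int)

def h(x):   # hypothesis: indicator of the interval [0.4, 0.8]
    return ((x >= 0.4) & (x <= 0.8)).astype(int)

rng = np.random.RandomState(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)    # i.i.d. sample drawn according to D (uniform)
estimate = np.mean(h(x) != c(x))             # empirical frequency of disagreement

# h and c disagree exactly on [0.3, 0.4) and (0.7, 0.8], so the true R(h) is 0.2.
print(f"Monte Carlo estimate of R(h): {estimate:.3f} (exact value: 0.2)")
```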