Machine Learning for Humans, Part 2.2: Supervised Learning II
By Vishal Maini
Translator: Flying Dragon
Protocol: CC BY-NC-SA 4.0
In this part: classification with logistic regression (LR) and support vector machines (SVM).
Classification: predicting labels
Is this email spam? Will the borrower repay their loan? Will the user click on the ad? Who is that person in your Facebook photo?
Classification predicts a discrete target label Y. It is the problem of assigning new observations to the class they most likely belong to, based on a model built from a labeled training set.
The accuracy of your classification depends on the effectiveness of the algorithm you choose, how you apply it, and how much useful training data you have.
Logistic regression: 0 or 1?
As long as you dare, LR will conquer the world.
Logistic regression is a classification method: the model outputs the probability that the target variable Y belongs to a particular class.
A good example of classification is determining whether a loan applicant is fraudulent.
Ultimately, the lender wants to know whether they should give the applicant a loan, and they have some tolerance for the risk that the applicant is in fact fraudulent. Here, the goal of logistic regression is to calculate the probability (from 0% to 100%) that the applicant is a fraud. With these probabilities in hand, we can set a threshold above which we are willing to lend, and below which we reject the application (or flag it for closer review).
While logistic regression is typically used for binary classification, where there are only two classes, keep in mind that a classifier can have any number of classes (for example, assigning the labels 0 through 9 to handwritten digits, or using face recognition to detect which friends are in a Facebook photo).
Can I use ordinary least squares?
No. If you trained a linear regression model on a bunch of examples where Y = 0 or 1, you might end up predicting probabilities less than 0 or greater than 1, which makes no sense. Instead, we use a logistic regression model (or logit model), which is designed to output the probability that "Y belongs to a particular class", ranging from 0% to 100%.
What’s the math?
Note: The math in this section is interesting, but on the technical side. Feel free to skip it if you're not interested in the details.
Logistic regression is a modification of linear regression: it applies the sigmoid function to make sure the output is a probability between 0 and 1. If you plotted it, it would look like an S-shaped curve, as you'll see later.
The sigmoid function, which squashes values into the range between 0 and 1.
Recall the original form of our simple linear regression model, which we'll now call g(x) because we intend to use it inside a composite function: g(x) = β0 + β1x.
Now, to solve the problem of the model outputting values less than 0 or greater than 1, we define a new function F(g(x)) that transforms g(x) by squashing the output of the regression into the interval [0, 1]. Can you think of a function that does that?
Did you think of the sigmoid function? That’s great! That’s right!
So we plug g(x) into the sigmoid function, obtaining a function of our original function (yes, we're composing functions now) that outputs a probability between 0 and 1: F(g(x)) = 1 / (1 + e^(-g(x))).
In other words, we are calculating the probability that the training example belongs to a particular class: P(Y=1).
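If it helps to see this composition in code, here is a minimal Python sketch (not from the original article; the coefficients β0 = -3 and β1 = 0.5 are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def g(x, beta_0=-3.0, beta_1=0.5):
    """The underlying linear model; these coefficients are purely illustrative."""
    return beta_0 + beta_1 * x

def predict_proba(x):
    """P(Y = 1 | x): the linear output passed through the sigmoid."""
    return sigmoid(g(x))

print(predict_proba(np.array([0.0, 6.0, 12.0])))  # all outputs land between 0 and 1
```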
Here we've isolated p, the probability that Y = 1, on the left-hand side of the equation. If we want the tidy β0 + β1x + ϵ on the right-hand side, so that we can directly interpret the β parameters we learn, what ends up on the left is the log-odds ratio, or logit: ln(p / (1 - p)) = β0 + β1x + ϵ. This is where the name "logit model" comes from.
The log-odds ratio is simply the natural log of the odds ratio, p / (1 - p), which crops up in everyday conversation:
A: What do you think the chances are of the Imp dying on Game of Thrones this season?
B: Hmm… he's twice as likely to die as not. The odds are two to one. Sure, he's too important to be killed off, but we all saw what they did to Ned Stark…
Note that in the logit model, β1 represents the rate of change in the log-odds as X changes. In other words, it's the slope of the log-odds, not the slope of the probability itself.
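To make the "slope of the log-odds" point concrete, here's a tiny numerical check (reusing the made-up coefficients from the sketch above): each unit increase in x adds β1 to the log-odds, which multiplies the odds by e^β1.

```python
import numpy as np

beta_0, beta_1 = -3.0, 0.5   # same illustrative coefficients as before

def log_odds(x):
    return beta_0 + beta_1 * x

for x in (2.0, 3.0):
    p = 1.0 / (1.0 + np.exp(-log_odds(x)))
    print(f"x={x}: log-odds={log_odds(x):.2f}, odds={p / (1 - p):.2f}, p={p:.2f}")

# The log-odds grows by exactly beta_1 = 0.5 per unit of x,
# so the odds are multiplied by e**0.5 ≈ 1.65 each time.
```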
Log-odds may be a little unintuitive, but it's worth understanding, because it comes up again when you interpret the output of neural networks performing classification tasks.
Using the output of a logistic regression model to make decisions
The output of a logistic regression model looks like an S-shaped curve showing P(Y=1) based on the value of X.
In order to predict the label Y (spam or not, cancer or not, fraud or not, and so on), you need to set a probability cutoff, or threshold. For example, if the model thinks the probability that an email is spam is greater than 70%, label it spam; otherwise, label it not spam.
This threshold depends on your tolerance for false positives and false negatives. If you're diagnosing cancer, you have a very low tolerance for false negatives, because even if a patient has only a tiny chance of having cancer, you want further tests to confirm it. So you set a very low threshold for a positive result.
In the case of fraudulent loan applications, on the other hand, the tolerance for false positives is higher, especially for small loans, because further review is expensive: small loans aren't worth the additional operating cost, and the extra friction is a barrier for non-fraudulent applicants whose loans are held up for further processing.
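Here is a minimal sketch of how a threshold turns predicted probabilities into decisions; the cutoff values and probabilities below are invented purely for illustration:

```python
import numpy as np

def classify(probabilities, threshold):
    """Label 1 wherever P(Y=1) exceeds the threshold, 0 otherwise."""
    return (np.asarray(probabilities) > threshold).astype(int)

# Spam: a strict 0.7 cutoff keeps false positives (good mail marked as spam) rare.
print(classify([0.95, 0.40, 0.72], threshold=0.7))    # -> [1 0 1]

# Cancer screening: a lenient 0.01 cutoff keeps false negatives (missed cases) rare.
print(classify([0.02, 0.40, 0.005], threshold=0.01))  # -> [1 1 0]
```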
Minimizing loss in logistic regression
As in the linear regression example, we use gradient descent to learn the β parameters that minimize the loss.
In logistic regression, the cost function is essentially a measure of how often you predict 1 when the true answer is 0, or vice versa. Below is the regularized cost function, just as we had for linear regression.
When you see a long formula like this, don't panic. Break it into small pieces and think about what each piece means conceptually. Then it will make sense.
The first part is the data loss, that is, the difference between the model's predictions and the actual values. The second part is the regularization loss, which penalizes large parameters that place too much weight on certain features (remember, this prevents overfitting).
We minimize this loss function with gradient descent, just as before, and we end up with a logistic regression model that predicts classes as accurately as possible.
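For the curious, here is a toy Python sketch of that loss (cross-entropy data loss plus an L2 regularization term) minimized by gradient descent. It illustrates the idea rather than reproducing the article's formula exactly; the learning rate, regularization strength, and data are all made up:

```python
import numpy as np

def predict_proba(X, beta):
    return 1.0 / (1.0 + np.exp(-X @ beta))

def loss(X, y, beta, lam=0.1):
    """Data loss (cross-entropy) plus regularization loss (L2 penalty on the weights)."""
    p = predict_proba(X, beta)
    data_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    reg_loss = lam * np.sum(beta[1:] ** 2)   # leave the intercept unpenalized
    return data_loss + reg_loss

def gradient_descent(X, y, lr=0.1, lam=0.1, steps=2000):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = predict_proba(X, beta)
        grad = X.T @ (p - y) / len(y)        # gradient of the data loss
        grad[1:] += 2 * lam * beta[1:]       # gradient of the regularization loss
        beta -= lr * grad
    return beta

# Toy data: a column of ones for the intercept plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = gradient_descent(X, y)
print("learned betas:", beta, "final loss:", loss(X, y, beta))
```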
Support vector machines
Again, we’re in a room full of marbles. Why are we always in a room full of marbles? I could have sworn I had thrown them away.
SVM is the last parametric model we'll cover. It typically solves the same binary classification problem as logistic regression and produces similar performance. It's worth understanding because the algorithm is geometrically motivated at its core, rather than probabilistic.
Examples of some problems that SVM can solve:
- Is this picture a cat or a dog?
- Is the review positive or negative?
- Is a point on a 2D plot red or blue?
Let's use the third example to show how SVM works. Problems like these are called toy problems because they aren't real. But nothing is real, so it doesn't matter.
In this example, we have points in two-dimensional space that are either red or blue, and we'd like to cleanly separate them.
The training set is plotted in the image above. We want to classify new, unlabeled points on this plane. To do this, SVM uses a separating line (or, in more than two dimensions, a separating hyperplane) to split the space into a red zone and a blue zone. You can already imagine roughly what such a line might look like in the figure above.
So, more specifically, how do we pick where to draw this line?
Here are two examples of this line:
These charts were made with Microsoft Paint, which was deprecated a few weeks ago after an incredible 32 years. R.I.P. Paint 🙁
Hopefully you share the intuition that the first line is better. The distance between the line and the nearest point on either side is called the margin, and SVM tries to maximize the margin. You can think of it as a safety buffer: the larger the margin, the less likely a noisy point is to be misclassified.
Based on this brief explanation, a couple of big questions come up.
(1) What is the mathematics behind it?
We want to find the optimal hyperplane (a straight line, in our two-dimensional example). The hyperplane needs to (1) separate the data cleanly, with the blue points on one side and the red points on the other, and (2) maximize the margin. This is an optimization problem: the solution must maximize the margin as required by (2) while satisfying the constraint in (1).
The human version of solving this problem is to take a ruler and keep trying different lines separating all the points until you find the one with the largest margin.
It turns out there is a mathematical way to solve this maximization problem, but it's beyond our scope. To explain it further, here is a video lecture that shows how it works using Lagrangian optimization.
The hyperplane you end up solving for is defined in relation to particular x_i's, which are called the support vectors; they are usually the points closest to the hyperplane.
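If you'd rather skip the Lagrangian math and just see the result of this optimization, here is a sketch using scikit-learn (my choice of library, not the article's; the points below are made up). A very large C approximates the hard-margin problem described above:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up red (0) and blue (1) points in 2D.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("hyperplane:", clf.coef_, clf.intercept_)
print("support vectors:", clf.support_vectors_)  # the points closest to the hyperplane
```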
(2) What happens if you can’t cleanly separate data?
There are two ways to approach this problem.
2.1 Soften the definition of “separation”
We allow a few mistakes, meaning we allow some blue points in the red zone, or some red points in the blue zone. To do this, we add a cost C for misclassified examples to the loss function. Basically we're saying that misclassification is acceptable, but it comes at some cost.
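In library terms, this cost is usually exposed as the parameter C. A quick sketch (again assuming scikit-learn, with invented points, one of which sits awkwardly close to the other class) of how shrinking C trades misclassification for a wider margin:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4],   # a "red" point near the blue cluster
              [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 0, 1, 1, 1])

for C in (100.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    # Large C: misclassification is expensive, so the margin stays narrow.
    # Small C: some mistakes are tolerated in exchange for a wider margin.
    print(f"C={C}: margin width = {margin:.2f}, training accuracy = {clf.score(X, y):.2f}")
```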
2.2 Put data in higher dimensions
We can create nonlinear classifiers by increasing the number of dimensions, that is, by including x^2, x^3, or even cos(x), and so on. Suddenly you have a decision boundary that looks curvy when you bring it back down to the lower-dimensional representation.
Essentially, it's as if the red and blue marbles are all on the ground and can't be separated by a straight line. But if you could lift all the red marbles off the ground, as in the figure on the right, you could draw a flat plane separating them. Then you let them fall back to the ground, and you know where the boundary between red and blue lies.
A non-linearly separable dataset in two-dimensional space R^2, and the same dataset mapped into three dimensions, where the third dimension is x^2 + y^2 (source: www.eric-kim.net/eric-kim-ne…).
The decision boundary is shown in green, first in three dimensions (left), then in two dimensions (right). Same source as the previous image.
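Here is a sketch of the "lift the marbles off the ground" trick using the same x^2 + y^2 feature as the figure above (scikit-learn again, with synthetic circular data generated only for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Blue points inside a disc, red points in a ring around them:
# no straight line in 2D can separate these two classes.
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Add a third feature, x^2 + y^2, and the classes become linearly separable.
X_lifted = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

print("linear SVM in 2D:", SVC(kernel="linear").fit(X, y).score(X, y))
print("linear SVM in 3D:", SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y))  # ~1.0
```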
In short, SVMs are used for binary classification. They try to find a plane that cleanly separates the two classes. If that isn't possible, we can either soften the definition of "separate" or project the data into higher dimensions so that it can be separated cleanly.
Ok!
In this section we covered:
- Classification, a supervised learning task
- Two foundational classification methods: logistic regression (LR) and support vector machines (SVM)
- Common concepts: the sigmoid function, log-odds, and false positives vs. false negatives
In "2.3: Supervised Learning III", we'll dig into nonparametric supervised learning, where the ideas behind the algorithms are intuitive and they perform well on certain kinds of problems, but the models can be hard to interpret.
Practice materials and further reading
2.2a Logistic regression
Data School has an excellent in-depth guide to logistic regression. We also continue to recommend An Introduction to Statistical Learning; see Chapter 4 on logistic regression and Chapter 9 on support vector machines.
To practice logistic regression, we recommend tackling this problem set. You'll need to register on the site to work through it. Unfortunately, that's life.
2.2b Diving deeper into SVM
To dive into the math behind SVM, watch Professor Patrick Winston's lecture from MIT 6.034: Artificial Intelligence, and check out this tutorial to work through a Python implementation.