Yesterday, I introduced one of the simplest algorithms in machine learning: kNN (the k-nearest neighbors algorithm), and implemented it step by step in Python. For comparison, I also called the kNN algorithm package in Sklearn, which takes only 5 lines of code. Both methods reach the same conclusion: they correctly solve the binary classification problem, namely that the new wine sample belongs to Cabernet Sauvignon.

Link to the previous article:

Hand-writing machine learning in Python: the simplest kNN algorithm (click to read)

While it was nice to call the Sklearn library and solve the problem with a few lines of code, we were working with a black box and didn’t really understand what was going on behind Sklearn. As a beginner, if you skip the principles of an algorithm and jump straight to calling the library, you only learn the surface, which is of little real use.

So let’s take a look at how Sklearn encapsulates the kNN algorithm and implement the same thing ourselves in Python. This will make things much clearer when we call Sklearn’s algorithm packages later.

Let’s review yesterday’s 5 lines of Sklearn kNN code:

from sklearn.neighbors import KNeighborsClassifier

kNN_classifier = KNeighborsClassifier(n_neighbors=3)
kNN_classifier.fit(X_train, y_train)
x_test = x_test.reshape(1, -1)
kNN_classifier.predict(x_test)[0]

The code was explained yesterday, but let me summarize the workflow again with a diagram:

So Sklearn calls almost all machine learning algorithms in the same way: you feed the training data to the chosen algorithm, fit it to obtain a model, feed new data to the model, predict, and output the results, whether the task is classification or regression.
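To make this uniform interface concrete, here is a minimal sketch (assuming the same X_train, y_train and x_test variables as in the snippet above) showing that another Sklearn estimator is called with exactly the same fit/predict steps:

# A minimal sketch of Sklearn's uniform fit/predict interface.
# Assumes X_train, y_train and x_test (already reshaped to (1, -1)) exist,
# as in the 5-line example above.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

for model in (KNeighborsClassifier(n_neighbors=3), LogisticRegression()):
    model.fit(X_train, y_train)       # feed the training data to the algorithm
    print(model.predict(x_test)[0])   # feed new data to the model and predict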

It is worth noting that **kNN is a special algorithm: it needs no training (fit) to build a model, and can predict results for test data directly on the training set.** This is one of the reasons why kNN is the simplest machine learning algorithm.

But in the Sklearn code above, why is there still a fit step? Strictly speaking it isn’t necessary, but Sklearn’s interface is uniform, so to stay consistent with the other algorithms the training set itself is treated as the “model” produced by fit.

As we go through more algorithms, we’ll see that each algorithm has some characteristics that we can summarize and compare.

Organizing yesterday’s handwritten code into a function makes it clear that there is no training process:

import numpy as np
from math import sqrt
from collections import Counter

def kNNClassify(K, X_train, y_train, X_predict):
    # Euclidean distance from the point to be predicted to every training sample
    distances = [sqrt(np.sum((x - X_predict)**2)) for x in X_train]
    sort = np.argsort(distances)
    # labels of the K nearest neighbours
    topK = [y_train[i] for i in sort[:K]]
    votes = Counter(topK)
    # the label with the most votes is the prediction
    y_predict = votes.most_common(1)[0][0]
    return y_predict
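For instance, the function can be called directly on a toy data set (the numbers below are made up purely for illustration) and it returns a prediction with no training step at all:

# Illustrative call of kNNClassify; this toy data is made up for demonstration only
X_demo = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
y_demo = np.array([0, 0, 1, 1])
point = np.array([1.1, 1.0])          # a single point to classify

print(kNNClassify(3, X_demo, y_demo, point))   # expected output: 0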

Next, following this idea, let’s write, from the bottom up, the kNN algorithm as Sklearn encapsulates it, and see how those 5 lines of code work:

import numpy as np
from math import sqrt
from collections import Counter

class kNNClassifier:
    def __init__(self, k):
        self.k = k                 # number of nearest neighbours
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        # kNN has no real training step: simply store the training set
        self._X_train = X_train
        self._y_train = y_train
        return self

First, we rewrite the earlier function as a class named kNNClassifier, because the algorithms in Sklearn are object-oriented, and using a class makes things easier.

If you’re not familiar with classes, check out my previous article:

An understanding of Python classes (click)

In the __init__ function we define three initial variables; k is the number of nearest neighbours we want to use.

self._X_train and self._y_train are prefixed with an underscore _, which marks them as internal private variables: they are only operated on inside the class and should not be changed from the outside.

Then we define a fit function, which fits the kNN model. But since kNN needs no real fitting, we simply store the data set as it is and return the instance itself.

Here we also need to put some constraints on the inputs: first, X_train and y_train must have the same number of rows; second, the number of neighbours k must be legal, i.e. not negative and not larger than the number of samples, otherwise the later calculations will go wrong. Without these checks, an abnormal input would only fail later, and the cause of the error would be hard to find. We enforce the constraints with assert statements:

    def fit(self, X_train, y_train):
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train and y_train must have the same number of samples"
        assert self.k <= X_train.shape[0], \
            "k must be no larger than the number of training samples"
        self._X_train = X_train
        self._y_train = y_train
        return self
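As a quick sanity check (again with made-up toy data), an illegal k now fails loudly at fit time instead of producing a confusing error later:

# Illustrative check of the fit-time asserts; the toy data is made up for demonstration
import numpy as np

X_demo = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
y_demo = np.array([0, 0, 1, 1])

clf = kNNClassifier(20)               # more neighbours than training samples
try:
    clf.fit(X_demo, y_demo)
except AssertionError as e:
    print("fit rejected the input:", e)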

Next, we pass in the sample to be predicted and compute its distance to every training sample. This corresponds to predict in Sklearn and is the core of the algorithm. It is essentially the code we wrote before, so we can reuse it directly, adding a few assert statements to make sure the inputs are reasonable.

    def predict(self, X_predict):
        assert self._X_train is not None, \
            "must fit before predict, so that self._X_train is not None"
        assert self._y_train is not None, \
            "must fit before predict, so that self._y_train is not None"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "X_predict must have the same number of features as the training set"

        distances = [sqrt(np.sum((x_train - X_predict)**2)) for x_train in self._X_train]
        sort = np.argsort(distances)
        topK = [self._y_train[i] for i in sort[:self.k]]
        votes = Counter(topK)
        y_predict = votes.most_common(1)[0][0]
        return y_predict

With that, we have a simple encapsulation of the kNN algorithm in the style of Sklearn. Save it as the file kNN_Euler.py and test it in a Jupyter Notebook:

First get the basic data:

import numpy as np

# sample set
X_raw = [[13.23, 5.64],
         [13.2 , 4.38],
         [13.16, 4.68],
         [13.37, 4.8 ],
         [13.24, 4.32],
         [12.07, 2.76],
         [12.43, 3.94],
         [11.79, 3.  ],
         [12.37, 2.12],
         [12.04, 2.6 ]]
X_train = np.array(X_raw)

# labels
y_raw = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_train = np.array(y_raw)

# sample to be predicted
x_test = np.array([12.08, 3.3])
X_predict = x_test.reshape(1, -1)

Note: when there is only one sample to be predicted, be sure to reshape it to (1, -1), i.e. a 2D array with one row, otherwise there will be an error.
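To see why, here is a small sketch of the shapes before and after reshaping, using the same sample as above:

import numpy as np

x_test = np.array([12.08, 3.3])
print(x_test.shape)                 # (2,)  -> a 1D array with a single axis
X_predict = x_test.reshape(1, -1)
print(X_predict.shape)              # (1, 2) -> one row (one sample), two features
# predict() reads X_predict.shape[1], which only exists for a 2D array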

Running programs in the Jupyter Notebook uses the magic command %run:

%run kNN_Euler.py

This runs the kNN_Euler.py program directly. We can then call the kNNClassifier class defined in it, passing the parameter k = 3, and name the instance kNN_classify:

kNN_classify = kNNClassifier(3)

Next, pass the sample set X_train, y_train to the instance’s fit method:

kNN_classify.fit(X_train,y_train)

After fit is done, pass the sample to be predicted, X_predict, into predict, and we get the classification result:

y_predict = kNN_classify.predict(X_predict)
y_predict

[out]: 1

The answer is 1, the same result both methods gave yesterday.

Not difficult, right?

Going further, what if we want to predict not one point at a time but several, for example, which categories the following two points belong to?

Can the algorithm still give the predictions? Of course it can. We only need to tweak the encapsulated algorithm slightly, changing the predict function as follows:

    def predict(self, X_predict):
        # list comprehension: collect the prediction for each sample, then return them together
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        # _predict is a private helper that classifies a single sample
        assert self._X_train is not None, "must fit before predict"
        assert self._y_train is not None, "must fit before predict"

        distances = [sqrt(np.sum((x_train - x)**2)) for x_train in self._X_train]
        sort = np.argsort(distances)
        topK = [self._y_train[i] for i in sort[:self.k]]
        votes = Counter(topK)
        y_predict = votes.most_common(1)[0][0]
        return y_predict

predict uses a list comprehension to collect the predicted category of each sample, where each prediction comes from the _predict function. The leading underscore in _predict again marks it as a private, internal function: it is only used inside the class and is not meant to be called from outside.
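Putting the pieces together, the whole class looks roughly like this. This is a sketch assembled from the snippets in this article, not Sklearn’s actual source:

import numpy as np
from math import sqrt
from collections import Counter

class kNNClassifier:
    def __init__(self, k):
        self.k = k                    # number of nearest neighbours
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train and y_train must have the same number of samples"
        assert self.k <= X_train.shape[0], \
            "k must be no larger than the number of training samples"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "X_predict must have the same number of features as X_train"
        return np.array([self._predict(x) for x in X_predict])

    def _predict(self, x):
        # classify a single sample by majority vote of its k nearest neighbours
        distances = [sqrt(np.sum((x_train - x)**2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK)
        return votes.most_common(1)[0][0]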

After the algorithm is rewritten, we only need to pass in several samples to predict; here we pass two:

X_predict = np.array([[12.08, 3.3],
                      [12.8 , 4.1]])

Output the prediction results:

y_predict = kNN_classify.predict(X_predict)
y_predict

[out]: array([1, 0])

Look, it returns two values: the first sample is classified as 1, Cabernet Sauvignon, and the second as 0, Pinot Noir. That is exactly the expected result. Perfect.

So far, we have written the kNN algorithm in the way Sklearn encapsulates its algorithms, but the kNN algorithm in Sklearn is much more complex than this, because there is a lot more to consider, for example one notable shortcoming of kNN: **computation time**. Simply put, the running time of kNN depends heavily on the number of samples and the number of features; when the dimensionality is very high, the running time grows rapidly. The specific reasons and improvements will be discussed later.
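A rough way to see this effect (a toy timing experiment, with sizes chosen arbitrarily) is to time the brute-force distance computation as the number of features grows:

import time
import numpy as np

# Toy experiment: brute-force distance computation time vs. number of features
n_samples = 5000
for n_features in (10, 100, 1000):
    X_big = np.random.random((n_samples, n_features))
    x = np.random.random(n_features)
    start = time.time()
    distances = np.sqrt(np.sum((X_big - x)**2, axis=1))   # distance to every sample
    print(n_features, "features:", round(time.time() - start, 4), "seconds")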

Now there is an important question: we have built and tested kNN entirely on the training set, but we do not yet know how well it actually predicts, i.e. how accurate it is. In the next article, we will talk about how to measure the effectiveness of kNN.

The Jupyter Notebook code of this article can be obtained by replying “kNN2” in the background of my official account: “Advanced migrant Workers”. Come on!