The previous article briefly introduced linear regression, logistic regression, and Bayesian classification respectively, and implemented them in Python. This paper introduces the simpler KNN, k-neighbor algorithm (KNN, K-Nearest neighbor). K-neighbor algorithm (kNN, k-NearestNeighbor) is one of the simplest machine learning classification algorithms. Its core idea is to use the classification of k sample data closest to the target to represent the classification of the target (the K sample data is most similar to the target data).

The principle of

The core idea of kNN algorithm is to represent the classification of target data with k sample data that are closest to each other (in various ways of measuring distance).

Specifically, there is a training sample set, and each sample contains data characteristics and classification values. Input new data, compare the data with each sample in the training sample set, and find the k with the closest distance. Among the K data, the classification with more occurrences can be used as the classification of the new data.

  • Supervised learning: The training sample set contains classified information
  • The algorithm is simple and easy to understand and implement
  • Results are influenced by k values, which generally do not exceed 20.
  • A large amount of calculation is required to calculate the distance from each sample in the sample set.
  • Imbalanced training sample sets lead to inaccurate results

Let’s do a simple implementation using Oython and try it out for dating site matchmaking.

Simple Python implementation

def classify(inX, dataSet, labels, k):
    """Define KNN algorithm classifier function: Param inX: Test data: Param datasets: training data: Param labels: category: Param K: K value :return: belonging category"""

    dataSetSize = dataSet.shape[0]  #shape (m, n) m column n features
    diffMat = np.tile(inX, (dataSetSize, 1)) -dataset sqDiffMat = diffMat ** 2 sqdiffmat. sum(axis=1) Cull = sqDiffMat ** 0.5# Euclidean distance
    sortedDistIndicies = distances.argsort()  Sort and return index

    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1 #default 0

    sortedClassCount = sorted(classCount.items(), key=lambda d:d[1], reverse=True)
    return sortedClassCount[0][0]
Copy the code

The steps of the algorithm are described in detail above, the above calculation is matrix operation, the next function is algebraic operation, to make a comparison.

def classify_two(inX, dataSet, labels, k):
    m, n = dataSet.shape   # shape (m, n) m column n features
    Calculate the Euclidean distance from the test data to each point
    distances = []
    for i in range(m):
        sum = 0
        for j in range(n):
            sum += (inX[J] -dataset [I][J]) ** 2 culture. append(sum ** 0.5) sortDist = sorted(accommodate)The category to which the most recent values belong
    classCount = {}
    for i in range(k):
        voteLabel = labels[ distances.index(sortDist[i])]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1 # 0:map default
    sortedClass = sorted(classCount.items(), key=lambda d:d[1], reverse=True)
    return sortedClass[0][0]
Copy the code

With the classifiers above, here’s the simplest experiment to predict:

Def createDataSet () : group = np array ([[1, 1.1], [1, 1], [0, 0], [0, 0.1]]) labels = ['A'.'A'.'B'.'B']
    return group, labels
Copy the code

Above is a simple set of training samples.

if __name__ == '__main__':
    dataSet, labels = createDataSet()
    r = classify_two([0, 0.2], dataSet, labels, 3)
    print(r)
Copy the code

Execute the above function: you can see that the output B, [0,0.2] should be classed as B.

The simplest kNN classifier is shown above. Here is an example.

KNN is used to judge the popularity of people on dating websites

Some data of the training sample set are as follows:

40920	8.326976	0.953952	3
14488	7.153469	1.673904	2
26052	1.441871	0.805124	1
75136	13.147394	0.428964	1
38344	1.669788	0.134296	1
Copy the code

The first column represents the number of frequent flyer miles earned per year, the second column represents the percentage of time spent playing video games, and the third column represents the number of litres of ice cream consumed per week. The fourth column represents the classification results, 1, 2 and 3 are dislike, average and very attractive respectively.

  1. Convert the data to NUMpy.
# Convert text to numpy
def file2matrix(filepath="datingSet.csv"):
    dataSet = np.loadtxt(filepath)
    returnMat = dataSet[:, 0:-1]
    classlabelVector = dataSet[:, -1:]
    return returnMat, classlabelVector
Copy the code
  1. First of all, have a perception of data, know which features affect classification, and carry out visual data analysis.
# 2, 3 column data were analyzed
def show_2_3_fig():
    data, cls = file2matrix()
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(data[:, 1], data[: ,2], c=cls)
    plt.xlabel("playing game")
    plt.ylabel("Icm Cream")
    plt.show()
Copy the code


You can see that there are distinct distinctions between different people based on their characteristics. Therefore, kNN algorithm can be used for classification and prediction.

  1. Because distance comparison is used later, the influence of the previous data is large, such as the difference between the number of airplane miles and the number of ice cream. Therefore, it is necessary to normalize data.
# Data normalization
def autoNorm(dataSet):
    minVal = dataSet.min(0)
    maxVal = dataSet.max(0)
    ranges = maxVal - minVal

    normDataSet = np.zeros(dataSet.shape)
    m, n = dataSet.shape  # line, feature
    normDataSet = dataSet - minVal
    normDataSet = normDataSet / ranges
    return normDataSet, ranges, minVal
Copy the code
  1. The accuracy of KNN algorithm can be measured by the accuracy rate or error rate. An error rate of 0 means that the classification is good. Therefore, 10% of the training sample can be used for testing and 90% for training.
Define the function that tests the algorithmDef datingClassTest (h = 0.1) : hoRatio = h datingDataMat, datingLabels = file2matrix() normMat, ranges, minVals = autoNorm(datingDataMat) m, n = normMat.shape numTestVecs = int(m * hoRatio)# test the number of rows
    errorCount = 0  # error classification number


    # Test with the top 10%
    for i in range(numTestVecs):
        classifierResult = classify(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m],  3)
        # print('the classifier came back with: %d,the real answer is: %d' % (int(classifierResult), int(datingLabels[i])))
        ifclassifierResult ! = datingLabels[i]: errorCount += 1print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
Copy the code

Adjust different test proportions and compare results.

  1. KNN was used for prediction. With training samples and classifiers, new data can be predicted. The simulated data and predictions are as follows:
# Make simple predictions
def classifypersion():
    resultList = ["none".'not at all'.'in small doses'.'in large doses']
    # Simulation dataFfmiles = 15360 playing_game = 8.545204 ice_name = 1.340429 datingDataMat, datingLabels = file2matrix() normMat, ranges, minVals = autoNorm(datingDataMat)inArr = np.array([ffmiles, playing_game, ice_name])
    # Normalization of forecast data
    inArr = (inArr - minVals) / ranges
    classifierResult = classify(inArr, normMat, datingLabels, 3)
    print(resultList[int(classifierResult)])
Copy the code

You can see basically what category you’re getting into.

Complete the code and data refer to: Github :kNN

conclusion

  • kNN
  • Supervised learning
  • Data visualization
  • Data normalization does not affect the calculation