The k-nearest neighbor (kNN) algorithm is a classical supervised classification algorithm. Its core idea: if most of the k nearest samples of a given sample in the feature space belong to a certain category, then the sample is also classified into that category.
1. Algorithm steps
- Prepare training data and test data;
- Determine the parameter k;
- Compute the distance between the test sample and every training sample, and sort the distances in ascending order;
- Select the k points with the smallest distances;
- Count how often each category occurs among these k points;
- Return the category with the highest frequency among the k points as the predicted class of the test sample.
2. kNN implementation in Python
2.1 Algorithm Implementation
```python
# python 3.7.2
from numpy import *
import operator

def kNNClassify(testData, trainData, labels, k):
    dataSize = trainData.shape[0]
    diffMat = tile(testData, (dataSize, 1)) - trainData  # difference between the test sample and every training sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)                  # sum the squared differences row by row
    distances = sqDistances ** 0.5                       # Euclidean distances
    sortedDisIndexes = distances.argsort()               # indexes that sort the distances in ascending order
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDisIndexes[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
Assume that the training data is:
```python
trainData = [[1, 1.1], [1, 1], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']
```
The test data are as follows:
```python
testData = [[1.1, 1], [0.1, 0]]
```
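As a quick check, the classifier above can be run directly on this toy data. The snippet below is a minimal sketch assuming the kNNClassify function from section 2.1; the lists are converted to NumPy arrays because the function relies on .shape and array arithmetic, and k = 3 is an arbitrary choice for illustration.

```python
from numpy import array

trainData = array([[1, 1.1], [1, 1], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

# Classify each test sample with k = 3
for sample in [[1.1, 1], [0.1, 0]]:
    print(sample, '->', kNNClassify(array(sample), trainData, labels, 3))
# Prints 'A' for [1.1, 1] and 'B' for [0.1, 0]
```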
2.2 In practice: matching people on a dating site
A user of a dating site rates every person she browses as largeDoses, smallDoses, or didntLike. Each rating is based on 3 features:
- Miles traveled per year
- Fraction of the day spent playing games
- Number of desserts eaten per week
1000 related records were collected and stored in datingTestSet.txt:
```
40920   8.326976    0.953952    largeDoses
14488   7.153469    1.673904    smallDoses
26052   1.441871    0.805124    didntLike
75136   13.147394   0.428964    didntLike
38344   1.669788    0.134296    didntLike
72993   10.141740   1.032955    didntLike
35948   6.830792    1.213192    largeDoses
42666   13.276369   0.543880    largeDoses
67497   8.631577    0.749278    didntLike
35483   12.273169   1.508053    largeDoses
50242   3.723498    0.831917    didntLike
63275   8.385879    1.669485    didntLike
5569    4.875435    0.728658    smallDoses
51052   4.680098    0.625224    didntLike
...
```
2.2.1 Read text file data and construct matrix
```python
def file2Matrix(filename):
    love_dictionary = {'largeDoses': 1, 'smallDoses': 0, 'didntLike': -1}
    fr = open(filename)
    arrayOfLines = fr.readlines()
    numOfLines = len(arrayOfLines)
    dataMatrix = zeros((numOfLines, 3))   # feature matrix
    classLabels = []                      # label list
    index = 0
    for line in arrayOfLines:
        line = line.strip()
        listFromLine = line.split('\t')
        dataMatrix[index, :] = listFromLine[0:3]
        classLabels.append(love_dictionary.get(listFromLine[-1]))
        index += 1
    return dataMatrix, classLabels
```
2.2.2 Data normalization
There is a large difference between the value ranges of the features, and using them directly would seriously distort the distance calculation, so normalization is required: newValue = (oldValue - min) / (max - min)
```python
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # min(0) returns the minimum of each column; min(1) would return the minimum of each row
    maxVals = dataSet.max(0)   # column-wise maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet
```
Finally, call the kNNClassify function to test the classifier; the test code is omitted here.
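For completeness, the following is a minimal sketch of what such a test could look like, assuming the file2Matrix, autoNorm, and kNNClassify functions above; the 10% hold-out split and k = 3 are illustrative assumptions, not values from the original text.

```python
def datingClassTest():
    testRatio = 0.10                                    # fraction of records held out for testing
    dataMat, labels = file2Matrix('datingTestSet.txt')  # load raw data
    normMat = autoNorm(dataMat)                         # normalize features to [0, 1]
    m = normMat.shape[0]
    numTest = int(m * testRatio)
    errorCount = 0
    for i in range(numTest):
        # Train on the last 90% of the records, test on the first 10%
        result = kNNClassify(normMat[i, :], normMat[numTest:m, :], labels[numTest:m], 3)
        if result != labels[i]:
            errorCount += 1
    print('error rate: %f' % (errorCount / float(numTest)))
```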
3. Advantages and disadvantages of the algorithm
3.1 Advantages
- Simple, easy to understand, easy to implement;
- Suitable for numerical attribute classification;
- Suitable for multi-modal problems (where an object can carry multiple category labels), on which kNN performs better than SVM.
3.2 Disadvantages
- When the classes are imbalanced, e.g. one class has a much larger sample size than the others, the large class may dominate the k nearest neighbors of a new sample and bias the classification.
- The amount of computation is large: the distance between each sample to be classified and every known sample must be calculated.
4. Improvement strategies
The improvement strategies fall into two directions: classification efficiency and classification effectiveness:
- Classification efficiency: reduce the sample attributes in advance by deleting attributes that contribute little to the classification result. kNN is well suited to automatic classification of class domains with large sample sizes, while class domains with small sample sizes are more prone to misclassification.
- Classification effectiveness: ① use a weighting scheme in which neighbors closer to the sample receive larger weights, such as WAkNN (weighted adjusted k-nearest neighbor); a minimal sketch of this idea follows this list; ② select a different number of nearest neighbors for each category according to the number of training documents in that category; ③ the class-center approach: compute the distance between the test sample and the center of each class, and assign the sample to the nearest class.
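As an illustration of the weighting idea in ①, the sketch below replaces the simple vote count in kNNClassify with inverse-distance weights. It is not WAkNN itself, only the basic distance-weighted voting principle; the eps term, which avoids division by zero, is an implementation assumption and not part of the original text.

```python
from numpy import tile
import operator

def weightedKNNClassify(testData, trainData, labels, k, eps=1e-6):
    dataSize = trainData.shape[0]
    diffMat = tile(testData, (dataSize, 1)) - trainData
    distances = (diffMat ** 2).sum(axis=1) ** 0.5
    sortedDisIndexes = distances.argsort()
    classWeights = {}
    for i in range(k):
        voteLabel = labels[sortedDisIndexes[i]]
        # Closer neighbors contribute a larger (inverse-distance) weight to their class
        weight = 1.0 / (distances[sortedDisIndexes[i]] + eps)
        classWeights[voteLabel] = classWeights.get(voteLabel, 0.0) + weight
    return max(classWeights.items(), key=operator.itemgetter(1))[0]
```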
References
- Peter Harrington, Machine Learning in Action