Built-in datasets in the sklearn package

Data is the key to machine learning. Much of the work in a machine learning project goes into collecting and cleaning data, and sound, well-prepared data is the basis of good results. Generally speaking, a classification problem uses four pieces of data:

- Training data, generally represented by 'train'
- The class labels of the training data, generally represented by 'target'
- Test data, generally represented by 'test'
- The true class labels of the test data, used to evaluate classifier performance

sklearn includes a variety of useful datasets for learning and testing machine learning, covering text processing, image recognition and other representative problems. The iris dataset used in this article can also be found in the 'datasets' module of sklearn.
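As a quick illustration of what the 'datasets' module provides, the sketch below loads the iris dataset and prints its basic layout. It only uses `datasets.load_iris()` and the attributes of the object it returns; nothing here is specific to this article's later code.

```python
from sklearn import datasets

# Load the built-in iris dataset.
iris = datasets.load_iris()

# Sample data: one row per flower, one column per measured attribute.
print(iris.data.shape)
# Labels: one integer class label per sample.
print(iris.target.shape)
# Human-readable names of the attributes and of the classes.
print(iris.feature_names)
print(iris.target_names)
```

The sample data and the labels are plain NumPy arrays, so they can be sliced and shuffled with ordinary NumPy indexing.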

KNN algorithm implementation

Without further ado, let's go straight to the code first and then walk through the details.
# -*- coding: utf-8 -*-
from sklearn import datasets  # the built-in dataset module
from sklearn.neighbors import KNeighborsClassifier  # the KNN classifier from sklearn.neighbors
import numpy as np

# Set the random seed. Without a fixed seed, each run produces a different random
# sequence; with the same seed, every run produces the same one.
# (The seed value is assumed to be 0; any fixed integer works.)
np.random.seed(0)

# Load the iris dataset. For supervised learning it provides sample data and label data.
iris = datasets.load_iris()
# Sample data: a 150*4 two-dimensional array, i.e. 150 samples with 4 attributes each
# (the length and width of the sepal and of the petal).
iris_x = iris.data
# Labels: a one-dimensional array of length 150, one class label per sample.
iris_y = iris.target

# permutation accepts a number (150) and returns the integers 0 to 149 in random order.
# It also accepts a 1-D array, in which case it returns a shuffled copy of that array.
indices = np.random.permutation(len(iris_x))
# Randomly select 140 samples as the training data set
iris_x_train = iris_x[indices[:-10]]
# and take the labels of those 140 samples as the training labels.
iris_y_train = iris_y[indices[:-10]]
# The remaining 10 samples are used as the test data set
iris_x_test = iris_x[indices[-10:]]
# and their labels as the true labels of the test data.
iris_y_test = iris_y[indices[-10:]]

# Initialize a KNN classifier object.
knn = KNeighborsClassifier()
# Train it. fit takes two parameters: the training data set and its labels.
knn.fit(iris_x_train, iris_y_train)
# Predict. predict takes one parameter: the test data set.
iris_y_predict = knn.predict(iris_x_test)
# Calculate the class probabilities predicted for each test sample.
probility = knn.predict_proba(iris_x_test)
# Find the 5 training points closest to the last test sample and return
# an array of their row indices in the training set.
neighborpoint = knn.kneighbors(iris_x_test[-1:], n_neighbors=5, return_distance=False)
# Call the scoring method of the object to calculate the accuracy.
score = knn.score(iris_x_test, iris_y_test)

# Output the predictions.
print('iris_y_predict = ')
print(iris_y_predict)
# Output the true labels of the test data set for easy comparison.
print('iris_y_test = ')
print(iris_y_test)
# Output the accuracy.
print('Accuracy:', score)
print('neighborpoint of last test sample:', neighborpoint)
print('probility:', probility)

# result output:
iris_y_predict = 
[1 2 1 0 0 0 2 1 2 0]
iris_y_test = 
[1 1 1 0 0 0 2 1 2 0]
Accuracy: 0.9
neighborpoint of last test sample: [[ 75  41  96  78 123]]
probility: [[0.  1.  0. ]
 [0.  0.4 0.6]
 [0.  1.  0. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [0.  0.  1. ]
 [0.  1.  0. ]
 [0.  0.  1. ]
 [1.  0.  0. ]]
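
The probability rows in this output can be read as vote fractions. With 5 neighbors and equal-weight voting, a row such as [0, 0.4, 0.6] means that two of the five nearest neighbors belong to class 1 and three belong to class 2. The sketch below checks this reading by counting the neighbor labels by hand and comparing the result with `predict_proba`. It assumes a 140/10 split like the one in the code above; the names `neighbor_labels` and `manual_proba` are illustrative only.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
rng = np.random.RandomState(0)  # assumed seed, used only for repeatability
indices = rng.permutation(len(iris.data))
x_train, y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]
x_test = iris.data[indices[-10:]]

knn = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)

# Row indices (into x_train) of the 5 nearest training samples for each test sample.
neighbor_idx = knn.kneighbors(x_test, n_neighbors=5, return_distance=False)
# Labels of those neighbors, shape (10, 5).
neighbor_labels = y_train[neighbor_idx]

# For each test sample, the fraction of the 5 votes that each class receives.
n_classes = len(knn.classes_)
manual_proba = np.stack(
    [np.bincount(row, minlength=n_classes) for row in neighbor_labels]
) / 5.0

# Compare the hand-counted fractions with the classifier's own probabilities.
print(np.allclose(manual_proba, knn.predict_proba(x_test)))
```

Note that the indices returned by `kneighbors` refer to rows of the training set, not to rows of the original 150-sample array.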
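Parameters of KNeighborsClassifier:

- n_neighbors=5: int parameter. The number of nearest neighbors that take part in the vote. The default is 5.
- weights='uniform': str parameter. How much weight each voting neighbor carries. 'uniform' means every neighbor's vote counts equally; 'distance' means votes are weighted by the inverse of the distance, so closer neighbors count more. The default is 'uniform'.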
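
To see the two voting schemes side by side, the following sketch trains one classifier per `weights` setting on the same split and records the test accuracy of each. The seed, the 140/10 split and the `scores` dictionary are assumptions made for illustration; on a test set this small the two settings may well give the same accuracy.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
rng = np.random.RandomState(0)  # assumed seed, used only for repeatability
indices = rng.permutation(len(iris.data))
train, test = indices[:-10], indices[-10:]

scores = {}
for weights in ('uniform', 'distance'):
    # Same number of neighbors, different voting scheme.
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn.fit(iris.data[train], iris.target[train])
    scores[weights] = knn.score(iris.data[test], iris.target[test])

for weights, score in scores.items():
    print(weights, score)
```

Changing `n_neighbors` in the same loop is an easy way to explore how the number of voters affects the result.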