Built-in datasets in the sklearn package
Data is the key to machine learning: a great deal of time goes into collecting and cleaning it, and reasonable, representative data is what makes good results possible. Generally speaking, a classification problem works with four pieces of data:

- Training data, usually named 'train'
- Labels of the training data, usually named 'target'
- Test data, usually named 'test'
- True labels of the test data, used to evaluate classifier performance

Sklearn ships with a variety of datasets that are useful for learning and testing machine learning, covering representative problems such as text processing and image recognition. The iris dataset used in this article can be found in sklearn's 'datasets' module.
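For example, a built-in dataset can be loaded and inspected directly from the 'datasets' module. The following is only a minimal sketch; data, target and target_names are standard attributes of the Bunch object that load_iris() returns.

from sklearn import datasets

iris = datasets.load_iris()
# The Bunch object bundles the samples and their labels together
print(iris.data.shape)    # (150, 4): 150 samples, 4 features each
print(iris.target.shape)  # (150,): one class label per sample
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']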
KNN algorithm implementation
Without further ado, let's go straight to the code first and then walk through it in detail.
#-*-coding:utf-8 -*-
from sklearn import datasets
# Import the built-in datasets module
from sklearn.neighbors import KNeighborsClassifier
# Import the KNN classifier from sklearn.neighbors
import numpy as np
np.random.seed(0)
# Set the random seed. Without it the system time is used, so each run would produce different random numbers; fixing the seed makes the split reproducible
iris=datasets.load_iris()
# Load the iris dataset; for supervised learning it is a Bunch structure holding both the sample data and the label data
iris_x=iris.data
# Sample data: a 150x4 2-D array, i.e. 150 samples, each with 4 features (sepal length/width and petal length/width)
iris_y=iris.target
# Label array of length 150, one class label per sample
indices = np.random.permutation(len(iris_x))
# permutation() takes a number (here 150) and returns a randomly shuffled array of the integers 0-149; it can also take a 1-D array and return a shuffled copy of it
iris_x_train = iris_x[indices[:-10]]
# Take 140 of the shuffled samples as the training data set
iris_y_train = iris_y[indices[:-10]]
# And the labels of those 140 samples as the training labels
iris_x_test = iris_x[indices[-10:]]
# The remaining 10 samples are used as test data sets
iris_y_test = iris_y[indices[-10:]]
# And the labels of those remaining 10 samples as the test labels
knn = KNeighborsClassifier()
# Initialize a KNN classifier object
knn.fit(iris_x_train,iris_y_train)
# Call the object's fit method, which takes two arguments: the training data set and its sample labels
iris_y_predict = knn.predict(iris_x_test)
# Call the predict method, which takes one argument: the test data set
probability = knn.predict_proba(iris_x_test)
# Predicted class probabilities for each test sample
neighborpoint = knn.kneighbors(iris_x_test[-1:], 5, return_distance=False)
# Find the 5 training points closest to the last test sample; returns an array of their indices in the training set
score=knn.score(iris_x_test,iris_y_test,sample_weight=None)
# Call the scoring method of the object to calculate the accuracy
print('iris_y_predict = ')
print(iris_y_predict)
# Print the predicted labels for the test set
print('iris_y_test = ')
print(iris_y_test)
# Print the true labels of the test set for easy comparison
print('Accuracy:', score)
# Print the accuracy returned by the score method
print('neighborpoint of last test sample:', neighborpoint)
print('probability:', probability)
# result output:
iris_y_predict =
[1 2 1 0 0 0 2 1 2 0]
iris_y_test =
[1 1 1 0 0 0 2 1 2 0]
Accuracy: 0.9
neighborpoint of last test sample: [[ 75 41 96 78 123]]
probability: [[ 0.   1.   0. ]
 [ 0.   0.4  0.6]
 [ 0.   1.   0. ]
 [ 1.   0.   0. ]
 [ 1.   0.   0. ]
 [ 1.   0.   0. ]
 [ 0.   0.   1. ]
 [ 0.   1.   0. ]
 [ 0.   0.   1. ]
 [ 1.   0.   0. ]]
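The accuracy returned by score() is simply the fraction of test samples whose predicted label matches the true label. As a quick sanity check, here is a minimal sketch using the arrays defined above (not part of the original script):

import numpy as np

# Manual accuracy: proportion of correct predictions (9 of 10 here, i.e. 0.9)
manual_accuracy = np.mean(iris_y_predict == iris_y_test)
print(manual_accuracy)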
A note on the classifier's parameters: n_neighbors=5 (int) sets how many nearest neighbors take part in the vote, and weights (str) controls how those neighbors vote. 'uniform' gives every neighbor an equal vote, while 'distance' weights each vote inversely proportional to the neighbor's distance. The default is weights='uniform'.
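To see the effect of these parameters, the classifier can be constructed with them set explicitly. The sketch below is only illustrative; n_neighbors and weights are the standard KNeighborsClassifier arguments, and knn_uniform / knn_distance are hypothetical names used here for comparison.

from sklearn.neighbors import KNeighborsClassifier

# Same as the defaults used above: 5 neighbors, equal-weight voting
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')

# Inverse-distance voting: closer neighbors count more in the vote
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')

knn_distance.fit(iris_x_train, iris_y_train)
print(knn_distance.score(iris_x_test, iris_y_test))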