Contents

Data preparation

Displaying the data

Model construction

Building the model

The k-nearest neighbors algorithm

Model evaluation

Summary

A closing thought


Data preparation

The iris dataset ships with scikit-learn's datasets module; we just need to load it to open the door to machine learning.

from sklearn.datasets import load_iris
iris_dataset = load_iris()

The iris object returned by load_iris is a Bunch object, much like a dictionary, containing keys and values:
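To see what it holds, we can print its keys. A quick check (the exact set of keys varies a little across scikit-learn versions, so the output comment is indicative):

print("Keys of iris_dataset:\n", iris_dataset.keys())
# Typical output (depends on the scikit-learn version):
# dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', ...])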

How does a Bunch differ from an ordinary dictionary? Take a look at this example:

In [1]: from sklearn.datasets import base
   ...: buch = base.Bunch(A=1, B=2, c=3)

In [2]: type(buch)
Out[2]: sklearn.datasets.base.Bunch

In [3]: buch
Out[3]: {'A': 1, 'B': 2, 'c': 3}   # looks just like a dictionary

In [4]: buch['A']   # can be accessed with dictionary-style indexing
Out[4]: 1

In [5]: buch.A      # can also be accessed as an attribute
Out[5]: 1

In [6]: dt = {'A': 1, 'B': 2, 'C': 3}

In [7]: type(dt)
Out[7]: dict

In [8]: dt['A']
Out[8]: 1

In [9]: dt.A        # a plain dict does not support attribute access -- the key difference
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-7b8328c57719> in <module>()
----> 1 dt.A

AttributeError: 'dict' object has no attribute 'A'

That’s the difference. It is something worth knowing about, though you don't need to master it.

Displaying the data

Print (" dataset['data'][:5] ", iris_dataset['data'][:5])Copy the code

print("Type of data:", type(iris_dataset['data']))

Since the data is a NumPy array, we can use index slicing to select exactly the subset of it we want.
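For example (a small illustration; the slices chosen here are just for demonstration):

# First five rows, as above
print(iris_dataset['data'][:5])

# First five rows of only the first two features (sepal length and sepal width)
print(iris_dataset['data'][:5, :2])

# A single sample: the third row
print(iris_dataset['data'][2])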

Model construction

In the process of machine learning, model selection and construction are very important. A good model can make our data more valuable.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

Data in scikit-learn is usually denoted by a capital X, while labels are denoted by a lowercase y. This was inspired by the standard mathematical formula f(x)=y, where x is the input to a function and y is the output. We use a capital X because the data is a two-dimensional array (a matrix) and a lowercase y because the target is a one-dimensional array (a vector), following mathematical convention.

The train_test_split function in scikit-learn shuffles a dataset and then splits it. By default it takes 75% of the rows (and their labels) as the training set and the remaining 25% as the test set. The ratio of training to test data can be chosen freely, but using 25% of the data as a test set is a good rule of thumb: it leaves enough data to train on while holding back enough to evaluate the model fairly.

To ensure that running the same function multiple times yields the same output, we seed the random number generator with the random_state parameter. This makes the function deterministic, so this line of code always produces the same split.

We can inspect the shapes of the returned arrays to see clearly why four variables are needed to receive them: as explained above, the data is a two-dimensional array and the labels are one-dimensional, for both the training and the test portion.
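A quick check (the shapes in the comments follow from the iris data's 150 samples and the default 75/25 split):

print("X_train shape:", X_train.shape)   # (112, 4)
print("y_train shape:", y_train.shape)   # (112,)
print("X_test shape:", X_test.shape)     # (38, 4)
print("y_test shape:", y_test.shape)     # (38,)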

Observing and checking the data

Before building a machine learning model, it is often a good idea to examine the data: to see whether the task could be solved easily without machine learning, and whether the information we need is actually contained in the data. One of the best ways to examine data is to visualize it, for instance with a scatter plot, which draws each data point as a point with one feature on the x-axis and another feature on the y-axis. Unfortunately, a computer screen has only two dimensions, so we can plot only two (or perhaps three) features at a time, which makes it hard to plot datasets with more than three features this way. One way around this is to draw a pair plot, which looks at all possible pairs of features.

import pandas as pd
import mglearn

# Build a DataFrame from X_train, labelling the columns with the feature names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

# Pair plot: a scatter matrix colored by the training labels
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                           marker='o', hist_kwds={'bins': 20}, s=60,
                           alpha=.8, cmap=mglearn.cm3)

Looking at the plots, we can see that points with different labels show a "birds of a feather flock together" pattern: the classes form fairly distinct clusters, which suggests that a machine learning model can separate them well and that the data is well suited to a classification algorithm.

Parameter explanation

figsize=(15, 15): size of the figure area, in inches.

marker='o': shape of the plotted points ('o' draws circles; other marker codes give other shapes).

hist_kwds={'bins': 20}: keyword arguments passed to the histograms on the diagonal (here, the number of bins).

s=60: size of the plotted points.

alpha=.8: transparency of the points, generally a value in (0, 1).

cmap=mglearn.cm3: a colormap from the mglearn utility library, which bundles private helper functions mainly for prettifying plots.

Building the model

The k-nearest neighbors algorithm

The k in the k-nearest neighbors algorithm means that instead of considering only the single nearest neighbor of a new data point, we can consider any k closest neighbors in the training set (for example, the closest 3 or 5). We then make a prediction using the majority class among these neighbors. The most important parameter of KNeighborsClassifier is the number of neighbors, which we set here to 1.

The knn object encapsulates the algorithm: both the algorithm used to build the model from the training data and the algorithm used to make predictions on new data points. It also holds whatever information the algorithm extracts from the training data. For KNeighborsClassifier, that is simply the training set itself.
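The code for this step is not shown above, so here is a minimal sketch of what it looks like: instantiate the classifier, fit it on the training set, and predict the class of a new flower (the measurement values below are made up for illustration):

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Build the model with a single neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# A new iris: sepal length/width, petal length/width (illustrative values)
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])

prediction = knn.predict(X_new)
print("Prediction:", prediction)                                 # e.g. [0]
print("Predicted target name:",
      iris_dataset['target_names'][prediction])                  # e.g. ['setosa']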



According to our model's prediction, this new iris belongs to class 0, that is, the setosa variety. We could also prompt the user to enter the measurements and return the prediction directly.

Model evaluation

The model can be evaluated by computing its accuracy, that is, the proportion of flowers whose variety is predicted correctly:

We can use the score method of the knn object to compute the accuracy on the test set:
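A minimal sketch, continuing from the knn object above (the exact accuracy depends on the split; with random_state=0 and n_neighbors=1 it comes out around 0.97):

# Accuracy computed by hand: the fraction of correct predictions on the test set
y_pred = knn.predict(X_test)
print("Test set score:", np.mean(y_pred == y_test))

# The same number via the built-in score method
print("Test set score:", knn.score(X_test, y_test))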

Summary

We conceived a task: to use physical measurements of irises to predict their species. We built the model using a dataset of measurements annotated by an expert who gave us the correct variety of each flower, so this is a supervised learning problem. There are three varieties: setosa, versicolor, and virginica, so it is a three-class classification problem. In classification problems, the possible varieties are called classes, and the variety of each flower is called its label.

The iris dataset contains two NumPy arrays: one containing the data, called X in scikit-learn, and one containing the correct (expected) outputs, called y. The array X is a two-dimensional array of features, with one row per data point and one column per feature. The array y is a one-dimensional array containing the class labels, a number from 0 to 2 for each sample.


Given a training dataset, the k-nearest neighbors algorithm finds the k instances in the training set closest to a new input instance; whichever class most of those k instances belong to is the class assigned to the input. (This is similar to the idea of majority rule in everyday life.)

Here, we used the k-nearest neighbors algorithm to build the model. Of course, other algorithm models could be used to make predictions as well, such as logistic regression……
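For instance, swapping in scikit-learn's LogisticRegression is only a few lines (a sketch on the same train/test split as above; max_iter is raised just to keep the solver from warning about convergence):

from sklearn.linear_model import LogisticRegression

# Same split, different model
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
print("Test set score:", logreg.score(X_test, y_test))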

Conditions for using the k-nearest neighbors algorithm:

1. You need a training dataset containing the feature values and their corresponding label values. All features should be normalized before use.

2. Use the training dataset to classify the point to be classified:

Compute the Euclidean distance from the point to be predicted to every point in the training set, take the k points with the smallest distances, and then look at the labels of those k points (see the sketch after this list).

For example, if the labels of the k nearest neighbors are ['dog', 'dog', 'dog', 'fish'], then the result is dog.
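A minimal from-scratch sketch of those two steps (min-max normalization plus a Euclidean-distance majority vote; the function and variable names here are my own):

import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)

    # 1. Min-max normalize each feature so no single feature dominates the distance
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    X_norm = (X_train - lo) / (hi - lo)
    x_norm = (x_new - lo) / (hi - lo)

    # 2. Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_norm - x_norm) ** 2).sum(axis=1))

    # 3. Labels of the k closest points, then a majority vote
    nearest_labels = y_train[np.argsort(distances)[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# With neighbor labels like ['dog', 'dog', 'dog', 'fish'], the vote returns 'dog'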

Characteristics of the k-nearest neighbors algorithm:

Advantages: high accuracy; not sensitive to outliers.

Disadvantages: high computational complexity and space complexity.

Applicable to: labeled numeric data.

A closing thought

Some people graduate at 21 and don't find a job until they are 27. Some have everything figured out straight out of school. Some are in their early 20s, never went to college, and are doing what they love. Some people clearly love each other yet cannot be together. In truth, everything in life runs on its own schedule: some may be way ahead of us, some may be behind us, but everything has its own pace. Unmarried at 30, yet happy, is also a kind of happiness. Be patient; be grounded. As Einstein said, not everything that counts can be counted, and not everything that can be counted counts. What really matters is to break with conventional thinking, gain spiritual freedom, and create a meaningful life of our own. Don't envy, don't resent, and don't be swayed by anything!