Selected from TowardsDataScience

By Vihar Kurama

Compiled by Heart of the Machine

Contributors: Chen Yunzhu, Lu Xue

This article introduces the basic supervised learning methods from two angles, classification and regression, and demonstrates examples using Scikit-learn.

Why artificial intelligence and machine learning?

The future of the planet lies in artificial intelligence and machine learning. People who remain ignorant of these technologies can quickly find themselves behind the times, because the world is changing fast and incredible changes happen every day. Artificial intelligence and machine learning offer many implementations and techniques that can solve real-world problems, and supervised learning is among the most common.

“AI is all about representation.” – Jeff Hawkins

What is supervised learning?

In supervised learning, we start with a dataset containing training attributes and target attributes. The supervised learning algorithm learns the relationship between the training samples and their target variables, and then applies that learned relationship to classify new inputs whose targets are unknown.

To illustrate how supervised learning works, consider the case of predicting student performance based on the number of hours spent studying.

The mathematical formula is as follows:

Y = f(X) + C

where f represents the relationship between preparation time and test score, X is the input (hours studied), Y is the output (the student’s test score), and C is the random error.

The ultimate goal of a supervised learning algorithm is to predict the Y value for a given new input X with the highest possible accuracy. There are several approaches to supervised learning; we’ll explore some of the most common.
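For instance, under a hypothetical linear relationship f(X) = 10X + 40 (numbers invented purely for illustration), a student who studies X = 4 hours would be predicted to score Y = 10 × 4 + 40 = 80, with C absorbing the gap between that prediction and the actual score.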

Depending on the dataset, a machine learning problem falls into one of two categories: classification and regression. If the target values to be predicted are discrete category labels, it is a classification problem. If the target is a continuous numeric value, it is a regression problem.

Classification: the output is a label. Is it a cat or a dog?
Regression: how much will the house sell for?

Classification problem

Let’s take an example. A medical researcher wants to analyze breast cancer data to predict which of three treatments a patient should receive. This data analysis task is classification: the model or classifier built must predict category labels such as “therapy 1,” “therapy 2,” and “therapy 3.”

Classification problems predict discrete, unordered category labels. The process is divided into two stages: the learning stage and the classification stage.

Classification methods and how to choose the most appropriate method

The most commonly used algorithms include:

1. K-nearest neighbors (KNN)

2. Decision trees

3. Naive Bayes

4. Support vector machines (SVM)

In the learning stage, the classification model constructs a classifier by analyzing the training set. In the classification stage, the model predicts category labels for given data. The dataset tuples to be analyzed and their associated category labels are split into a training set and a test set: we randomly select some tuples from the dataset to form the training set, and the remaining tuples form the test set. The two sets are independent of each other, that is, the test set does not participate in the training process.

The test set is used to evaluate the predictive accuracy of the classifier, i.e., the percentage of test-set tuples that the classifier labels correctly. To achieve higher accuracy, it is best to test different algorithms and tune the parameters of each. The best classifier can then be selected through cross-validation.

To choose the right algorithm for a task, we must weigh each algorithm’s accuracy, training time, linearity, number of parameters, and special cases.
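One hedged way to make this comparison concrete is cross-validation in Scikit-learn. The sketch below scores the four classifiers listed above on the IRIS data; the 5-fold setting and the default hyperparameters are arbitrary choices for illustration, not a recommendation.

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

iris = datasets.load_iris()

# Candidate classifiers: KNN, decision tree, naive Bayes, SVM
models = {
    'KNN': KNeighborsClassifier(),
    'Decision tree': DecisionTreeClassifier(),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(),
}

# Mean 5-fold cross-validation accuracy for each candidate
for name, model in models.items():
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    print(name, scores.mean())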

Below, we implement the KNN algorithm on the IRIS dataset using Scikit-learn and predict the flower species from a given input.

First, we need to understand and explore the given dataset before applying any machine learning algorithm. In this example, we use the IRIS dataset imported from Scikit-learn. Now let’s look at the code and analyze the dataset.

Make sure you have Python installed on your computer. Then, use pip to install the following packages:

pip install pandas
pip install matplotlib
pip install scikit-learn

In the code snippet below, we call several methods in Pandas to learn about the properties of the IRIS dataset.
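A minimal sketch of that exploration (assuming load_iris and a pandas DataFrame for the preview) could look like this:

from sklearn import datasets
import pandas as pd

# Load the IRIS dataset bundled with Scikit-learn
iris = datasets.load_iris()

# Inspect the container type and the keys it exposes
print(type(iris))
print(iris.keys())

# Inspect the types and shape of the underlying arrays
print(type(iris.data))
print(type(iris.target))
print(iris.data.shape)

# Preview the first five rows as a pandas DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head())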

Output:

<class 'sklearn.datasets.base.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(150, 4)
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

The K-nearest neighbor algorithm in Scikit-learn

An algorithm is a lazy learning algorithm if it simply stores the training tuples and waits to process them until a test tuple arrives. Such an algorithm performs generalization only when it receives test data, classifying the test data based on its similarity to the stored training data.

The K-nearest neighbor classifier is a lazy learning algorithm.

KNN is based on learning by analogy, that is, comparing a given test tuple with training tuples that are similar to it. Each training tuple is described by n attributes and represents a point in n-dimensional space, so all training tuples are stored in an n-dimensional pattern space. When an unknown tuple is input, the K-nearest neighbor classifier searches the pattern space for the k training tuples closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple.

“Closeness” is defined by a distance measure, such as Euclidean distance. The appropriate value of k is determined experimentally.
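To make the pattern-space search concrete, here is a minimal NumPy sketch of the lazy-learning step: store the training tuples, then, for an unknown tuple, find the k Euclidean-nearest stored tuples and take a majority vote. The toy arrays are invented for illustration.

import numpy as np

# Stored training tuples (points in n-dimensional space) and their labels
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    # Euclidean distance from the unknown tuple to every stored tuple
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' class labels
    return np.bincount(y_train[nearest]).argmax()

print(knn_predict(np.array([1.2, 1.9])))  # expected: 0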

In the code snippet below, we import the KNN classifier from Sklearn and use it for our input data, which is then used to classify flowers.

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Load iris dataset from sklearn
iris = datasets.load_iris()

# Declare an instance of the KNN classifier class with n_neighbors=6
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model with training data and target values
knn.fit(iris['data'], iris['target'])

# Provide data whose class labels are to be predicted
X = [
    [5.9, 1.0, 5.1, 1.8],
    [3.4, 2.0, 1.1, 1],
]

# Print the data provided
print(X)

# Store predicted class labels of X
prediction = knn.predict(X)

# Print the predicted class labels of X
print(prediction)

Output:

[1 1]

Here,

0 corresponds to Setosa

1 corresponds to Versicolor

2 corresponds to Virginica

Based on the given input, the KNN classifier predicts that both flowers are Versicolor.
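Continuing from the snippet above, the numeric labels can be mapped back to species names by indexing into the dataset’s target_names array:

# Map numeric class labels back to species names
print(iris['target_names'][prediction])  # e.g. ['versicolor' 'versicolor']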

(Figure: visualization of KNN classification on the IRIS dataset)

Regression

We often refer to the process of determining the relationship between two or more variables as regression. For example, predicting someone’s income from given input data X.

The target variable here is the unknown we want to predict, and continuity means there are no gaps (discontinuities) in the values Y can take.

Predicting income is a classic regression problem. The input data should include all the information (also called features) that helps predict income, such as hours worked, education, job title, and place of residence.

Regression models

The most commonly used regression models are:

  • Linear regression

  • Logistic regression

  • Polynomial regression

Linear regression uses a best-fit line (i.e., a regression line) to establish an association between the dependent variable Y and one or more independent variables X.

The mathematical formula is as follows:

h(xi) = β0 + β1 * xi + e

where β0 is the intercept, β1 is the slope of the regression line, and e is the error term.

(Figure: a best-fit regression line through the data points)
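As a hedged sketch of how β0 and β1 are obtained, the closed-form least-squares estimates can be computed directly with NumPy; the data points below are invented for illustration.

import numpy as np

# Toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Least-squares estimates:
# beta1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)

# Prediction on the regression line: h(xi) = beta0 + beta1 * xi
print(beta0 + beta1 * 6.0)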

The Logistic regression algorithm is applied when the dependent variable is categorical. The idea of Logistic regression is to find the relationship between the features and the probability of a particular output.

The mathematical formula is as follows:

log(p(X) / (1 - p(X))) = β0 + β1 * X

where

p(X) = Pr(y = 1 | X)

(Figure: the logistic (sigmoid) curve, mapping inputs to probabilities between 0 and 1)
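A minimal Scikit-learn sketch of logistic regression on a toy binary problem (the data are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data with binary labels, invented for illustration
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit the model: it learns the relationship between the feature and p(y = 1 | x)
model = LogisticRegression()
model.fit(X, y)

# Predicted probabilities [p(y = 0 | x), p(y = 1 | x)] for a new input
print(model.predict_proba([[3.5]]))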

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial in x.
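A hedged sketch of polynomial regression in Scikit-learn: PolynomialFeatures expands x into powers up to the chosen degree, and a linear model is then fit on the expanded features. The degree-2 setting and the toy data are arbitrary choices for illustration.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data following a roughly quadratic trend, invented for illustration
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.1])

# Degree-2 polynomial regression: y modeled as b0 + b1*x + b2*x^2
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

# Predict y for a new x
print(model.predict([[6.0]]))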

Solving a linear regression problem

Given a dataset X and corresponding target values Y, we train a linear model using ordinary least squares. With this model, we can predict the output y for a new, unseen input x with the smallest possible error.

The given data is split into a training set and a test set. The training set is labeled (its feature values come with known targets), so the algorithm can learn from these labeled samples. The test set is unlabeled, meaning the values to be predicted are unknown.

As an example, we take a single feature for training, fit the training set with linear regression, and then make predictions on the test set.

Implement linear regression in Scikit-learn

from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
import numpy as np

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature for training
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Input data
print('Input Values')
print(diabetes_X_test)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# Predicted data
print("Predicted Output Values")
print(diabetes_y_pred)

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='red', linewidth=1)
plt.show()

Output:

Input Values
[[ 0.07786339]
 [-0.03961813]
 [ 0.01103904]
 [-0.04069594]
 [-0.03422907]
 [ 0.00564998]
 [ 0.08864151]
 [-0.03315126]
 [-0.05686312]
 [-0.03099563]
 [ 0.05522933]
 [-0.06009656]
 [ 0.00133873]
 [-0.02345095]
 [-0.07410811]
 [ 0.01966154]
 [-0.01590626]
 [-0.01590626]
 [ 0.03906215]
 [-0.0730303 ]]
Predicted Output Values
[225.9732401  115.74763374 163.27610621 114.73638965 120.80385422
 158.21988574 236.08568105 121.81509832  99.56772822 123.83758651
 204.73711411  96.53399594 154.17490936 130.91629517  83.3878227
 171.36605897 137.99500384 137.99500384 189.56845268  84.3990668 ]

The plot of (diabetes_X_test, diabetes_y_pred) is linear and continuous: the predicted points lie on the fitted regression line.

Original link: https://towardsdatascience.com/supervised-learning-with-python-cf2c1ae543c1

This article was compiled by Heart of the Machine. Please contact this official account for reprint authorization.

✄ ------------------------------------------------

Join Heart of the Machine (full-time reporter/intern) : [email protected]

Contribute or seek coverage: [email protected]

Advertising & Business partnerships: [email protected]