Supervised learning (Python)

Why artificial intelligence and machine learning?

AI and machine learning are shaping the future, and anyone who does not understand them risks falling behind. With so much innovation around us, technology can feel more and more like magic. There are many methods and techniques for applying AI and machine learning to practical problems, and supervised learning is one of the most commonly used.

What is supervised learning?

In supervised learning, we start with a data set containing training attributes (inputs) and target attributes (outputs). The supervised learning algorithm learns the relationship between the training samples and their associated target values, then applies that learned relationship to predict the target for new inputs whose target is unknown.

To illustrate how supervised learning works, let’s take the example of predicting a student’s score based on the number of hours they study.

Mathematically, Y = f(X) + C

where f is the relationship between the number of hours a student prepares for the test and the score obtained;

X is the input (the number of hours the student prepared);

Y is the output (the marks the student scored in the exam);

C is the random error.

The ultimate goal of a supervised learning algorithm is to predict Y as accurately as possible for a new input X. Several approaches to supervised learning have been developed, and we’ll explore some of the most common.
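As a rough illustration of Y = f(X) + C, the sketch below generates hypothetical study data and learns an estimate of f from it. The numbers (a base score of 10, 8 marks per hour, noise with standard deviation 2) are assumptions made up for this example, and linear regression is just one possible choice of model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: scores generated as Y = f(X) + C with f(X) = 10 + 8 * X
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=50)   # X: hours studied
noise = rng.normal(0, 2, size=50)     # C: random error
scores = 10 + 8 * hours + noise       # Y: exam score

# Learn an estimate of f from the (X, Y) pairs
model = LinearRegression()
model.fit(hours.reshape(-1, 1), scores)

# Predict Y for a new, unseen X (6 hours of study)
predicted_score = model.predict([[6.0]])[0]
print(model.intercept_, model.coef_[0], predicted_score)
```

The learned intercept and slope should land close to the values used to generate the data, but not exactly on them, because of the random error term.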

Based on the type of target in a given data set, supervised learning problems fall into two categories: classification and regression. In both cases the data has input (training) values and output (target) values. If the target values are discrete class labels (tags), it is a classification problem. If the target values are continuous, it is a regression problem. For example:

Classification: the output is a label. Is it a cat or a dog?

Regression: the output is a number. How much does the house cost?
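One quick way to see the difference is to look at the target values of two data sets bundled with scikit-learn. This is only a sketch; it uses the Iris and diabetes data sets because they ship with the package.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()          # classification: targets are class labels
diabetes = datasets.load_diabetes()  # regression: targets are numeric measurements

print(np.unique(iris.target))        # a small set of discrete labels
print(diabetes.target[:5])           # real-valued targets

n_classes = len(np.unique(iris.target))
n_distinct_values = len(np.unique(diabetes.target))
print(n_classes, n_distinct_values)
```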

Classification

Take the example of a medical researcher who wants to analyze breast cancer data to predict which of three specific treatments a patient should receive. This data analysis task is called classification: a model, or classifier, is built to predict class labels such as “treatment A,” “treatment B,” or “treatment C.”

Classification is a prediction problem in which the model predicts categorical class labels, which are discrete and unordered. It is a two-step process consisting of a learning step and a classification step.

The best way to classify

Some of the most commonly used classification algorithms

1. K-nearest neighbors;

2. Decision tree;

3. Naive Bayes;

4. Support vector machine;

In the learning step, the classification algorithm builds the classifier by analyzing the training set. In the classification step, the classifier predicts the class labels of the given data. The data set tuples and their associated class labels are divided into a training set and a test set. The tuples that make up the training set are randomly sampled from the data set under analysis. The remaining tuples form the test set and are independent of the training tuples, which means they are not used to build the classifier.

The test set is used to estimate the prediction accuracy of the classifier. The accuracy of a classifier is the percentage of test tuples that the classifier correctly classifies. To achieve higher accuracy, the best way is to test different algorithms and try different parameters in each algorithm. The best one can be selected by cross-validation.

To select a good algorithm for a certain problem, the accuracy, training time, linearity, number of parameters and special cases must be considered for different algorithms.
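The split, the accuracy estimate, and the cross-validated choice of parameters described above can be sketched as follows. The 30% test size, the candidate values of k, and the use of 5 folds are arbitrary choices made for illustration, not recommendations.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Hold out 30% of the tuples as a test set that is never used for training
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Compare candidate parameter values with 5-fold cross-validation
# on the training set only
cv_scores = {}
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5
    )
    cv_scores[k] = scores.mean()

best_k = max(cv_scores, key=cv_scores.get)

# Refit with the selected parameter, then estimate accuracy on the test set
clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(best_k, test_accuracy)
```

Keeping the test set out of the parameter search means the final accuracy is measured on tuples the classifier has never seen.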

Tutorial: implementing KNN in scikit-learn on the Iris data set to classify flower species from given inputs.

First, in order to apply a machine learning algorithm, we need to understand and explore the given data set. In this example, we use the Iris data set that ships with the scikit-learn package.

Now let’s dive into the code and explore the Iris data set.

Make sure you have Python installed on your machine. Then use pip to install the following packages:

pip install pandas
pip install matplotlib
pip install scikit-learn

The following code explores the properties of the Iris data set using a few pandas methods.

from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

# Load the Iris dataset from scikit-learn into the iris variable.
iris = datasets.load_iris()

# Print the type of the iris object.
print(type(iris))
# <class 'sklearn.datasets.base.Bunch'>

# Print the dictionary keys of the iris data.
print(iris.keys())

# Print the types of the data and target attributes.
print(type(iris.data), type(iris.target))

# Print the number of rows and columns in the dataset.
print(iris.data.shape)

# Print the target class names.
print(iris.target_names)

# Load the iris training data.
X = iris.data

# Load the iris target set.
Y = iris.target

# Convert the data into a DataFrame.
df = pd.DataFrame(X, columns=iris.feature_names)

# Print the first five tuples of the DataFrame.
print(df.head())

Output:

<class 'sklearn.datasets.base.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
(150, 4)
['setosa' 'versicolor' 'virginica']
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

K-nearest neighbors in scikit-learn

An algorithm is considered a lazy learner if it only stores the tuples of the training set and waits for the test tuples to be given. It performs generalization only when it sees a test tuple so that it can classify tuples based on their similarity to stored training tuples.

The K-nearest neighbor classifier is a lazy learner.

KNN is based on learning by analogy: it compares a given test tuple with training tuples that are similar to it. Each training tuple is described by n attributes and represents a point in an n-dimensional space, so all training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, the K-nearest neighbor classifier searches the pattern space for the k training tuples closest to the unknown tuple. These k training tuples are the k “nearest neighbors” of the unknown tuple, which is then assigned the most common class among them.
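To make the search concrete, here is a minimal from-scratch sketch of the idea. It assumes Euclidean distance and a simple majority vote; the function name knn_predict and the tiny two-cluster training set are hypothetical and exist only for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training tuples."""
    # Euclidean distance from the query to every stored training tuple
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k closest training tuples
    nearest = np.argsort(distances)[:k]
    # Most common class label among those neighbors
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# Tiny hypothetical training set: two clusters in a 2-dimensional space
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_X, train_y, np.array([1.1, 1.0])))  # near the first cluster
print(knn_predict(train_X, train_y, np.array([5.1, 5.0])))  # near the second cluster
```

Note that nothing is computed until a query arrives, which is what makes this a lazy learner: "training" is just storing train_X and train_y.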

In the following code snippet, we import the KNN classifier from scikit-learn, fit it to our input data, and then classify the flowers.

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Load iris dataset from sklearn
iris = datasets.load_iris()

# Create an instance of the KNN classifier with the number of neighbors set to 6.
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model with training data and target values
knn.fit(iris['data'], iris['target'])

# Provide data whose class labels are to be predicted
X = [
    [5.9, 1.0, 5.1, 1.8],
    [3.4, 2.0, 1.1, 4.8],
]

# Prints the data provided
print(X)

# Store predicted class labels of X
prediction = knn.predict(X)

# Prints the predicted class labels of X
print(prediction)

Output:

[[5.9, 1.0, 5.1, 1.8], [3.4, 2.0, 1.1, 4.8]]
[1 1]

Here, 0 corresponds to Setosa,

1 corresponds to Versicolor,

2 corresponds to Virginica.

Based on the given input, KNN predicts that both flowers are Versicolor.
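Rather than memorizing which number stands for which species, the mapping can be read from the data set itself. The sketch below assumes the same classifier settings as above and uses the first tuple of the data set as a stand-in for a new flower.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=6).fit(iris.data, iris.target)

# Label index -> species name, in the order scikit-learn stores them
label_names = {i: str(name) for i, name in enumerate(iris.target_names)}
print(label_names)

# Translate the predicted label for one tuple into a species name
sample = iris.data[:1]
predicted_names = [str(iris.target_names[i]) for i in knn.predict(sample)]
print(predicted_names)
```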

Visualizing KNN classification on the Iris data set

Regression

Regression is usually described as determining the relationship between two or more variables. For example, suppose you have to predict a person’s income based on given input data X.

Here the target variable is the unknown variable we care about predicting, and continuous means there are no gaps (discontinuities) in the values that Y can take.

Predicting income is a classic regression problem. Your input data should include all the information (called features) that can help predict income, such as how long the person has worked, their education, their position, and where they live.

Popular regression models

Some commonly used regression models are:

· Linear regression

· Logistic regression

· Polynomial regression

Linear regression uses best-fit lines (also known as regression lines) to establish a relationship between the dependent variable (Y) and one or more independent variables (X).

Mathematically, h(x_i) = β0 + β1 * x_i + e, where β0 is the intercept, β1 is the slope of the line, and e is the error term.
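For a single feature, the least-squares intercept and slope have a closed form, which the sketch below computes with NumPy. The five data points are hypothetical and were chosen to lie roughly on the line y = 1 + 2x.

```python
import numpy as np

# Hypothetical points lying roughly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Closed-form least-squares estimates for one feature
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# The residuals play the role of the error term
residuals = y - (intercept + slope * x)
print(intercept, slope)
```

With an intercept in the model, the residuals of a least-squares fit always sum to zero (up to floating-point rounding), which is a handy sanity check.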

Logistic regression is an algorithm used when the response variable is categorical. The idea of logistic regression is to find a relationship between the features and the probability of a particular outcome.

Mathematically, p(X) = 1 / (1 + exp(-(β0 + β1 * X))), where p(X) = P(Y = 1 | X). Equivalently, the log-odds log(p(X) / (1 - p(X))) equal β0 + β1 * X.
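In logistic regression the linear combination of the features is passed through the sigmoid function so that the result can be read as a probability. The sketch below shows that step in isolation; the coefficient values are hypothetical and not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -4.0, 1.0

x = np.array([0.0, 4.0, 8.0])
p = sigmoid(beta_0 + beta_1 * x)   # probability that the outcome is 1
labels = (p >= 0.5).astype(int)    # threshold the probability to get a class
print(p, labels)
```

Taking log(p / (1 - p)) recovers the linear expression beta_0 + beta_1 * x, which is why the model is linear in the log-odds even though the probability curve is S-shaped.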

Polynomial regression is a form of regression analysis in which the relationship between the independent variable X and the dependent variable y is modeled as a polynomial of degree n in X.
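A minimal sketch of a degree-2 fit, using NumPy's polyfit rather than scikit-learn to keep it short. The data is hypothetical and noise-free, generated from y = 1 - 2x + 0.5x^2, so the fit should recover those coefficients.

```python
import numpy as np

# Hypothetical noise-free data from y = 1 - 2x + 0.5x^2
x = np.linspace(-3, 3, 13)
y = 1 - 2 * x + 0.5 * x ** 2

# Fit a polynomial of degree n = 2; coefficients come back highest power first
coeffs = np.polyfit(x, y, deg=2)

# Predict at an x value that was not in the data
y_new = np.polyval(coeffs, 4.0)
print(coeffs, y_new)
```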

Solve the linear regression problem

We have a data set X and the corresponding target values Y, and we use ordinary least squares to learn a linear model that can predict a new Y for a previously unseen X with as little error as possible.

The given data is divided into a training data set and a test data set. The training set has labels (target values), so the algorithm can learn from these labeled examples. The test set is treated as unlabeled: its target values are withheld from the model, which has to predict them.

We will use a single feature for training, fit a linear regression model to the training data, and then use the test data set to predict the output.

Implement linear regression in Scikit-learn

from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
import numpy as np

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature for training
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Input data
print('Input Values')
print(diabetes_X_test)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# Predicted data
print("Predicted Output Values")
print(diabetes_y_pred)

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='red', linewidth=1)
plt.show()

Output:

Input Values
[[ 0.07786339]
 [-0.03961813]
 [ 0.01103904]
 [-0.04069594]
 [-0.03422907]
 [ 0.00564998]
 [ 0.08864151]
 [-0.03315126]
 [-0.05686312]
 [-0.03099563]
 [ 0.05522933]
 [-0.06009656]
 [ 0.00133873]
 [-0.02345095]
 [-0.07410811]
 [ 0.01966154]
 [-0.01590626]
 [-0.01590626]
 [ 0.03906215]
 [-0.0730303 ]]
Predicted Output Values
[225.9732401  115.74763374 163.27610621 114.73638965 120.80385422
 158.21988574 236.08568105 121.81509832  99.56772822 123.83758651
 204.73711411  96.53399594 154.17490936 130.91629517  83.3878227
 171.36605897 137.99500384 137.99500384 189.56845268  84.3990668 ]

Plotting (diabetes_X_test, diabetes_y_pred) gives a continuous straight line: the fitted regression line.
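Beyond looking at the plot, the fit can be summarized with numbers. The sketch below repeats the same single-feature split and scores the predictions with mean squared error and R²; these two metrics are a common choice, not the only one.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:, [2]]   # the same single feature, kept 2-dimensional
y = diabetes.target

# Same split as before: the last 20 samples are held out for testing
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

mse = mean_squared_error(y_test, y_pred)   # average squared error, lower is better
r2 = r2_score(y_test, y_pred)              # 1.0 would be a perfect fit
print(mse, r2)
```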

Closing notes

Python packages for supervised machine learning:

scikit-learn, TensorFlow, PyTorch.

This article is translated by @aike Lovely Life @Ali Yunqi Community Organization.

Supervised learning-with-Python

The tiger said eight ways.

The article is a brief translation. For more details, please refer to the original text.
