
Machine learning

Preface

With the rise of deep learning, machine learning shows up more and more often around us. We all know that machine learning is about letting machines learn from data on their own, but far less is usually said about how that learning actually happens.

This article does not involve much mathematics; it explains some principles of machine learning at a popular-science level. Several commonly used machine learning algorithms are discussed, and scikit-learn is used to implement a few small examples.

Machine learning

Machine learning starts from existing experience and uses a fixed algorithm to let the machine automatically learn a set of parameters that predicts unknown data well. The experience is our training data, the algorithm is the machine learning algorithm, and the learned parameters are our machine learning model. The focus of this article is to introduce some common algorithms.

Supervised learning

Machine learning can be divided into supervised learning and unsupervised learning; today we mainly look at algorithms related to supervised learning. Supervised learning uses a set of samples with known categories to adjust the parameters of a classifier until it reaches the required performance; it is also called supervised training or learning with a teacher. For example, if we want to train a model to predict the weather, we need a lot of data, and that data has to include both feature values and target values, where the target value is the true category.

Or if we want to train a model to predict gender, we need a set of data that includes personal information and gender.

Here’s an example:

Name    Age    Height (cm)    Weight (kg)    Gender
Zack    21     180            70             male
Alice   22     168            50             female
...

Our goal is to train a model that meets the requirements based on the above data.
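
To make this concrete, here is a minimal sketch (using just the two rows above) of how such data is usually separated into feature values and target values:

# Feature values: [age, height, weight] for each person (from the table above)
X = [
    [21, 180, 70],   # Zack
    [22, 168, 50],   # Alice
]
# Target values: the known categories we want the model to learn to predict
y = ["male", "female"]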

KNN algorithm

KNN stands for K-Nearest Neighbors. The idea behind KNN is very simple and matches common sense: find the k samples in the data that are closest to the input, look at the categories of those k samples, and take the category that appears most often as the category of the input.

For example, suppose we want to guess which area a friend lives in. Here's what we know:

Name     Coordinates    Area
zack     (0, 0)         East Lake area
alice    (6, 6)         Lake area
atom     (2, 2)         East Lake area
alex     (2, 1)         East Lake area
rudy     (7, 6)         Lake area
cloud    (2, 3)         East Lake area

Now suppose we know that nyx is at (3, 3) and we want to guess which area nyx is in. We can calculate the distance between nyx and each person with the following formula:


$$distance = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of the two people. The calculation results are as follows:

Name     Distance     Area
zack     sqrt(18)     East Lake area
alice    sqrt(18)     Lake area
atom     sqrt(2)      East Lake area
alex     sqrt(5)      East Lake area
rudy     5            Lake area
cloud    1            East Lake area
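
As a sanity check, here is a minimal pure-Python sketch (not scikit-learn) that uses the coordinates from the table above, computes these distances for nyx at (3, 3), and takes a majority vote among the k nearest neighbors:

from collections import Counter
from math import sqrt

# Known people: name -> (coordinates, area), taken from the table above
known = {
    "zack":  ((0, 0), "East Lake area"),
    "alice": ((6, 6), "Lake area"),
    "atom":  ((2, 2), "East Lake area"),
    "alex":  ((2, 1), "East Lake area"),
    "rudy":  ((7, 6), "Lake area"),
    "cloud": ((2, 3), "East Lake area"),
}

def knn_predict(point, k):
    # Sort all known people by Euclidean distance to the query point
    neighbors = sorted(
        known.values(),
        key=lambda item: sqrt((item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2),
    )
    # Majority vote over the areas of the k nearest people
    areas = [area for _, area in neighbors[:k]]
    return Counter(areas).most_common(1)[0][0]

print(knn_predict((3, 3), k=3))  # East Lake area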

If k=1, cloud is the closest to nyx, so we infer that nyx is in the East Lake area. If k=3, the three people closest to nyx are cloud, atom, and alex, and all three of them are in the East Lake area, so again we infer that nyx is in the East Lake area. In a KNN model, the value of k is a hyperparameter that has to be chosen manually. Let's look at a concrete case with scikit-learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the data set
datasets = load_iris()
# Split the data set
X_train, X_test, y_train, y_test = train_test_split(
    datasets.data,
    datasets.target,
    test_size=0.2,
    random_state=1,
)
# Create the model
knn = KNeighborsClassifier(n_neighbors=5)
# Fit the training data
knn.fit(X_train, y_train)
# Predict unknown data
y_predict = knn.predict(X_test)
# Evaluate the model
score = knn.score(X_test, y_test)
print(score)

In the code above we test on the iris data set, and the final accuracy is 100%. There are several parameters we can set when creating a KNN classifier; they are not all explained here. The most important is n_neighbors, which is the value of k in KNN. The reader can look up the remaining parameters in the scikit-learn documentation.
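
If you are not sure which k to use, one common approach (a sketch that reuses the X_train and y_train from the split above) is to try several candidate values with cross-validation:

from sklearn.model_selection import GridSearchCV

# Try several candidate k values and keep the one with the best cross-validation score
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)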

The decision tree

The idea of a decision tree is also very close to everyday life: ask a series of questions to determine the category. It is a lot like Microsoft's Mind Reader, which asks no more than 10 questions to figure out who you have in mind; mind reading is really just a multi-class classification problem. Here's a picture:

In the picture, two questions are enough to determine which of the animals above an object is.
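
For intuition, a decision tree is nothing more than a sequence of nested questions. A minimal sketch (the questions and animals here are made up purely for illustration):

def classify(animal):
    # Question 1: does it live in water?
    if animal["lives_in_water"]:
        return "fish"
    # Question 2: can it fly?
    if animal["can_fly"]:
        return "bird"
    return "mammal"

print(classify({"lives_in_water": False, "can_fly": True}))  # bird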

We call the tree formed by this questioning process a decision tree. If the sequence of questions is chosen well, we can predict quite a lot. Let's look at the code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Load the data set
datasets = load_iris()
# Split the data set
X_train, X_test, y_train, y_test = train_test_split(
    datasets.data,
    datasets.target,
    test_size=0.2,
    random_state=1,
)
# Create a decision tree model
tree = DecisionTreeClassifier()
# Train the model
tree.fit(X_train, y_train)
# Predict
y_predict = tree.predict(X_test)
# Evaluate
score = tree.score(X_test, y_test)
print(score)

A decision tree may overfit when there is some abnormal data in the data set. There are a few things we can do to reduce the risk of overfitting; for example, we can pass max_depth to limit the tree depth when creating the DecisionTreeClassifier, as in the sketch below.
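
A minimal sketch, reusing the iris split above; the parameter values are only illustrative:

# Constrain tree growth to reduce overfitting
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
tree.fit(X_train, y_train)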

Random forests

Random forests are built on top of decision trees. If we train a single decision tree directly, it easily overfits; if instead we train many decision trees and average their results to decide the final output, we usually get better results and are much less prone to overfitting. Below, we train a decision tree and a random forest on the same data set:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Load the data set
datasets = load_breast_cancer()
# Split the data set
X_train, X_test, y_train, y_test = train_test_split(
    datasets.data,
    datasets.target,
    test_size=0.2,
    random_state=1,
)
# Create and train a single decision tree and a random forest
model1 = DecisionTreeClassifier(max_depth=3)
model2 = RandomForestClassifier(max_depth=3)
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
# Evaluate both models on the test set
score1 = model1.score(X_test, y_test)
score2 = model2.score(X_test, y_test)
print(score1, score2)

Here we use the breast cancer data set and limit the maximum depth of both the decision tree and the trees in the random forest to 3. The final results are as follows:

0.9210526315789473 0.9473684210526315

As you can see, the accuracy of the random forest is higher. This is not guaranteed, but in most cases the random forest is the more stable of the two.
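
One knob worth knowing is n_estimators, the number of trees in the forest (scikit-learn's default is 100; the value below is only illustrative, reusing the split above):

# More trees usually means a more stable averaged result, at the cost of training time
forest = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))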

Naive Bayes

Naive Bayes is an efficient algorithm supported by probability theory and widely used in text classification.

The key to naive Bayes is Bayes' formula:


$$P(B_i|A) = \frac{P(B_i)P(A|B_i)}{\sum_{j=1}^{n} P(B_j)P(A|B_j)}$$

Using the formula, we can calculate the probability that $B_i$ occurs given that event A has occurred. For example, take event A to be "the word 'cloud computing' appears in the text", event $B_1$ to be "the article is about technology", and event $B_2$ to be "the article is about education".

Then we calculate the two probabilities; if $P(B_1|A)$ is greater than $P(B_2|A)$, we infer that the article is about technology.
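
As a tiny worked example (all of the probabilities below are made-up numbers, only to show how the formula is applied), suppose 60% of articles are about technology, 40% are about education, "cloud computing" appears in 20% of technology articles and in 1% of education articles:

# Made-up illustrative probabilities
p_b1 = 0.6           # P(B1): article is technology
p_b2 = 0.4           # P(B2): article is education
p_a_given_b1 = 0.2   # P(A|B1): "cloud computing" appears, given technology
p_a_given_b2 = 0.01  # P(A|B2): "cloud computing" appears, given education

# Bayes' formula: P(B1|A) = P(B1)P(A|B1) / (P(B1)P(A|B1) + P(B2)P(A|B2))
p_b1_given_a = (p_b1 * p_a_given_b1) / (p_b1 * p_a_given_b1 + p_b2 * p_a_given_b2)
print(p_b1_given_a)  # about 0.97, so we would guess the article is technology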

Let’s look at the code:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# Load the data set
datasets = load_breast_cancer()
# Split the data set
X_train, X_test, y_train, y_test = train_test_split(
    datasets.data,
    datasets.target,
    test_size=0.2,
    random_state=1,
)
# Create and train a multinomial naive Bayes model
nb = MultinomialNB()
nb.fit(X_train, y_train)
# Evaluate on the test set
score = nb.score(X_test, y_test)
print(score)

The usage is pretty much the same as in the examples above.

Support vector machine

The mathematical theory behind support vector machines is quite involved. Simply put, the algorithm finds a few key samples in feature space, called support vectors, and uses them to determine the decision boundary that separates the classes. Here is the code:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
# Load the data set
datasets = load_breast_cancer()
# Split the data set
X_train, X_test, y_train, y_test = train_test_split(
    datasets.data,
    datasets.target,
    test_size=0.2,
    random_state=1,
)
# Create and train a linear support vector classifier
svm = LinearSVC()
svm.fit(X_train, y_train)
# Evaluate on the test set
score = svm.score(X_test, y_test)
print(score)


When using a support vector machine, we need to choose a suitable kernel. With the SVC class, the kernel parameter can be set to linear, poly, rbf, or sigmoid depending on the application scenario (the LinearSVC used above is restricted to a linear kernel).
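
For example, a minimal sketch of creating an SVC with the rbf kernel, reusing the breast cancer split from above:

from sklearn.svm import SVC

# SVC accepts the kernel parameter: 'linear', 'poly', 'rbf' or 'sigmoid'
svm_rbf = SVC(kernel="rbf")
svm_rbf.fit(X_train, y_train)
print(svm_rbf.score(X_test, y_test))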

Linear regression

Linear regression is a model we are all familiar with; we meet problems like it as early as school. In word problems we are often given two data points and asked to predict others. For a function of one variable, we can write the equation:


$$y = kx + b$$

From the two given data points we can solve for k and b, and then predict the other values. The idea of linear regression is to find an equation of exactly this kind, except that there are usually many variables (features), and the parameters are not found by solving a system of equations but by fitting them to the data.
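
A tiny worked example (the two points are made up for illustration): given (1, 3) and (2, 5), substituting into y = kx + b gives k = 2 and b = 1, and we can then predict y for any other x:

# Two made-up known points: (1, 3) and (2, 5)
# y = kx + b  =>  3 = k*1 + b  and  5 = k*2 + b
k = (5 - 3) / (2 - 1)   # k = 2
b = 3 - k * 1           # b = 1
print(k * 4 + b)        # predict y at x = 4 -> 9.0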

Here is the code:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load the data set
datasets = load_breast_cancer()
# Split the data set
X_train, X_test, y_train, y_test = train_test_split(
    datasets.data,
    datasets.target,
    test_size=0.2,
    random_state=1,
)
# Create and train a linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Evaluate (R^2 score) and inspect the learned parameters
score = lr.score(X_test, y_test)
print(score)
print(lr.intercept_, lr.coef_)

After training the model, we can inspect its parameters through intercept_ and coef_.
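
As a quick check (a sketch reusing the lr and X_test from above), a prediction is just the linear combination of these parameters with the features:

# Reproduce the model's first prediction manually from coef_ and intercept_
manual = X_test[0] @ lr.coef_ + lr.intercept_
print(manual, lr.predict(X_test[:1])[0])  # the two values should match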

Logistic regression

In general, linear regression is suited to regression problems, while logistic regression adds an activation function on top of linear regression. The output of linear regression is a continuous real number, but the result of a classification problem is a discrete class label, so the output has to be mapped to a class. Logistic regression does this with the sigmoid function, which squashes the real-valued output into the range (0, 1); the result can be read as a probability and then thresholded into a class. In this way a linear model can be used to solve classification problems. The code is as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the data set
datasets = load_iris()
# Split the data set
X_train, X_test, y_train, y_test = train_test_split(
    datasets.data,
    datasets.target,
    test_size=0.2,
    random_state=1,
)
# Create and train a logistic regression model (more iterations so it converges)
lr = LogisticRegression(max_iter=3000)
lr.fit(X_train, y_train)
# Evaluate on the test set
score = lr.score(X_test, y_test)
print(score)
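For intuition, the sigmoid function that logistic regression relies on can be sketched like this (a toy illustration, not how scikit-learn implements it internally):

import math

def sigmoid(z):
    # Maps any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

print(sigmoid(-2), sigmoid(0), sigmoid(2))  # ~0.12, 0.5, ~0.88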

The neural network

In fact, a neural network can be regarded as many logistic regressions combined. A single logistic regression can be drawn as follows:

In the figure, delta is the activation function. We then combine multiple such logistic regressions to build something like this:

This is what we call a neural network. The specific code is as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
# Load the data set
datasets = load_iris()
# Split the data set
X_train, X_test, y_train, y_test = train_test_split(
    datasets.data,
    datasets.target,
    test_size=0.2,
    random_state=1,
)
# Create and train a multi-layer perceptron with two hidden layers (4 and 2 neurons)
mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=[4, 2], random_state=0)
mlp.fit(X_train, y_train)
# Evaluate on the test set
score = mlp.score(X_test, y_test)
print(score)


When creating the neural network, we control its depth and width with the hidden_layer_sizes parameter; [4, 2] means the first hidden layer has 4 neurons and the second hidden layer has 2. We can set the number and size of the hidden layers as needed; the input and output layers are not listed here, because scikit-learn sizes them automatically from the number of features and the number of categories.


This article was first published on GitChat and may not be reproduced without authorization; contact GitChat for reprinting.