Author | Chen Leihui

Introduction

Machine learning can be divided into three categories according to its tasks or applications:

1. Supervised Learning (SL). This class of algorithms uses labeled training data to learn the mapping function f from the input variable X to the output variable Y, in other words, to solve for f in the equation Y = f(X). Further, supervised learning can be subdivided into the following categories:

  • Regression: predicting a continuous value, such as rainfall or housing prices. The basic algorithm is Linear Regression.

  • Classification: predicting a label, such as “sick” or “healthy”, or what kind of animal appears in a picture. Basic algorithms include Logistic Regression, Naive Bayes, and K-Nearest Neighbors (KNN).

Note: Ensembling can also be viewed as supervised learning. It combines the predictions of several separate weak machine learning models to produce a more accurate overall prediction. Basic algorithms include Bagging with Random Forests and Boosting with XGBoost.

2. Unsupervised Learning (UL). The principle of this algorithm is to learn the underlying structure of data from unlabeled training data. Further, unsupervised learning can be subdivided into the following three categories:

  • Association: discovering how likely items are to appear together in a collection. For example, analysis of supermarket shopping baskets found that beer is often bought together with diapers (the famous “beer and diapers” story). The basic algorithm is Apriori.

  • Clustering: grouping data so that objects within a group are more similar to each other than to objects in other groups. The basic algorithm is k-means.

  • Dimensionality Reduction: reducing the number of variables in a data set while ensuring that important information is preserved. Dimensionality reduction can be achieved by feature extraction or feature selection: feature extraction transforms the data from a high-dimensional space to a low-dimensional one, while feature selection keeps a subset of the original variables. The basic algorithm is PCA.

3. Reinforcement Learning (RL) allows an agent to determine its best next action by learning which behaviors yield the greatest reward in the current environment.
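Reinforcement learning is not covered by the scikit-learn implementations below, but a minimal sketch makes the reward-driven idea above concrete. The toy chain environment, the reward placement, and the hyperparameters in this sketch are illustrative assumptions, not part of any library:

import numpy as np

# A toy deterministic chain environment with 5 states (0..4) and 2 actions
# (0 = step left, 1 = step right). Reaching state 4 yields a reward of 1 and
# ends the episode. The environment and all hyperparameters are illustrative.
n_states, n_actions = 5, 2
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(300):
    state = 0
    for step in range(100):                        # cap episode length
        q_row = q_table[state]
        if rng.random() < epsilon or np.all(q_row == q_row[0]):
            action = int(rng.integers(n_actions))  # explore (or break ties randomly)
        else:
            action = int(np.argmax(q_row))         # exploit the current estimate
        # deterministic transition and reward
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
        q_table[state, action] += alpha * (reward + gamma * q_table[next_state].max()
                                           - q_table[state, action])
        state = next_state
        if state == n_states - 1:                  # goal reached, episode ends
            break

print(np.round(q_table, 3))  # the learned values favor moving right, toward the reward

After enough episodes the "move right" action has the higher value in every state, which is exactly the behavior that maximizes the reward in this environment.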

Implementation

The algorithms listed above are simple and commonly used, and scikit-learn makes it possible to train, predict, evaluate, and visualize each model in just a few lines of code. There are already many excellent explanations of how these algorithms work, so the principles are not repeated here; only the code implementation and visualization are given.

Linear Regression

Linear regression assigns optimal weights to the variables in order to create a line, a plane, or a higher-dimensional hyperplane that minimizes the error between the predicted and true values. For the underlying principle, see the LinearRegression article on Zhihu. The following uses simple (univariate) linear regression as an example.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Linear Regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# 1. Prepare data (a single 1-dimensional feature)
lr_X_data, lr_y_data = datasets.make_regression(n_samples=500, n_features=1, n_targets=1, noise=2)

# 2. Split into training and test sets
lr_X_train, lr_X_test, lr_y_train, lr_y_test = train_test_split(lr_X_data, lr_y_data, test_size=0.3)

# 3. Train the model
lr_model = linear_model.LinearRegression()
lr_model.fit(lr_X_train, lr_y_train)

# 4. Predict on the test set
lr_y_pred = lr_model.predict(lr_X_test)

# 5. Evaluate with mean squared error
lr_mse = mean_squared_error(lr_y_test, lr_y_pred)
print("mse:", lr_mse)

# 6. Visualize
plt.figure('Linear Regression')
plt.title('Linear Regression')
plt.scatter(lr_X_test, lr_y_test, color='lavender', marker='o')
plt.plot(lr_X_test, lr_y_pred, color='pink', linewidth=3)
plt.show()

# print info
# mse: 4.131366697554779

Logistic Regression

Despite the word “regression” in its name, logistic regression is actually a binary classification algorithm. It fits the data to a logistic (sigmoid) function, which is where the name comes from. Put simply, given a set of variables, it uses the logistic function to estimate the probability of an event, producing an output between 0 and 1. For the underlying principle, see the Logistic Regression explainer on Zhihu. The code implementation is given below.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
# Logistic Regression
from sklearn import linear_model

# 1. Prepare data: 1-dimensional feature, binary label (x > 0)
np.random.seed(123)
logit_X_data = np.random.normal(size=1000)
logit_y_data = (logit_X_data > 0).astype(float)
logit_X_data[logit_X_data > 0] *= 5
logit_X_data += .4 * np.random.normal(size=1000)
logit_X_data = logit_X_data[:, np.newaxis]

# 2. Split into training and test sets
logit_X_train, logit_X_test, logit_y_train, logit_y_test = train_test_split(
    logit_X_data, logit_y_data, test_size=0.3)

# 3. Train the classifier
logit_model = linear_model.LogisticRegression()
logit_model.fit(logit_X_train, logit_y_train)

# 4. Predict on the test set
logit_y_pred = logit_model.predict(logit_X_test)

# 5. Evaluate with accuracy
logit_acc = logit_model.score(logit_X_test, logit_y_test)
print("accuracy:", logit_acc)

# 6. Visualize: plot the fitted sigmoid and, for comparison, a linear regression fit
logit_X_view = np.linspace(-5, 10, 300)[:, np.newaxis]  # plotting range covering the data
def model(x):
    return 1 / (1 + np.exp(-x))
loss = model(logit_X_view * logit_model.coef_ + logit_model.intercept_).ravel()
plt.figure('Logistic Regression')
plt.title('Logistic Regression')
plt.scatter(logit_X_train.ravel(), logit_y_train, color='lavender', zorder=17)
plt.plot(logit_X_view, loss, color='pink', linewidth=3)
lr_model = linear_model.LinearRegression()
lr_model.fit(logit_X_train, logit_y_train)
plt.plot(logit_X_view, lr_model.predict(logit_X_view), color='blue', linewidth=3)
plt.legend(('Logistic Regression', 'Linear Regression'), loc='lower right', fontsize='small')
plt.show()

# print info
# accuracy: 1.0

Naive Bayes

Naive Bayes is a classification method based on Bayes’ theorem that assumes every feature is independent of the others within a class. The model is not only very simple but often performs better than highly complex classification methods. For the underlying principle, see the Naive Bayes summary by Liu Jianping (Pinard). The code implementation is given below.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
# Naive Bayes, classification task with n_classes=4
import sklearn.naive_bayes as nb

# 1. Prepare data
nb_X_train, nb_y_train = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                             random_state=1, n_clusters_per_class=1, n_classes=4)

# 2. Build a grid of points covering the training data as the test set
l, r = nb_X_train[:, 0].min() - 1, nb_X_train[:, 0].max() + 1
b, t = nb_X_train[:, 1].min() - 1, nb_X_train[:, 1].max() + 1
n = 1000
grid_x, grid_y = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
nb_X_test = np.column_stack((grid_x.ravel(), grid_y.ravel()))

# 3. Train the model (Gaussian Naive Bayes for continuous features)
nb_model = nb.GaussianNB()
nb_model.fit(nb_X_train, nb_y_train)

# 4. Predict over the grid
nb_y_pred = nb_model.predict(nb_X_test)

# 5. Visualize the decision regions and the training points
grid_z = nb_y_pred.reshape(grid_x.shape)
plt.figure('Naive Bayes')
plt.title('Naive Bayes')
plt.pcolormesh(grid_x, grid_y, grid_z, cmap='Blues')
plt.scatter(nb_X_train[:, 0], nb_X_train[:, 1], s=30, c=nb_y_train, cmap='pink')
plt.show()

K-Nearest Neighbors

K-Nearest Neighbors is a machine learning algorithm for classification and regression (mainly classification). It compares distances between points, typically using the Euclidean distance, and assigns each point to the group containing its closest points: a sample is classified by the majority vote of its k nearest neighbors. For the underlying principle, see the KNN summary by Liu Jianping (Pinard). The code implementation is given below.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
# K-Nearest Neighbors, classification task with n_classes=4
from sklearn.neighbors import KNeighborsClassifier

# 1. Prepare data
knn_X_train, knn_y_train = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                               random_state=1, n_clusters_per_class=1, n_classes=4)

# 2. Build a grid of points covering the training data as the test set
l, r = knn_X_train[:, 0].min() - 1, knn_X_train[:, 0].max() + 1
b, t = knn_X_train[:, 1].min() - 1, knn_X_train[:, 1].max() + 1
n = 1000
grid_x, grid_y = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
knn_X_test = np.column_stack((grid_x.ravel(), grid_y.ravel()))

# 3. Train the model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(knn_X_train, knn_y_train)

# 4. Predict over the grid
knn_y_pred = knn_model.predict(knn_X_test)

# 5. Visualize the decision regions and the training points
grid_z = knn_y_pred.reshape(grid_x.shape)
plt.figure('K-Nearest Neighbors')
plt.title('K-Nearest Neighbors')
plt.pcolormesh(grid_x, grid_y, grid_z, cmap='Blues')
plt.scatter(knn_X_train[:, 0], knn_X_train[:, 1], s=30, c=knn_y_train, cmap='pink')
plt.show()

Decision Tree

A decision tree is traversed by comparing important features against the conditional statement at each node; whether a sample falls into the left or the right child depends on the outcome. In general, more important features appear closer to the root, and decision trees can handle both discrete and continuous variables. For the underlying principle, see the decision tree article by Yizhen on Zhihu. The code implementation is given below.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
# Decision Tree, classification task with n_classes=4
from sklearn.tree import DecisionTreeClassifier

# 1. Prepare data
dt_X_train, dt_y_train = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                             random_state=1, n_clusters_per_class=1, n_classes=4)

# 2. Build a grid of points covering the training data as the test set
l, r = dt_X_train[:, 0].min() - 1, dt_X_train[:, 0].max() + 1
b, t = dt_X_train[:, 1].min() - 1, dt_X_train[:, 1].max() + 1
n = 1000
grid_x, grid_y = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
dt_X_test = np.column_stack((grid_x.ravel(), grid_y.ravel()))

# 3. Train the model
dt_model = DecisionTreeClassifier(max_depth=4)
dt_model.fit(dt_X_train, dt_y_train)

# 4. Predict over the grid
dt_y_pred = dt_model.predict(dt_X_test)

# 5. Visualize the decision regions and the training points
grid_z = dt_y_pred.reshape(grid_x.shape)
plt.figure('Decision Tree')
plt.title('Decision Tree')
plt.pcolormesh(grid_x, grid_y, grid_z, cmap='Blues')
plt.scatter(dt_X_train[:, 0], dt_X_train[:, 1], s=30, c=dt_y_train, cmap='pink')
plt.show()

Random Forest

A random forest is a collection of decision trees. Each tree is built on a random sample of the data points and split on random subsets of the features, and each tree produces a classification; the class with the most votes across the forest becomes the final prediction. For the underlying principle, see the random forest explainer by the Data Science Institute of Tsinghua University on Zhihu. The code implementation is given below.

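A minimal sketch in the style of the other classifiers, using scikit-learn's RandomForestClassifier; the hyperparameters below (n_estimators=100, max_depth=4, random_state=1) are illustrative choices rather than tuned values.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
# Random Forest, classification task with n_classes=4
from sklearn.ensemble import RandomForestClassifier

# 1. Prepare data
rf_X_train, rf_y_train = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                              random_state=1, n_clusters_per_class=1, n_classes=4)

# 2. Build a grid of points covering the training data as the test set
l, r = rf_X_train[:, 0].min() - 1, rf_X_train[:, 0].max() + 1
b, t = rf_X_train[:, 1].min() - 1, rf_X_train[:, 1].max() + 1
n = 1000
grid_x, grid_y = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
rf_X_test = np.column_stack((grid_x.ravel(), grid_y.ravel()))

# 3. Train the model (illustrative hyperparameters)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=1)
rf_model.fit(rf_X_train, rf_y_train)

# 4. Predict over the grid
rf_y_pred = rf_model.predict(rf_X_test)

# 5. Visualize the decision regions and the training points
grid_z = rf_y_pred.reshape(grid_x.shape)
plt.figure('Random Forest')
plt.title('Random Forest')
plt.pcolormesh(grid_x, grid_y, grid_z, cmap='Blues')
plt.scatter(rf_X_train[:, 0], rf_X_train[:, 1], s=30, c=rf_y_train, cmap='pink')
plt.show()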

Support Vector Machines

SVM maps the data to points in space so that points of different categories are separated by the widest possible margin. Data to be predicted is mapped into the same space, and its category is determined by which side of the margin it falls on. For the underlying principle, see the SVM article by SMON on Zhihu. The code implementation is given below.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
# SVM, classification task with n_classes=4
from sklearn import svm

# 1. Prepare data
svm_X_train, svm_y_train = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                               random_state=1, n_clusters_per_class=1, n_classes=4)

# 2. Build a grid of points covering the training data as the test set
l, r = svm_X_train[:, 0].min() - 1, svm_X_train[:, 0].max() + 1
b, t = svm_X_train[:, 1].min() - 1, svm_X_train[:, 1].max() + 1
n = 1000
grid_x, grid_y = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
svm_X_test = np.column_stack((grid_x.ravel(), grid_y.ravel()))

# 3. Train the model (RBF kernel; small C means strong regularization)
svm_model = svm.SVC(kernel='rbf', gamma=1, C=0.0001)
svm_model.fit(svm_X_train, svm_y_train)

# 4. Predict over the grid
svm_y_pred = svm_model.predict(svm_X_test)

# 5. Visualize the decision regions and the training points
grid_z = svm_y_pred.reshape(grid_x.shape)
plt.figure('SVM')
plt.title('SVM')
plt.pcolormesh(grid_x, grid_y, grid_z, cmap='Blues')
plt.scatter(svm_X_train[:, 0], svm_X_train[:, 1], s=30, c=svm_y_train, cmap='pink')
plt.show()

K-Means

K-means partitions the data into k clusters so that each data point belongs to the cluster whose mean (i.e. the cluster center, or centroid) is nearest to it. As a result, highly similar data objects end up in the same cluster while dissimilar objects end up in different clusters. For the underlying principle, see the K-means explainer on Zhihu. The code implementation is given below.

import matplotlib.pyplot as plt
import numpy as np
# K-Means, clustering into 5 clusters
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 1. Prepare data: 500 points drawn around 5 centers
kmeans_X_data, kmeans_y_data = make_blobs(n_samples=500, centers=5, cluster_std=0.60, random_state=0)

# 2. Train the model
kmeans_model = KMeans(n_clusters=5)
kmeans_model.fit(kmeans_X_data)

# 3. Predict cluster assignments
kmeans_y_pred = kmeans_model.predict(kmeans_X_data)

# 4. Visualize the clusters and their centers
plt.figure('K-Means')
plt.title('K-Means')
plt.scatter(kmeans_X_data[:, 0], kmeans_X_data[:, 1], c=kmeans_y_pred, s=50, cmap='viridis')
centers = kmeans_model.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=80, marker='x')
plt.show()

PCA

As the name implies, PCA helps us find the principal components of the data, which are essentially linearly uncorrelated vectors; selecting the top k principal components to represent the data achieves dimensionality reduction. For the underlying principle, see “How to explain PCA (principal component analysis) simply?” on Zhihu. The code implementation is given below.

import matplotlib.pyplot as plt
import numpy as np
# PCA
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# 1. Prepare data: the iris dataset (4 features, 3 classes)
pca_data = load_iris()
pca_X_data = pca_data.data
pca_y_data = pca_data.target

# 2. Build the model, keeping 2 principal components
pca_model = PCA(n_components=2)

# 3. Fit and project the data into the 2-dimensional space
reduced_X = pca_model.fit_transform(pca_X_data)

# 4. Visualize, one color per iris class
red_x, red_y = [], []
blue_x, blue_y = [], []
green_x, green_y = [], []
for i in range(len(reduced_X)):
    if pca_y_data[i] == 0:
        red_x.append(reduced_X[i][0])
        red_y.append(reduced_X[i][1])
    elif pca_y_data[i] == 1:
        blue_x.append(reduced_X[i][0])
        blue_y.append(reduced_X[i][1])
    else:
        green_x.append(reduced_X[i][0])
        green_y.append(reduced_X[i][1])
plt.figure('PCA')
plt.title('PCA')
plt.scatter(red_x, red_y, c='r')
plt.scatter(blue_x, blue_y, c='b')
plt.scatter(green_x, green_y, c='g')
plt.show()
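To check how much of the original information the two retained components actually preserve, PCA exposes the fraction of variance explained by each component. A short check, continuing from the variables above (the values in the comments are approximate for the iris data):

# Fraction of the total variance captured by each retained component
print(pca_model.explained_variance_ratio_)        # roughly [0.92, 0.05] for iris
print(pca_model.explained_variance_ratio_.sum())  # about 0.97, i.e. ~97% of the variance is kept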

Conclusion

This article has presented implementations of nine common machine learning algorithms; readers can deepen their understanding of them through practical cases. Competition platforms such as Kaggle and Alibaba Cloud Tianchi are good ways to gain project experience.