Big data requires basic algorithms

Preface

Mathematics is like an octopus: it has tentacles that reach into almost every subject. Most people study it systematically in school but use it only to sharpen their logical thinking, without deeper research or application. If you want to pursue math-related research or a math-related career, however, you will have to work hard at it. And if you have completed a math degree or another technical degree, you probably already know whether everything you learned was really necessary.

You may be wondering: how much math does it take to do big data? In this article, we take a brief look at the basic algorithms you need to master for big data.

There are hundreds of machine learning algorithms. Covering each type in depth is beyond the scope of this article; instead, it discusses the math you need to know for the following common algorithms:

  • Naive Bayes
  • Linear Regression
  • Logistic Regression
  • K-means clustering
  • Decision Trees

Naive Bayes’ Classifiers

The naive Bayes classifier is a classification algorithm based on Bayes' theorem.

Bayes' theorem

Bayes' theorem lets us calculate the probability of one event based on the probability of another. Mathematically, it can be written as follows:

P(A|B) = P(A) · P(B|A) / P(B)

[Note]: A and B are both events, and P(B) is not 0.

The formula above looks a little complicated, but we can break it down: as long as we know that event B (the evidence) is true, we can work out the probability of event A.

  • P(A|B) is a conditional probability: the probability that event A occurs given that B is true.
  • P(B|A) is also a conditional probability: the probability that event B occurs given that A is true.
  • P(A) is the prior probability of event A (i.e., the probability before the evidence is seen); the evidence is an attribute value of the observed event (in this case, event B). In a nutshell, P(A) and P(B) are the probabilities of observing A and B independently of each other.

We will use examples to deepen our understanding.

Case

People often use Bayes' theorem unconsciously in daily life, as in the following example.

Let's work out why many people assume that a person from the Northeast is a heavy drinker.

The conditions are known:

  • P(A)= Probability of meeting a heavy drinker.
  • P(B)= Probability of meeting people from the northeast.
  • P(B|A) = probability that a heavy drinker you meet is from the Northeast.

The probability that a person from the Northeast is a heavy drinker is then P(A|B) = P(A) · P(B|A) / P(B): the probability of meeting a heavy drinker, times the probability that a heavy drinker is from the Northeast, divided by the probability of meeting someone from the Northeast.

From the formula above, we can also see how to reduce the prejudice that northeasterners are heavy drinkers:

  • Reduce P(A) or P(B|A): the probability of meeting a heavy drinker, or the probability that a heavy drinker is from the Northeast (both are hard to control).
  • Increase P(B): the probability of meeting people from the Northeast (for example, by going to the Northeast, where there are many of them).
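
For concreteness, here is a tiny worked calculation in Python; the probabilities are made-up numbers chosen only to show how the formula is applied, not real statistics.

p_a = 0.05          # P(A): probability of meeting a heavy drinker (hypothetical)
p_b = 0.10          # P(B): probability of meeting someone from the Northeast (hypothetical)
p_b_given_a = 0.30  # P(B|A): probability a heavy drinker is from the Northeast (hypothetical)

# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_a_given_b = p_a * p_b_given_a / p_b
print(p_a_given_b)  # 0.15 -- probability that a northeasterner is a heavy drinker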

That was an everyday application; what about applications in big data processing and machine learning? Consider the following example.

Spam classification

Suppose we have 100,000 emails, each of which has been labeled as spam or not. From these data we can calculate:

  • P(A) = probability that an email is spam: spam emails / all emails.
  • P(B) = probability that word M appears in an email: emails containing M / all emails.
  • P(B|A) = probability that word M appears in spam: spam emails containing M / all spam emails.

Then the probability that an email containing word M is spam is P(A|B) = P(A) · P(B|A) / P(B): the probability of spam, times the probability that word M appears in spam, divided by the probability that word M appears in an email.

The learning process for spam filtering is the process of computing P(A|B). Multiple words or combinations of words are generally tried until words M1, M2, etc. are found whose probabilities exceed a chosen threshold (e.g., 0.7, 0.88, 0.93). The resulting words can then be used to decide whether a new email is spam.
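
A minimal sketch of this calculation, assuming hypothetical mail counts (the numbers below are invented for illustration):

total_mails = 100_000
spam_mails = 20_000          # mails labeled as spam (hypothetical)
mails_with_m = 5_000         # mails containing word M (hypothetical)
spam_mails_with_m = 4_000    # spam mails containing word M (hypothetical)

p_a = spam_mails / total_mails                # P(A): probability a mail is spam
p_b = mails_with_m / total_mails              # P(B): probability word M appears
p_b_given_a = spam_mails_with_m / spam_mails  # P(B|A): word M appears in spam

# P(A|B): probability that a mail containing word M is spam
p_a_given_b = p_a * p_b_given_a / p_b
print(p_a_given_b)  # 0.8 with these counts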

Linear Regression

Linear regression describes the relationship in the data with a straight line as accurately as possible, so that when new data arrives you can predict a simple numeric value.

Linear regression model

In machine learning, the mathematical function is called a model. In the case of linear regression, the model can be expressed as:

y = a1·x1 + a2·x2 + … + an·xn

where a1, a2, …, an are the parameter values specific to the data set, x1, x2, …, xn are the feature columns we choose to use in the model, and y is the target column. The goal of linear regression is to find the optimal parameter values that best describe the relationship between the feature columns and the target column. In other words: find the line that best fits the data, so that you can extrapolate the trend line to predict future results. To find the optimal parameters of a linear regression model, we minimize the sum of the squares of the model's residuals.
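
As an illustration of minimizing the sum of squared residuals, here is a small sketch that fits a one-feature line with NumPy's least-squares solver; the data values are made up, and this is not the method used in the case study below, which relies on scikit-learn.

import numpy as np

# Hypothetical data: one feature column x and a target column y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Least squares: choose parameters that minimize sum((y - X @ params) ** 2)
params, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = params
print(intercept, slope)  # roughly 0.1 and 2.0 for this data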

The basic steps of building a regression model

  • Determine the independent and dependent variables according to the prediction target
  • Draw a scatter plot to determine the type of regression model
  • Estimate the model parameters and establish the regression model
  • Test the regression model
  • Use the regression model to make predictions

The following example is about a simple linear regression model.

Simple linear regression – a single linear regression equation.

y=a+bx+e

  • y: dependent variable
  • x: independent variable
  • a: constant term (the y-intercept of the regression line)
  • b: regression coefficient (the slope of the regression line)
  • e: random error (the influence of random factors on the dependent variable)

Case study – height and weight

As shown below, we randomly obtained the heights and weights of some male students:

Serial number Height (cm) Weight (kg)
1 165 60
2 170 64
3 172 66
4 177 68
5 180 70
6 157 55
. . .

Find the regression equation that predicts a boy's weight from his height, and use it to predict the weight of a boy who is 173 cm tall. The scatter points are as follows.

Solution: 1. Choose height as the independent variable x and weight as the dependent variable y, and draw the scatter plot:

  • The scatter plot shows a good linear correlation between height and weight, so a linear regression equation can be used to describe the relationship between them.

  • Regression equation: y = 0.603x - 38.623
  • Plugging in x = 173 gives a predicted weight of 0.603 × 173 - 38.623 ≈ 65.7 kg

Python source code is as follows:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression

if __name__ == "__main__":
    data = pd.read_csv('./linear.csv')  # columns: height, weight
    print(data)
    x = data[['height', 'weight']]
    # Draw a scatter plot and compute the correlation coefficients of height and weight
    plt.scatter(data.height, data.weight)
    data.corr()
    print(data.corr())
    # Estimate model parameters and establish regression model
    # First import the LinearRegression class for simple linear regression
    # Then use this class to build the model, obtaining the model variable lrModel
    lrModel = LinearRegression()
    # Select the independent and dependent variables
    x = data[['height']]
    y = data[['weight']]
    plt.xlabel('X')
    # Set the y label
    plt.ylabel('Y')
    # Call the fit method of the model to train it
    # This training process is the process of parameter solving
    # And fit the model
    lrModel.fit(x, y)
    # Test the regression model
    lrModel.score(x, y)
    print(lrModel.score(x, y))
    # Use regression model for prediction
    lrModel.predict([[160], [170]])
    print(lrModel.predict([[160], [170]]))
    # check the intercept
    alpha = lrModel.intercept_[0]
    # query parameters
    beta = lrModel.coef_[0][0]

    test = alpha + beta * np.array([167, 170])
    print(alpha, beta, test)
    y_test = beta * x + alpha

    # plt.plot(x, y, 'g-', linewidth=6, label=' real data ')
    plt.plot(x, y_test, 'r-', linewidth=2, label='Forecast data')
    plt.show()

Summary

Modeling characteristics
  • Fast modeling: no very complex computation is needed, and it remains fast even with a large amount of data.
  • Each variable can be understood and interpreted through its coefficient.
  • It is sensitive to outliers.
Modeling steps – the sklearn modeling process
  • lrModel = sklearn.linear_model.LinearRegression()
  • lrModel.fit(x, y)
  • lrModel.score(x, y)
  • lrModel.predict(x)

Logistic Regression

Logistic regression looks like a linear regression algorithm, but it differs from ordinary linear regression: it produces only two prediction results, true (1) and false (0). Despite its name, logistic regression is therefore a linear model for classification rather than regression, and it is well suited to classifying data.

In order to map the fitted results onto 1 and 0, we need to construct a function whose output is bounded between 0 and 1. The fitting function used by the logistic regression algorithm is called the sigmoid function; it is a smooth function whose output always lies between 0 and 1, and it is also called the logistic function. It is expressed as follows:

y = 1 / (1 + e^(-x))

  • y is the decision value
  • x is the feature value
  • e is the base of the natural logarithm

So why does the sigmoid function always return a value between 0 and 1? Recall from algebra that raising a number to a negative exponent is the same as raising its reciprocal to the corresponding positive exponent: e^(-x) is always positive, so the denominator 1 + e^(-x) is always greater than 1, which keeps y strictly between 0 and 1.

We can draw its function graph in Python

As can be seen from the figure, the range of y is (0, 1). Objects whose decision-function value is greater than or equal to 0.5 can therefore be classified as positive samples, and objects whose decision-function value is less than 0.5 as negative samples. In this way the sample data can be split into two classes.

The code for the image above is as follows:

import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    # return sigmoid directly
    return 1 / (1 + np.exp(-x))

if __name__ == '__main__':
    # np.arange params: start, stop, step
    x = np.arange(-10, 10, 0.2)
    y = sigmoid(x)
    plt.plot(x, y, 'r-', linewidth=2)
    plt.show()
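
Using a sigmoid like the one defined above, the 0.5 threshold mentioned earlier turns the smooth output into a binary class label. A minimal sketch, with hypothetical decision-function values:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Some example decision-function values (hypothetical)
scores = np.array([-3.0, -0.5, 0.0, 0.7, 2.5])
probs = sigmoid(scores)

# Values >= 0.5 are classified as positive (1), the rest as negative (0)
labels = (probs >= 0.5).astype(int)
print(probs)   # [0.047 0.378 0.5 0.668 0.924] (approximately)
print(labels)  # [0 0 1 1 1]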

The purpose of logistic regression is to find the best-fitting parameters for the nonlinear sigmoid function; the fitting is carried out by an optimization algorithm. The advantages of this algorithm are that it is easy to understand and implement and its computational cost is low.

Case 1 – the iris dataset

The iris dataset is perhaps the best-known dataset in pattern recognition.

The dataset contains three iris categories, with 50 samples per category. One category is linearly separable from the other two, while the other two are not linearly separable from each other.

The dataset consists of 150 rows, one sample per row. Each sample has 5 fields:

  • Sepal length
  • Sepal width
  • Petal length
  • Petal width
  • Category (3 classes)
    • Iris setosa
    • Iris versicolor
    • Iris virginica

Let’s look at the implementation of the code

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

if __name__ == "__main__":
    path = './iris.data'  # Data file path
    data = pd.read_csv(path, header=None)
    data[4] = pd.Categorical(data[4]).codes
    x, y = np.split(data.values, (4,), axis=1)
    # print 'x = \n', x
    # print 'y = \n', y
    # Use only the first two feature columns
    x = x[:, :2]
   
    # y.ravel() flattens y into a one-dimensional array
    lr = LogisticRegression(C=1e5)
    # Pipeline optimizes the reuse of logistic regression accuracy parameter sets on new data sets (such as test sets)
    # Pipelining is more of an innovation of programming skills than of algorithms.
    # Steps that can be placed in a Pipeline may have (1) features that need to be standardized as the first step
    # (2) Since it is a classifier, classifier is also necessary, naturally, it is the last link
    # (Optionally) PolynomialFeatures can generate polynomial feature combinations
    # lr = Pipeline([('sc', StandardScaler()),
    # ('poly', PolynomialFeatures(degree=1)),
    # ('clf', LogisticRegression())])
    lr.fit(x, y.ravel())
    y_hat = lr.predict(x)
    y_hat_prob = lr.predict_proba(x)
    np.set_printoptions(suppress=True)
    print('y_hat = \n', y_hat)
    print('y_hat_prob = \n', y_hat_prob)
    print('Accuracy: %.2f%%' % (100*np.mean(y_hat == y.ravel())))
    # drawing
    N, M = 200, 200     # How many values are sampled horizontally and vertically
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()   # The range of column 0
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()   # The range of column 1
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    x1, x2 = np.meshgrid(t1, t2)                    # Generate grid sampling points
    print(x1, x2)
    x_test = np.stack((x1.flat, x2.flat), axis=1)   # test points


    # mpl.rcParams['font.sans-serif'] = ['simHei']
    # mpl.rcParams['axes.unicode_minus'] = False
    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    y_hat = lr.predict(x_test)                  # predicted
    y_hat = y_hat.reshape(x1.shape)                 # make it the same shape as the input
    print("y_hat", y_hat)
    plt.figure(facecolor='w')
    # Plot the grid matrices x1 and x2 with the corresponding predictions y_hat;
    # the output is three colored blocks whose distribution represents the classified regions
    plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)     # Display of predicted values
    # Scatter plot of the samples
    plt.scatter(x[:, 0], x[:, 1], c=y.flat, edgecolors='k', s=50, cmap=cm_dark)
    plt.xlabel(u'Sepal length', fontsize=14)
    plt.ylabel(u'Sepal width', fontsize=14)
    print(x1_min, x1_max, x2_min, x2_max)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.grid()
    patchs = [mpatches.Patch(color='#77E0A0', label='Iris-setosa'),
              mpatches.Patch(color='#FF8080', label='Iris-versicolor'),
              mpatches.Patch(color='#A0A0FF', label='Iris-virginica')]
    plt.legend(handles=patchs, fancybox=True, framealpha=0.8)
    # plt.title(u'Logistic regression classification of the iris data (standardized)', fontsize=17)
    plt.show()

Partial code parsing
  • Drawing the data scatter points
# The first way (DD here refers to the raw sample data read from the file)
# Get the first column of data
X = [x[0] for x in DD]
# Get the second column of data
Y = [x[1] for x in DD]
plt.scatter(X[:50], Y[:50], color='red', marker='o', label='setosa')             # first 50 samples
plt.scatter(X[50:100], Y[50:100], color='blue', marker='x', label='versicolor')  # middle 50 samples
plt.scatter(X[100:], Y[100:], color='green', marker='+', label='virginica')      # last 50 samples

# The second way
plt.scatter(x[:, 0], x[:, 1], c=y.flat, edgecolors='k', s=50, cmap=cm_dark)
  • lr = LogisticRegression(C=1e5) and lr.fit(x, y.ravel()) initialize the LogisticRegression model and train it.
  • x1_min, x1_max = x[:, 0].min(), x[:, 0].max()

    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()

    The two columns of iris data correspond to sepal length and sepal width, so the coordinates of each point are (x, y). First take the minimum and maximum of the first column (length) of the two-dimensional array x to generate one array, then take the minimum and maximum of the second column (width) to generate another, and finally use the meshgrid function to generate the two grid matrices x1 and x2, shown (abbreviated) below:

    [[4.3       4.31809045 4.3361809  ... 7.8638191  7.88190955 7.9       ]
     ...
     [4.3       4.31809045 4.3361809  ... 7.8638191  7.88190955 7.9       ]]

    [[2.        2.         2.         ... 2.         2.         2.        ]
     [2.0120603 2.0120603  2.0120603  ... 2.0120603  2.0120603  2.0120603 ]
     ...
     [4.4       4.4        4.4        ... 4.4        4.4        4.4       ]]

  • ravel() converts a matrix into a one-dimensional array; here the grid matrices x1 and x2 are flattened (via .flat) to build the test points, and y.ravel() flattens the label column.

  • y_hat = y_hat.reshape(x1.shape) reshapes the predictions to the shape of the grid built from the two features (length and width). The output looks like this (abbreviated):

    [[1. 1. 1. ... 2. 2. 2.]
     [1. 1. 1. ... 2. 2. 2.]
     ...
     [0. 0. 0. ... 2. 2. 2.]
     [0. 0. 0. ... 2. 2. 2.]]
  • lr.predict(x_test) runs the prediction on the grid points.

  • plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)

    This function plots the grid matrices x1 and x2 together with the corresponding predicted values y_hat; the output is three colored blocks whose distribution represents the classified regions.

We also recommend studying the latest examples on the scikit-learn open-source documentation site.

Case 2

Logistic regression example

If you want to dig deeper into the concepts, I recommend studying probability theory as well as discrete mathematics or real analysis.

K-means Clustering

K-means clustering is an unsupervised machine learning algorithm used to group unlabeled data (i.e., data with no defined categories or groups). The algorithm works by looking for groups in the data, where the number of groups is represented by the variable k. It then iterates through the data, assigning each data point to one of the k groups based on the provided features. K-means clustering relies on the concept of distance throughout the algorithm to "assign" data points to a cluster. If you are not familiar with the concept of distance, it refers to the amount of space between two given items. In mathematics, any function that describes the distance between any two elements of a set is called a distance function or metric.

Definition

Clustering is the process of grouping and organizing data members that are similar in some respect. Clustering is a technique for discovering such internal structure, and it is often referred to as unsupervised learning.

K-means is the best-known partition-based clustering algorithm, and thanks to its simplicity and efficiency it has become the most widely used of all clustering algorithms. Given a set of data points and the required number of clusters k (specified by the user), the k-means algorithm repeatedly divides the data into k clusters according to a distance function.

Two common metrics are the Euclidean metric and the taxicab (Manhattan) metric.
Euclidean metric definition

The Euclidean metric between two points (x1, y1) and (x2, y2) is defined as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Taxicab (Manhattan) metric

The taxicab metric is defined as:

d = |x2 - x1| + |y2 - y1|

These are not that complicated; in fact, you only need addition, subtraction, and the basics of algebra to understand the distance formulas. But to get a clear picture of the kind of geometry each metric corresponds to, I recommend a geometry class that covers both Euclidean and non-Euclidean geometry. To gain an in-depth understanding of metrics and metric spaces, you would need to study mathematical analysis and take a real analysis course.
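
A minimal sketch of the two distance functions for points (x1, y1) and (x2, y2):

import math

def euclidean_distance(p, q):
    # Straight-line distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def taxicab_distance(p, q):
    # Manhattan distance: |x2 - x1| + |y2 - y1|
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean_distance(p, q))  # 5.0
print(taxicab_distance(p, q))    # 7.0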

Algorithm

The algorithm proceeds as follows (a minimal code sketch follows the list):

  • First, randomly select k objects as the initial cluster centers.
  • Calculate the distance between each object and each seed cluster center, and assign each object to the cluster center closest to it.
  • Cluster centers and the objects assigned to them represent a cluster.
  • Once all objects have been assigned, the cluster center of each cluster is recalculated based on the existing objects in the cluster.
  • Repeat the above steps until a termination condition is met, which could be one of the following:
    • No (or a minimal number of) objects are reassigned to different clusters.
    • No (or a minimal number of) cluster centers change again.
    • The sum of squared errors reaches a local minimum.
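
A minimal sketch of these steps with NumPy (random initialization, assignment by Euclidean distance, center update); the two-dimensional data points are made up:

import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k objects as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Step 2: assign each point to its nearest center (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each cluster center from the points assigned to it
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Made-up 2-D data with two loose groups
data = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                 [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)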

Decision Trees

The basic concept

A decision tree is a flowchart-like tree structure that uses a branching approach to illustrate every possible outcome of a decision. Each node in the tree represents a test on a particular variable, and each branch is an outcome of that test.

Composition

  • Decision nodes represent a choice among several possible schemes, i.e., the point where the best scheme is ultimately selected. In a multi-level decision there can be multiple decision points in the middle of the tree; the decision point at the root corresponds to the final decision scheme.
  • State (chance) nodes represent the economic effect (expected value) of the alternative schemes; by comparing the economic effect at each state node, the best scheme can be chosen according to some decision criterion. The branches leading out of a state node are called probability branches; their number reflects the number of possible natural states, and each branch is marked with the probability of its state occurring.
  • Result (leaf) nodes hold the profit or loss obtained by each scheme under each natural state, marked at the right end of the node.

Decision trees rely on information theory to determine how they are constructed. In information theory, the more one already knows about a subject, the less new information one can learn from it. One of the key measures in information theory is called entropy.

About entropy

Entropy originated in physics and is used to measure the disorder of a thermodynamic system.

Information entropy comes from Shannon's work in information theory (Shannon is a big name!). In information theory, entropy measures the amount of information and is a measure of the uncertainty of a random variable: the greater the entropy, the greater the uncertainty.

Entropy can be written as:

Entropy(X) = -Σ P(x) · log2(P(x))

I recommend this link, which explains it in detail with a decision tree demo.

  • P(x) is the probability of occurrence of a feature value x in the data set. From the definition, 0 ≤ Entropy(X) ≤ log(n). When the random variable takes only two values, i.e. the distribution of X is P(X=1) = p, P(X=0) = 1 - p, with 0 ≤ p ≤ 1, the entropy becomes Entropy(X) = -p·log2(p) - (1 - p)·log2(1 - p).

The higher the entropy, the more mixed the data is: the more possible values a variable can take (regardless of what the specific values are, only their kinds and probabilities matter), the more information it carries. Entropy is a very important concept in information theory, and many machine learning algorithms make use of it. Note that any base b can be used for the logarithm, though 2, e, and 10 are the usual choices. You may also notice a symbol that looks like Σ: this is the summation sign, meaning the function after it is added up repeatedly, as many times as the lower and upper limits of the sum dictate.

After calculating entropy, we can use the information gain to build a decision tree, choosing the split that reduces entropy the most. The information gain formula is as follows:

Gain(D, A) = Entropy(D) - Σ_v (|D_v| / |D|) · Entropy(D_v)

where D is the data set, A is the attribute used for the split, and D_v is the subset of D on which attribute A takes its v-th value.

  • 1) Generally speaking, the larger the information gain, the greater the "purity improvement" obtained by splitting on attribute A. We can therefore use information gain to choose the splitting attribute of a decision tree.

  • 2) The famous ID3 decision tree learning algorithm selects split attributes based on information gain (a small computational sketch follows).
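
A minimal sketch of computing entropy and the information gain of a split, following the definitions above; the toy labels and the split are made up:

import numpy as np

def entropy(labels):
    # Entropy(X) = -sum(p * log2(p)) over the distinct values of X
    values, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, groups):
    # Gain = Entropy(parent) - weighted sum of the entropies of the child groups
    n = len(labels)
    weighted_child_entropy = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted_child_entropy

# Toy example: 9 positive and 5 negative samples
labels = np.array([1] * 9 + [0] * 5)
print(entropy(labels))  # about 0.940

# Splitting on a hypothetical attribute produces two groups
group1 = np.array([1] * 6 + [0] * 1)   # 6 positive, 1 negative
group2 = np.array([1] * 3 + [0] * 4)   # 3 positive, 4 negative
print(information_gain(labels, [group1, group2]))  # about 0.152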

Case demo

Basic algebra and probability are all you really need to scratch the surface of decision trees. If you want an in-depth conceptual understanding of probability and logarithms, I recommend courses in probability theory and algebra.
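
As a stand-in for the demo, here is a minimal sketch that trains an entropy-based decision tree on the iris data used earlier, using scikit-learn's DecisionTreeClassifier; the parameter choices are assumptions for illustration only:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the iris data bundled with scikit-learn
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Use entropy (information gain) as the splitting criterion
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
clf.fit(x_train, y_train)

print('Test accuracy:', clf.score(x_test, y_test))
print(export_text(clf, feature_names=list(iris.feature_names)))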

Conclusion

Mathematics is everywhere in data science. While some data science algorithms can feel like magic at times, we can understand the details of many of them with nothing more than algebra and basic probability and statistics. Don't want to learn any math? Technically, you can rely on machine learning libraries like scikit-learn to do all of this for you. But it is very helpful for a data scientist to have a solid understanding of the math and statistics behind these algorithms, so that they can choose the best algorithm for their problem and data set and make more accurate predictions. So embrace the pain and dive into the math! It's not as hard as you might think, and we've even created some classes on these topics to help you get started:

  • Probability and Statistics
  • Linear Algebra for Machine Learning

References

  • Math in Data Science
  • En.wikipedia.org/wiki/Naive_…
  • Gerardnico.com/wiki/data_m…
  • Scikit-learn.org/stable/modu…
  • Code implementation
  • Principle and realization of decision tree algorithm
  • Machine learning in action (iii) — Decision trees
  • LogisticRegression
  • Matplotlib
  • Understand logistic regression
  • Statistical mining those things (6) – powerful logistic regression (theory + case)
  • Clustering algorithm: K-means algorithm