1. Linear regression

Linear regression is usually used to estimate real-valued quantities from continuous variables (room rates, number of calls, total sales, etc.). We establish the relationship between the independent and dependent variables by fitting a best-fit line. This best-fit line is called the regression line and is represented by the linear equation Y = a*X + b.

The best way to understand linear regression is to go back to childhood. What would a fifth grader do if asked to rank his classmates from lightest to heaviest without asking their weights? He or she would likely assess each person's height and build by eye and combine these observable parameters to produce a ranking. This is a real-life example of linear regression: the child has worked out a relationship between height, build, and weight that looks very much like the equation above.

In this equation:

  • Y: the dependent variable

  • a: the slope

  • X: the independent variable

  • b: the intercept

The coefficients a and b are obtained by the method of least squares.

Consider the following example. We find the best-fitting line y = 0.2811x + 13.9. Given a person's height, we can use this equation to estimate their weight.

The two main types of linear regression are simple linear regression and multiple linear regression. Simple linear regression has only one independent variable, while multiple linear regression, as its name suggests, has several. When looking for the best-fit line, you can also fit a polynomial or a curve; this is called polynomial or curvilinear regression.

Python code
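A minimal sketch of fitting such a line with scikit-learn's LinearRegression; the height/weight numbers below are made up purely for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    heights = np.array([[150], [160], [165], [170], [180], [185]])  # cm, made up
    weights = np.array([50, 56, 61, 66, 72, 78])                    # kg, made up

    model = LinearRegression()
    model.fit(heights, weights)        # least squares fit of weight = a*height + b

    print("slope a:", model.coef_[0])
    print("intercept b:", model.intercept_)
    print("predicted weight at 175 cm:", model.predict([[175]])[0])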

2. Logistic regression

Don’t be fooled by the name! This is a classification algorithm, not a regression algorithm. The algorithm estimates discrete values (for example, binary values 0 or 1, yes or no, true or false) from a given set of independent variables. In simple terms, it estimates the probability of an event by fitting the data to a logit function, which is why it is also called logit regression. Because it estimates probabilities, its output values lie between 0 and 1 (as expected).

Let’s understand this algorithm again with a simple example.

Suppose your friend asks you to solve a puzzle. There are only two possible outcomes: you solve it or you don’t. Now imagine you are given a wide range of problems in order to figure out which subjects you are good at. The result of this study would look something like this: if the problem is a tenth-grade trigonometry problem, you have a 70 percent chance of solving it; if it is a fifth-grade history question, you only get it right 30 percent of the time. That is what logistic regression gives you.

Mathematically, the log odds of the outcome are modeled as a linear combination of the predictor variables:

odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring

logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk

In this formula, p is the probability of the characteristic of interest. The parameters are chosen so as to maximize the likelihood of observing the sample, rather than to minimize the sum of squared errors (as ordinary regression does).

Now you might ask: why take the logarithm of the odds? In short, this is one of the best ways of approximating a step function. I could go into more detail, but that would defeat the purpose of this guide.

Python code
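A minimal sketch with scikit-learn's LogisticRegression; the "hours practised vs. puzzle solved" data below is invented for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # hypothetical hours of practice
    solved = np.array([0, 0, 0, 1, 0, 1, 1, 1])                 # 1 = puzzle solved, 0 = not solved

    clf = LogisticRegression()
    clf.fit(hours, solved)

    # predict_proba returns [P(not solved), P(solved)] for each input
    print(clf.predict_proba([[4.5]]))
    print(clf.predict([[4.5]]))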

3. KNN (K-Nearest Neighbor Algorithm)

The algorithm can be used for both classification and regression problems, although in industry the K-nearest neighbor algorithm is more commonly used for classification. K-nearest neighbors is a simple algorithm: it stores all available cases and classifies new cases by a majority vote of their K nearest neighbors. According to a distance function, the new case is assigned to the class most common among its K neighbors.

The distance function can be Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables, and the fourth (Hamming distance) for categorical variables. If K = 1, the new case is simply assigned to the class of its nearest neighbor. Choosing the value of K can sometimes be a challenge when modeling with KNN.

More information: Introduction to the K-Nearest Neighbors Algorithm (Simplified Version)

We can easily apply KNN in real life. If you want to get to know a complete stranger, you might want to reach out to his close friends or circle for information.

Things to consider before choosing to use KNN:

  • The computational cost of KNN is high.

  • Variables should be normalized, otherwise variables with larger ranges will bias the algorithm.

  • Before KNN is used, more effort should go into preprocessing steps such as outlier and noise removal.
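A minimal sketch of KNN classification with scikit-learn's KNeighborsClassifier on its built-in iris data set:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)   # K = 5 nearest neighbors
    knn.fit(X_train, y_train)
    print("test accuracy:", knn.score(X_test, y_test))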

4. Support vector machines

This is a classification method. In this algorithm, we plot each data point as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.

For example, if we had only two features, height and hair length, we would plot these two variables in two dimensions, with each point having two coordinates (these coordinates are called support vectors).

Now we find a line that separates the two differently classified groups of data, chosen so that the distances from the line to the closest points in each of the two groups are as large as possible.

The black line in the example above splits the data into two optimally classified groups, because the distances from the closest points in each group (points A and B in the figure) to the black line satisfy this optimality condition. This line is our classifier. Then, depending on which side of the line the test data falls, we assign the new data to that class.

See more: Simplification of support vector machines

Think of this algorithm as playing JezzBall in an N-dimensional space. A few minor changes to the game are needed:

  • Instead of having to draw a straight line horizontally or vertically, you can now draw a line or a plane at any angle.

  • The purpose of the game becomes to divide the different colored balls into different spaces.

  • The position of the ball will not change.

Python code
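A minimal sketch with scikit-learn's SVC using a linear kernel; the height/hair-length points below are made up to echo the example above:

    import numpy as np
    from sklearn.svm import SVC

    # two made-up features: height (cm) and hair length (cm)
    X = np.array([[150, 30], [155, 28], [160, 25], [175, 5], [180, 3], [185, 4]])
    y = np.array([0, 0, 0, 1, 1, 1])            # two hypothetical classes

    clf = SVC(kernel="linear")                  # a straight separating line, as in the figure
    clf.fit(X, y)

    print("support vectors:")
    print(clf.support_vectors_)
    print("prediction for (170, 10):", clf.predict([[170, 10]]))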

5. Naive Bayes

Naive Bayes is a classification method based on Bayes’ theorem, under the assumption that the predictor variables are independent. In simpler terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is round, red, and about 3 inches in diameter. Even if these features depend on each other, or on the presence of other features, a naive Bayes classifier treats each of them as independently contributing to the probability that the fruit is an apple.

Naive Bayes models are easy to build and very useful for large data sets. Despite their simplicity, naive Bayes classifiers often outperform highly sophisticated classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). Look at this equation:

P(c|x) = P(x|c) * P(c) / P(x)

Here,

  • P(c|x) is the posterior probability of the class (target) given the predictor variables (attributes)

  • P(c) is the prior probability of the class

  • P(x|c) is the likelihood, i.e., the probability of the predictor variables given the class

  • P(x) is the prior probability of the predictor variables

Example: Let’s use an example to understand this concept. Below, I have a weather training set and the corresponding target variable “Play”. Now, we need to categorize the participants who will “play” and “not play” based on weather conditions. Let’s do the following steps.

Step 1: Transform the data set into a frequency table.

Step 2: Create a likelihood table by finding the probabilities, for example the probability of Overcast is 0.29 and the probability of play is 0.64.

Step 3: Now, use the naive Bayes equation to calculate the posterior probability of each class. The class with the highest posterior probability is the predicted outcome.

Problem: Participants will play if the weather is sunny. Is this statement correct?

We can solve it using the method discussed above: P(play|sunny) = P(sunny|play) * P(play) / P(sunny).

We have P(sunny|play) = 3/9 = 0.33, P(sunny) = 5/14 = 0.36, and P(play) = 9/14 = 0.64.

Now, P(play|sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.

Naive Bayes uses a similar approach, using different attributes to predict probabilities of different classes. This algorithm is commonly used for text classification and problems involving more than one class.
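A minimal sketch that reproduces the hand calculation above in plain Python, on a 14-row weather/play table made up to match the counts used in the text:

    # hypothetical weather/play observations (5 sunny, 4 overcast, 5 rainy; 9 "yes" in total)
    weather = ["sunny", "sunny", "sunny", "sunny", "sunny",
               "overcast", "overcast", "overcast", "overcast",
               "rainy", "rainy", "rainy", "rainy", "rainy"]
    play    = ["yes", "yes", "yes", "no", "no",
               "yes", "yes", "yes", "yes",
               "yes", "yes", "no", "no", "no"]

    p_play = play.count("yes") / len(play)              # P(play) = 9/14
    p_sunny = weather.count("sunny") / len(weather)     # P(sunny) = 5/14
    p_sunny_given_play = sum(1 for w, p in zip(weather, play)
                             if w == "sunny" and p == "yes") / play.count("yes")  # 3/9

    # Bayes' theorem: P(play|sunny) = P(sunny|play) * P(play) / P(sunny)
    p_play_given_sunny = p_sunny_given_play * p_play / p_sunny
    print(round(p_play_given_sunny, 2))                 # 0.6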

6. Decision tree

This is one of my favorite and most frequently used algorithms. It is a supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous groups, based on the most significant attributes or independent variables, so as to make the groups as distinct as possible. To learn more, read: Simplifying Decision Trees.

Source: statsexchange

As you can see in the image above, the population is divided into four different groups based on multiple attributes, to determine whether they will play or not. To split the population into distinct groups, a number of techniques are used, such as Gini, information gain, chi-square, and entropy.

The best way to understand how decision trees work is to play Jezzball, a classic Microsoft game (see picture below). The ultimate goal of the game is to carve out as much space as possible without balls by building walls in a room where you can move walls.

Therefore, every time you divide a room with a wall, you are trying to create two different populations in the same room. Similarly, decision trees are trying to divide the population into as many different groups as possible.

See simplification of decision tree algorithms for more information

Python code
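A minimal sketch with scikit-learn's DecisionTreeClassifier on its built-in iris data, using the Gini criterion mentioned above:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    tree = DecisionTreeClassifier(criterion="gini", max_depth=3)  # Gini is one of the splitting criteria named above
    tree.fit(X, y)
    print("training accuracy:", tree.score(X, y))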

7. K-means algorithm

The K-means algorithm is an unsupervised learning algorithm that solves clustering problems. Its procedure follows a simple way of grouping a data set into a certain number of clusters (assume K clusters). Data points within a cluster are homogeneous and heterogeneous with respect to other clusters.

Remember finding shapes in ink blots? K-means is somewhat similar to this activity: you look at the shapes and stretch your imagination to figure out how many clusters or populations there are.

How k-means algorithm forms clusters:

  1. The K-means algorithm picks K points for the clusters. These points are called centroids.

  2. Each data point forms a cluster with the closest centroid, giving K clusters.

  3. The centroid of each cluster is recomputed from the existing cluster members. Now we have new centroids.

  4. With the new centroids, steps 2 and 3 are repeated: find the closest centroid for each data point and associate it with the new K clusters. This process is repeated until convergence, i.e., until the centroids no longer change.

How to determine K value:

In K-means we have clusters, and each cluster has its own centroid. The sum of the squared distances between the centroid and the data points within a cluster constitutes the within-cluster sum of squares. Adding up the sums of squares of all clusters gives the total within-cluster sum of squares for the clustering solution.

We know that this value keeps decreasing as the number of clusters increases. However, if you plot the result, you will see that the sum of squared distances decreases sharply up to a certain value of K, and much more slowly after that. Here, we can find the optimal number of clusters.

Python code
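A minimal sketch with scikit-learn's KMeans on made-up two-dimensional blobs; printing inertia_ for several values of K shows the elbow behaviour described above:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    # three made-up blobs of two-dimensional points
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])

    for k in range(1, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # inertia_ is the within-cluster sum of squared distances to the centroids
        print(k, round(km.inertia_, 1))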

8. Random forest

Random forest is a proper term for an ensemble of decision trees. In the random forest algorithm, we have a collection of decision trees (hence the name "forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification that receives the most votes (over all the trees in the forest).

Each tree is grown like this:

  1. If the number of cases in the training set is N, a sample of N cases is drawn at random, with replacement, from the original data. This sample is the training set for growing the tree.

  2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m variables is used to split the node. The value of m is held constant while the forest is grown.

  3. Each tree is grown to the largest extent possible, without pruning.

Python code
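A minimal sketch with scikit-learn's RandomForestClassifier on its built-in iris data; n_estimators is the number of trees and max_features plays the role of m described above:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(X_train, y_train)
    print("test accuracy:", forest.score(X_test, y_test))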


9. Gradient Boosting and AdaBoost algorithms

GBM and AdaBoost are two boosting algorithms used when we need to process a lot of data and make predictions with high predictive power. Boosting is an ensemble learning method that combines the predictions of several base estimators in order to improve robustness over a single estimator. These boosting algorithms consistently do well in data science competitions such as Kaggle, the AV Hackathon, and CrowdAnalytix.

Python code
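A minimal sketch with scikit-learn's GradientBoostingClassifier on its built-in breast-cancer data; AdaBoostClassifier from the same module can be swapped in with an almost identical interface:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
    gbm.fit(X_train, y_train)
    print("test accuracy:", gbm.score(X_test, y_test))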

GradientBoostingClassifier and random forests are two different kinds of tree ensemble classifiers, and people often ask about the difference between the two algorithms: gradient boosting grows its trees sequentially, each one correcting the errors of the previous ones, while a random forest grows its trees independently on bootstrap samples (bagging).

10. Dimensionality reduction algorithm

Over the last four to five years, the amount of information captured at every possible stage has grown exponentially. Companies, government agencies, and research organizations are not only tapping new data sources, but also capturing information in great detail.

For example, e-commerce companies capture increasingly detailed information about their customers: personal details, browsing history, likes and dislikes, purchase history, feedback, and much more, giving them more personal attention than the grocery store clerk next door.

As data scientists, the data we are offered contains a great many features. That sounds like good raw material for building a robust model, but there is a challenge: how do you pick out the most significant variables from 1,000 or 2,000? In such cases, dimensionality reduction algorithms, together with other algorithms such as decision trees, random forests, PCA, and factor analysis, help us find the important variables based on the correlation matrix, the proportion of missing values, and other factors.

Python code
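A minimal sketch of PCA, one of the techniques named above, with scikit-learn on its built-in 64-feature digits data:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, y = load_digits(return_X_y=True)

    pca = PCA(n_components=10)            # keep the 10 directions with the most variance
    X_reduced = pca.fit_transform(X)

    print("original shape:", X.shape)
    print("reduced shape:", X_reduced.shape)
    print("variance explained:", round(pca.explained_variance_ratio_.sum(), 2))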


Reprint. Original: http://blog.csdn.net/j2IaYU7Y/article/details/78988060