Beyond the basics of statistics, databases, analysis methods, analytical thinking, and tools, an excellent data analyst also needs to understand the core ideas of data mining in order to dig real value out of data. This is one of the things that separates expert analysts from ordinary ones.
Data mining algorithms fall mainly into three categories: classification, clustering, and association rules, which together cover most algorithmic needs in today's commercial market. Each category contains many classic algorithms, and much of the material about them is abstruse and hard to follow. This article introduces the ten classic data mining algorithms in plain English to help you grasp them quickly.
Algorithm categories
- Link analysis: PageRank
- Association analysis: Apriori
- Classification: C4.5, Naive Bayes, SVM, KNN, AdaBoost, CART
- Clustering: K-means, EM
I. PageRank
The more often a paper is cited, the more influential the paper is.
Likewise, the more inbound links a web page has, and the higher the quality of the pages linking to it, the higher the quality of that page.
Principle
Page influence = damped base influence + the sum of the weighted influence of all pages that link to it
- A page's influence is the sum of the weighted influence contributed by all pages that link to it.
- The influence a page contributes to each page it links to is its own influence divided by its number of outbound links.
- Users do not always navigate by following links; they can also reach a page directly, for example by typing its URL.
- A damping factor is therefore introduced, representing the probability that a user continues browsing by following links.
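In standard notation, PR(u) = (1 − d)/N + d × Σ PR(v)/L(v), where d is the damping factor, N is the total number of pages, and the sum runs over every page v that links to u, each with L(v) outbound links. Below is a minimal power-iteration sketch in Python; the four-page graph, d = 0.85, and the iteration count are illustrative assumptions:

```python
# Minimal PageRank via power iteration.
# PR(u) = (1 - d) / N + d * sum(PR(v) / out_degree(v) for v linking to u)

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start with uniform influence
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page that links here contributes its own influence
            # divided by its number of outbound links.
            incoming = sum(pr[v] / len(links[v]) for v in pages if page in links[v])
            new_pr[page] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr

# A hypothetical four-page web graph.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
print(pagerank(links))  # C should rank highest: it has the most inbound links
```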
Analogies
1. Weibo
The number of followers a person has on Weibo does not necessarily equal their actual influence; it also depends on the quality of those followers.
Zombie (fake) followers count for nothing, but being followed by many big Vs or celebrities means real influence.
2. Running a store
Having more customers does not automatically make a store better; it also depends on whether they are quality customers.
3. Interests
People devote a lot of time to the people or things they are interested in, and the more attention a person or thing receives, the greater its influence and audience become.
About the damping factor
1. A page's influence is judged by the influence of its neighbors, but a page that cannot be reached through its neighbors is not necessarily uninfluential, because users can visit it directly; this is why the damping factor is introduced.
2. By analogy: the ocean is fed not only by the rivers that flow into it but also by rain, and rain falls at random.
3. The damping factor also accounts for sites whose inbound and outbound links are highly unbalanced yet whose influence is large.
- Outbound-link example: the hao123 navigation page has many outbound links but very few inbound links.
- Inbound-link example: search engines such as Baidu and Google have many inbound links but very few outbound links.
II. Apriori (Association Analysis)
Association mining discovers relationships between items from consumers' transaction records.
Principle
1. Support
The ratio of the number of transactions containing a given item combination to the total number of transactions.
Out of 5 transactions, milk was bought in 4, so the support of milk is 4/5 = 0.8.
Out of 5 transactions, milk + bread was bought in 3, so the support of milk + bread is 3/5 = 0.6.
2. Confidence
Given that item A was bought, the probability that item B was also bought; that is, the probability that B happens given that A happened.
Of the 4 transactions containing milk, 2 also contained beer, so the confidence of (milk → beer) is 2/4 = 0.5.
Of the 3 transactions containing beer, 2 also contained milk, so the confidence of (beer → milk) is 2/3 ≈ 0.67.
3. Lift
Lift measures how much the purchase of item A raises the probability of purchasing item B.
Lift(A → B) = Confidence(A → B) / Support(B).
Lift > 1 means A promotes B; Lift = 1 means A has no effect on B; Lift < 1 means A suppresses B.
4. Frequent item sets
An item set can be a single item or a combination of items.
A frequent item set is an item set whose support is greater than the minimum support (Min Support).
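To make the three metrics above concrete, here is a small Python sketch that reproduces the milk/bread/beer figures from the examples (the five transactions are hypothetical, chosen to match those numbers):

```python
# Five hypothetical transactions matching the examples above.
transactions = [
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"milk", "beer"},
    {"milk", "bread"},
    {"beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """P(B | A): support of A and B together, divided by support of A."""
    return support(a | b) / support(a)

def lift(a, b):
    """How much A raises the probability of B (1.0 means no effect)."""
    return confidence(a, b) / support(b)

print(support({"milk"}))               # 0.8
print(support({"milk", "bread"}))      # 0.6
print(confidence({"milk"}, {"beer"}))  # 0.5
print(lift({"milk"}, {"beer"}))        # ~0.83: milk slightly suppresses beer here
```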
The calculation process
1. Starting from K = 1, filter out the frequent item sets.
2. Combine the results into candidate (K+1)-item sets and filter again.
3. Repeat steps 1 and 2 until no new frequent item sets are found; the frequent (K−1)-item sets from the last round are the final result.
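A bare-bones version of that loop might look like the sketch below; it reuses the transactions list from the previous sketch and skips Apriori's candidate-pruning optimizations for brevity:

```python
from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Return all frequent item sets, mined level by level (K = 1, 2, ...)."""
    def sup(s):
        return sum(s <= t for t in transactions) / len(transactions)

    items = {i for t in transactions for i in t}
    frequent = []
    k = 1
    # Level K = 1: frequent single items.
    current = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_support]
    while current:
        frequent.extend(current)
        k += 1
        # Build candidate K-item sets from items seen at the previous level.
        seen = {i for s in current for i in s}
        current = [frozenset(c) for c in combinations(sorted(seen), k)
                   if sup(frozenset(c)) >= min_support]
    return frequent

print(apriori(transactions))  # e.g. {milk}, {bread}, {beer}, {milk, bread}
```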
Extension: FP-growth algorithm
The Apriori algorithm must scan the database many times, so its performance is low and it is unsuitable for large data volumes.
FP-growth stores the data in an FP tree data structure. Building the tree requires scanning the database only twice, and subsequent processing never touches the database again.
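As a usage sketch, the third-party mlxtend library provides an FP-growth implementation (an assumption: mlxtend and pandas are installed; transactions again reuses the earlier list):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# One-hot encode the transactions into the DataFrame format mlxtend expects.
dataset = [list(t) for t in transactions]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(dataset).transform(dataset), columns=te.columns_)

# Same frequent item sets as Apriori, found with only two passes over the data.
print(fpgrowth(df, min_support=0.5, use_colnames=True))
```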
Beer and diapers sell together
Walmart found through data analysis that in American families with babies, the mother usually stays home to look after the child while the father goes to the supermarket to buy diapers.
While buying diapers, the father often treats himself to a few bottles of beer, so the supermarket tried promoting beer and diapers together, which led to a huge increase in the sales of both.
III. AdaBoost
Principle
Simply put, multiple weak classifiers are trained and combined into one strong classifier.
A series of weak classifiers, each given a different weight, are combined to produce the final classification.
The calculation process
1. Initialize the sample weights.
2. Given the weight matrix, compute the error rate of each candidate classifier and select the one with the lowest error rate as this round's optimal classifier.
3. Using the classifier-weight formula, lower the weights of correctly classified samples and raise the weights of misclassified samples, producing the new weight matrix and the weight of the round-K classifier.
4. Feed the new weight matrix back into steps 2 and 3 and recompute.
5. After N rounds, combine each round's classifier with its recorded weight to obtain the strong classifier.
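You rarely implement this loop by hand. Here is a hedged sketch using scikit-learn's AdaBoostClassifier on synthetic data; the dataset and parameter choices are illustrative, not prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number of boosting rounds, i.e. weak classifiers.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy of the combined strong classifier
```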
Analogies
1. Reviewing wrong answers to study more efficiently
Questions you got right, do less of next time; you have already mastered them.
Questions you got wrong, do more of next time; focus on your mistakes.
As you keep studying this way, you get fewer and fewer questions wrong.
2. Sensible crossover to improve profits
Apple, combining software and hardware, takes most of the profit in the mobile phone market; knowledge from two fields combines to generate new revenue.
IV. C4.5 (Decision Tree)
Decision making is the process of choosing among multiple possible answers to a question.
C4.5 is an algorithm for generating decision trees and is mainly used for classification.
C4.5 uses the information gain ratio for its calculations (the ID3 algorithm uses information gain).
Principle
C4.5 selects the most effective attribute for splitting the sample set; the splitting rule is to compare the information gain ratio of every attribute.
The higher an attribute's information gain ratio, the stronger its classification ability, so that attribute should be split on first.
Analogy: picking a watermelon.
If the texture is clear, it is a good melon; if the texture is a little fuzzy, consider its density: if the density is above a certain threshold, treat it as a good melon.
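For intuition, here is a minimal Python sketch of the gain-ratio computation behind that rule; the toy melon data are made up to echo the analogy above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum(p * log2(p)) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature, labels):
    """Information gain of `feature`, divided by its split information."""
    n = len(labels)
    gain, split_info = entropy(labels), 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        gain -= len(subset) / n * entropy(subset)
        split_info -= len(subset) / n * log2(len(subset) / n)
    return gain / split_info if split_info else 0.0

# Toy data: does the clarity of a melon's texture predict a good melon?
texture = ["clear", "clear", "fuzzy", "clear", "fuzzy", "fuzzy"]
is_good = ["yes",   "yes",   "no",    "yes",   "no",    "yes"]
print(gain_ratio(texture, is_good))  # higher ratio = split on this first
```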
V. CART (Decision Tree)
CART stands for Classification And Regression Tree: as the name says, it can do both classification and regression.
What are classification trees and regression trees?
Classification tree: handles discrete data, i.e., data with a finite number of categories; the output is the sample's category.
Regression tree: predicts continuous values; the output is a number that may fall anywhere within an interval.
Regression and classification are essentially the same problem, making an output prediction for an input; the difference lies in the type of the output variable.
Principle
CART classification tree
It is similar to the C4.5 algorithm, except that the attribute-selection criterion is the Gini coefficient.
The Gini coefficient reflects the uncertainty of the samples: the smaller it is, the smaller the differences between samples and the lower the uncertainty.
Classification is a process of reducing uncertainty, so when building a classification tree, CART splits on the attribute with the lowest Gini coefficient.
CART regression tree
Using mean squared error or absolute error as the criterion, select the feature whose split yields the smallest mean squared or absolute error.
Analogy
Classification: predict whether tomorrow will be cloudy, sunny, or rainy.
Regression: predict tomorrow's temperature.
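Tying the analogy to code: scikit-learn's decision trees are CART-style, so a minimal sketch of both modes might look like this (toy data; the criterion names assume a recent scikit-learn version):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: Gini impurity is the default split criterion.
clf = DecisionTreeClassifier(criterion="gini")
clf.fit([[0, 0], [1, 1], [0, 1], [1, 0]], ["rainy", "sunny", "rainy", "sunny"])
print(clf.predict([[1, 1]]))  # -> ['sunny']

# Regression: squared error is the default split criterion.
reg = DecisionTreeRegressor(criterion="squared_error")
reg.fit([[1], [2], [3], [4]], [10.0, 20.0, 30.0, 40.0])
print(reg.predict([[2.5]]))  # a continuous value, not a category
```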
VI. Naive Bayes (Conditional Probability)
Naive Bayes is a simple but effective classification algorithm: for an object of unknown class, it computes the probability of each class given the object's attributes and picks the class with the highest probability.
Principle
Assume the input features are independent of one another; then, using Bayes' rule from probability theory, derive the posterior probability from the priors and the conditional probability: P(A|B) = P(B|A) × P(A) / P(B).
P(A): prior probability, the judgment of how likely event A is before event B is observed.
P(B|A): conditional probability, the probability of event B given that event A has occurred.
P(A|B): posterior probability, the re-evaluated probability of A after B has been observed.
Analogy: classifying a patient.
Given a new patient, a construction worker who sneezes, calculate the probability that he has a cold.
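Here is a sketch of that calculation in Python; the patient records are entirely hypothetical stand-ins for the symptom table such examples usually rely on:

```python
# Hypothetical patient records: (occupation, symptom, diagnosis).
patients = [
    ("nurse",   "sneeze",   "cold"),
    ("farmer",  "sneeze",   "allergy"),
    ("builder", "headache", "concussion"),
    ("builder", "headache", "cold"),
    ("teacher", "sneeze",   "cold"),
    ("teacher", "headache", "concussion"),
]

def naive_bayes(occupation, symptom):
    """Score each diagnosis: P(d | occ, sym) ~ P(occ | d) * P(sym | d) * P(d)."""
    n = len(patients)
    scores = {}
    for d in {p[2] for p in patients}:
        subset = [p for p in patients if p[2] == d]
        prior = len(subset) / n                                        # P(d)
        p_occ = sum(p[0] == occupation for p in subset) / len(subset)  # P(occ | d)
        p_sym = sum(p[1] == symptom for p in subset) / len(subset)     # P(sym | d)
        scores[d] = prior * p_occ * p_sym  # "naive": features assumed independent
    return scores

print(naive_bayes("builder", "sneeze"))  # the highest-scoring diagnosis wins
```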
VII. SVM
SVM stands for Support Vector Machine, a common classification method originally designed for binary classification problems. In machine learning terms, SVM is a supervised learning model.
What are supervised and unsupervised learning?
Supervised learning: classifying sample data when category labels are available.
Unsupervised learning: grouping sample data without category labels, i.e., clustering; the resulting groups must then be analyzed further to learn the characteristics of each one.
Principle
Find the sample points closest to the decision boundary (the support vectors), then fit a line or plane whose distance to those points is as large as possible.
Hard margin: the data are linearly separable, and the classifier is derived directly.
Soft margin: a certain number of misclassified samples is tolerated.
Kernel function: maps non-linearly distributed data into a space where it becomes linearly separable.
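A minimal scikit-learn sketch of the kernel idea, using synthetic concentric-circle data (all parameter choices are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM struggles here; the RBF kernel implicitly maps the data
# into a space where a linear separator exists (the "kernel trick").
print(SVC(kernel="linear").fit(X, y).score(X, y))  # mediocre accuracy
print(SVC(kernel="rbf").fit(X, y).score(X, y))     # close to 1.0
```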
Analogies
1. Separating a pile of red and blue balls on a table
Divide the red and blue balls on the table using a piece of string.
2. Separating red and blue balls mixed together in a box
Divide the red and blue balls in the box into two parts using a flat surface.
VIII. KNN (Classification)
KNN is one of the most basic and simplest machine learning algorithms. It can do both classification and regression, and it classifies by measuring the distances between feature values.
Principle
Compute the distance between the object to be classified and the other objects; among its K nearest neighbors, the most common category is predicted to be the object's category.
Calculation steps
1. Choose a distance measure appropriate to the scenario and compute the distance between the object to be classified and every other object.
2. Find the K nearest neighbors.
3. Among those K neighbors, the category with the most members is predicted as the category of the object.
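The three steps map almost line for line onto a small Python sketch (the points, labels, and K value are illustrative assumptions):

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(samples, labels, query, k=3):
    """Steps 1-3 above: rank by distance, take the K nearest, majority vote."""
    neighbors = sorted(zip(samples, labels), key=lambda s: dist(s[0], query))
    votes = [label for _, label in neighbors[:k]]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical 2-D points labeled by color.
samples = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["red", "red", "red", "blue", "blue", "blue"]
print(knn_predict(samples, labels, query=(2, 2)))  # -> 'red'
```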
Analogy: he who stays near vermilion turns red; you are classified by the company you keep.
IX. K-means (Clustering)
K-means is a clustering algorithm and a form of unsupervised learning: it produces a specified number K of clusters and assigns each object to the nearest cluster center.
Principle
1. Randomly select K points as cluster centers.
2. Assign each point to the nearest center, forming K clusters.
3. Recompute the center of each cluster. For example, if a cluster contains 10 points, its new center is the centroid of those 10 points; a simple way to compute it is to average their coordinates. Steps 2 and 3 repeat until the centers no longer change.
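A minimal sketch with scikit-learn's KMeans on toy data (the points and K = 2 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points, for illustration.
points = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # which cluster each point was assigned to
print(km.cluster_centers_)  # the two final centers (the "bosses" below)
```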
Analogies
1. Choosing the boss
Randomly pick K leaders; whoever is closest to a leader joins that leader's queue (compute the distances; people who are close together form a group).
Over time each boss's position shifts (the center point is recomputed by the algorithm), until the true central bosses emerge (repeat until the assignments no longer change).
2. Differences between K-means and KNN
K-means forms classes by choosing bosses and keeps reshuffling until the best central bosses are found.
KNN simply joins the nearest team: whichever class its neighbors belong to is the class it gets.
X. EM (Clustering)
EM stands for Expectation Maximization, so the EM algorithm is also called the maximum expectation algorithm; it too is a clustering algorithm.
The difference between EM and K-means:
- EM computes probabilities, while K-means computes distances.
- EM is soft clustering: the same sample may belong to multiple categories. K-means is hard clustering: a sample can belong to only one category. The former can therefore uncover hidden patterns in the data.
Principle
First make an initial guess at likely parameter values, then repeatedly adjust them based on the observed data until the parameters converge.
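In practice, EM most often appears as the fitting procedure behind Gaussian mixture models; here is a minimal scikit-learn sketch on made-up one-dimensional data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two loose 1-D groups, for illustration.
data = np.array([[1.0], [1.2], [0.8], [4.0], [4.2], [3.8]])

# GaussianMixture estimates its parameters with the EM algorithm.
gm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gm.means_)               # the estimated group means
print(gm.predict_proba(data))  # soft assignments: a probability per group
```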
Analogy: dividing a dish of food between two plates.
Few people weigh the food on a scale and then divide it exactly in half.
What most people do is:
1. Put part of the food on plate A and the rest on plate B.
2. Check whether plates A and B hold the same amount, and move some food onto the plate that has less.
3. Check again whether A and B hold the same amount, and repeat until there is no visible difference.
That covers all ten algorithms. In practice, the commonly used algorithms are already packaged in libraries; most of the time you only need to instantiate the corresponding model.