1. Introduction to data mining and machine learning
What is data mining? Data mining refers to processing and analyzing existing data to uncover the deeper relationships within it. For example: in a supermarket, does milk sell better when placed next to bread or next to some other item? Data mining techniques can answer this kind of question; specifically, the supermarket shelf-placement problem falls under association analysis.
In daily life, data mining technology is widely used. For example, merchants often need to divide their customers into levels (SVIP, VIP, ordinary customers, etc.). Part of the customer data can be used as training data and the rest as test data. The training data is fed into a model for training; after training is complete, the test data is fed in for evaluation, and the result is an automatic division of customer levels. Similar applications include captcha recognition and automatic fruit-quality screening.
So what is machine learning? In short, machine learning is the technology that lets machines learn the relationships or rules within data through the models and algorithms we build. Machine learning is an interdisciplinary field that can be roughly divided into two categories: traditional machine learning and deep learning, where deep learning covers neural-network-related techniques. This course focuses on traditional machine learning techniques and algorithms.
Machine learning and data mining are often mentioned together because both explore regularities in data. They also have very broad application scenarios in real life, several classic ones of which are listed below:
1. Classification: customer grading, captcha recognition, automatic fruit-quality screening, etc.
Machine learning and data mining techniques can be used to solve classification problems such as customer grading, captcha recognition, and automatic fruit-quality screening.
Take captcha recognition as an example: we need a scheme that recognizes captchas composed of handwritten digits from 0 to 9. One solution is to set aside some handwritten digits from 0 to 9 as a training set and label them manually, mapping each handwritten digit to its numeric category. Once these mappings are established, a model can be built with a classification algorithm. When a new handwritten digit appears, the model predicts which numeric category it belongs to; for example, if the model predicts that it belongs to the category of the number 1, it is automatically recognized as the digit 1. Captcha recognition is therefore essentially a classification problem.
Automatic fruit-quality screening is also a classification problem. A fruit's size, color, and other characteristics can be mapped to a sweetness category, for example 1 for sweet and 0 for not sweet. With some training data, a model can again be built with a classification algorithm; a new fruit can then be judged sweet or not from its size, color, and other characteristics. In this way, fruit quality can be screened automatically.
2. Regression: continuous data prediction, trend prediction, etc.
Besides classification, data mining and machine learning have another very classic scenario: regression. In the classification scenarios above, the number of categories is limited. In digit-captcha recognition there are the ten categories 0 to 9; in letter-captcha recognition there are the limited categories A to Z. Both numeric and alphabetic categories are finite in number.
Now suppose that after mapping, the best result for some data falls not at 0, 1, or 2, but at continuous values such as 1.2, 1.3, 1.4, and so on. Classification algorithms cannot solve this kind of problem; regression analysis can. In practical applications, regression analysis enables continuous data prediction and trend prediction.
3. Clustering: customer value prediction, business-circle prediction, etc.
What is clustering? As mentioned above, solving a classification problem requires historical data (i.e., correctly labelled training data). If there is no historical data and objects must be assigned to categories directly from their features, classification and regression algorithms cannot help. The solution in this case is clustering, which divides objects into categories directly according to their characteristics. It requires no training, so it is an unsupervised learning method.
When can clustering be used? If a database holds the feature data of a group of customers, and the customers must be graded directly from those features (e.g., into SVIP and VIP customers), a clustering model solves the problem. Clustering algorithms can also be used to predict business circles.
4. Association analysis: supermarket shelf placement, personalized recommendation, etc.
Association analysis studies the correlations between items. For example, a supermarket stocks a large number of goods, and we may want to know how strongly they are associated, such as the strength of the association between bread and milk. Association analysis algorithms can compute this directly from users' purchase records and similar information. Knowing these associations, the supermarket can place strongly associated goods in nearby positions, which effectively increases sales.
Association analysis can also be used for personalized recommendation. From users' browsing history, the correlation between web pages can be analyzed, and strongly correlated pages can be pushed to users as they browse. For example, if analysis of the browsing records shows a strong correlation between page A and page C, then page C can be pushed to a user who is browsing page A, realizing personalized recommendation.
5. Natural language processing: text similarity, chatbots, etc.
In addition to the scenarios above, data mining and machine learning techniques can also be used for natural language processing and speech processing, for example text similarity calculation and chatbots.
2. Python data preprocessing
Before data mining and machine learning, the first step is to preprocess the existing data. If even the initial data is incorrect, there is no guarantee that the final result will be correct; only by preprocessing the data to ensure its accuracy can the correctness of the final result be ensured.
Data preprocessing means the preliminary processing of data to remove dirty data (data that would distort the result); otherwise the final result is easily affected. The common preprocessing methods are the following:
1. Missing value processing
A missing value is a feature value absent from a row of data. There are two ways to handle one: delete the row containing the missing value, or fill in a correct value in its place.
2. Outlier processing
Outliers usually arise from errors in data collection, for example the number 68 mistakenly recorded as 680. Before handling outliers we naturally need to find them, which is often done by plotting the data. Once the outliers have been handled, the data tends toward correctness, ensuring the accuracy of the final result.
3. Data integration
Compared with missing value and outlier processing, data integration is a relatively simple form of preprocessing. What is data integration? Given two data sets A and B already loaded into memory, Pandas can merge them into a single set; the merging process is essentially integration.
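A minimal sketch of integration with pandas; the frames and column names here are hypothetical:

```python
import pandas as pd

a = pd.DataFrame({"title": ["item1", "item2"], "price": [10.0, 20.0]})
b = pd.DataFrame({"title": ["item3", "item4"], "price": [30.0, 40.0]})

merged = pd.concat([a, b], ignore_index=True)  # stack A and B into one set
print(merged)
```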
Next, let's take Taobao commodity data as an example and walk through the preprocessing methods above in practice.
Before preprocessing, the Taobao commodity data needs to be imported from the MySQL database. After starting MySQL, query the TAOB table in it to inspect the data.
As you can see, the TAOB table has four fields: title stores the name of the Taobao product; link stores the product's URL; price stores the product's price; and comment stores the number of comments on the product (which to some extent reflects its sales volume).
So how is this data imported? First connect to the database with pymysql, retrieve all the data in TAOB, and load it into memory using the read_sql() method in pandas.
The read_sql() method takes two parameters: the first is the SQL statement, and the second is the connection to the MySQL database. A sketch of the code is shown below:
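A sketch of the import, assuming placeholder connection settings (the host, user, password, and database name would be your own):

```python
import pymysql
import pandas as pd

conn = pymysql.connect(host="127.0.0.1", user="root",
                       password="your_password", database="taobao_db")
sql = "SELECT * FROM taob"
data = pd.read_sql(sql, conn)   # first arg: SQL; second: the connection
conn.close()
```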
1. Hands-on missing value processing
Missing values are handled through data cleaning. For example, a product's number of comments may legitimately be 0, but its price cannot be. Yet the database actually contains records with a price of 0, which happens because the price attribute of some records was not crawled.
So how do you tell whether the data contains missing values? Use the following method:
First, call the describe() method on the data loaded from the TAOB table; it prints summary statistics like the following:
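Assuming data is the frame loaded above:

```python
print(data.describe())   # summary statistics for the numeric columns
```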
How do you read these statistics? First compare the count values of the price and comment fields. If they are unequal, some data must be missing; if they are equal, missing data cannot be detected this way. For example, if the count of price is 9616.0000 and that of comment is 9615.0000, then at least one comment value is missing.
The other rows mean the following: mean is the average; std is the standard deviation; min is the minimum; max is the maximum.
So how is the missing data handled? One way is to delete the rows; the other is to insert a new value in place of the missing one, typically the mean or the median depending on the situation. For data that is stable and varies within a small range, such as age (1 to 100), the mean is generally inserted; for data with a large range of variation, the median is generally inserted.
To handle the missing price values, do the following:
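A sketch of both options, treating a price of 0 as missing:

```python
import numpy as np

data["price"] = data["price"].replace(0, np.nan)   # mark 0 as missing

# Option 1: delete the rows containing missing prices.
cleaned = data.dropna(subset=["price"])

# Option 2: fill the missing prices with the mean (or the median).
data["price"] = data["price"].fillna(data["price"].mean())
```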
2. Hands-on outlier processing
As with missing values, handling outliers first requires finding them, usually by drawing a scatter plot: similar data concentrates in one region, while anomalous data lies far away from it. This property makes outliers easy to spot. The specific steps are shown below:
The first step is to extract the price data and comment data. This could be done with a loop, but that is overly cumbersome; a simpler method is to transpose the data frame so that the original column data becomes row data, which makes it very convenient to obtain the price and comment data. The scatter plot is then drawn with the plot() method, where the first argument is the abscissa, the second the ordinate, and the third the plot type, with "o" meaning a scatter plot. Finally, the show() method displays the figure so the outliers can be observed intuitively. These outliers are of no help to the analysis; in practice, the data they represent usually has to be deleted or converted to normal values.
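A minimal sketch of these steps, assuming data is the frame loaded from TAOB and that price and comment are its third and fourth columns:

```python
import matplotlib.pyplot as plt

data2 = data.T                 # transpose: columns become rows
price = data2.values[2]        # third field of TAOB: price (assumption)
comment = data2.values[3]      # fourth field: comment count (assumption)

plt.plot(price, comment, "o")  # "o" draws a scatter plot
plt.show()
```

The resulting scatter plot is shown below: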
As the scatter plot shows, the outliers can be handled by treating every record with more than 100,000 comments or a price above 1,000 as abnormal. There are two ways to process them:
The first is to replace the outliers with the median, the mean, or some other value, as sketched below:
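A sketch of the replacement approach, using the thresholds named above:

```python
# Replace out-of-range values with the column medians.
price_median = data["price"].median()
comment_median = data["comment"].median()

data.loc[data["price"] > 1000, "price"] = price_median
data.loc[data["comment"] > 100000, "comment"] = comment_median
```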
The second method, which is also the recommended one, is to delete the abnormal data directly, as sketched below:
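A sketch of the deletion approach, keeping only the rows inside the normal range:

```python
# Keep only the rows whose price and comment count look normal.
data = data[(data["price"] <= 1000) & (data["comment"] <= 100000)]
```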
3. Distribution analysis
Distribution analysis examines how the data is distributed, for example whether it is linear or normal. It is generally done by drawing a histogram, which involves three steps: computing the range, computing the group distance, and drawing the histogram. The specific steps are shown below:
The arange() method builds the bin edges: its first argument is the minimum, the second the maximum, and the third the group distance. The hist() method then draws the histogram.
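A minimal sketch of these steps, with 12 groups as an assumed choice:

```python
import numpy as np
import matplotlib.pyplot as plt

price = data["price"].values
extent = price.max() - price.min()       # the range
group_dist = extent / 12                 # group distance (12 bins assumed)
bins = np.arange(price.min(), price.max(), group_dist)

plt.hist(price, bins)                    # draw the histogram
plt.show()
```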
The histogram of Taobao commodity prices in the TAOB table, shown below, roughly follows a normal distribution:
The histogram of comment counts in the TAOB table, shown below, follows a roughly decreasing curve:
4. Word cloud drawing
Sometimes a word cloud needs to be drawn from a piece of text. The steps are as follows:
The general flow is: first use cut() to segment the document into words; arrange the segmented words into a fixed format; then read the image that gives the word cloud its required shape (a cat shape in the example below); generate the cloud with WordCloud; and finally display it with imshow(). For example, the word cloud drawn from the laojiumen.txt document looks like this:
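A sketch of this flow; laojiumen.txt and the cat-shaped mask come from the text, while the file name cat.png and the font path are placeholders you would adapt:

```python
import jieba
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

text = open("laojiumen.txt", encoding="utf-8").read()
words = " ".join(jieba.cut(text))        # cut() segments the document

mask = np.array(Image.open("cat.png"))   # image that shapes the cloud
wc = WordCloud(mask=mask, font_path="simhei.ttf",
               background_color="white").generate(words)

plt.imshow(wc)                           # display the word cloud
plt.axis("off")
plt.show()
```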
3. Introduction to common classification algorithms
There are many common classification algorithms; the main ones are described below:
Among them, the KNN and Bayesian algorithms are the most important; others include the decision tree, logistic regression, and SVM algorithms. The Adaboost algorithm is mainly used to turn a weak classification algorithm into a strong one.
4. Hands-on: iris classification
Suppose we have some iris data containing features such as petal length, petal width, sepal length, and sepal width. With this historical data we can train a classification model; once training is complete, the model can determine the type of a new iris of unknown type. The case can be implemented in different ways, so which classification algorithm works best?
1. KNN algorithm
Introduction to KNN algorithm:
First, consider the three types of Taobao products mentioned above: snacks, brand-name bags, and electrical appliances, each with the two features price and comment. By price, brand-name bags are the most expensive, followed by appliances and then snacks; by number of comments, snacks have the most, followed by appliances and then bags. Establish a Cartesian coordinate system with price on the x-axis and comment on the y-axis, and plot the distribution of the three types of goods, as shown in the figure below:
Clearly the three types of goods concentrate in different regions. Suppose a new product with known features appears, marked "?" in the figure; according to its features it maps to the position shown. Which of the three types of goods does it most likely belong to?
This kind of problem can be solved with the KNN algorithm. The idea is to compute the Euclidean distances between the unknown product and the other products and sort them; the smaller the distances to a class of goods, the more similar the unknown product is to that class. For example, if the computed distances between the unknown product and the electrical appliances are the smallest, the product can be classified as an electrical appliance.
Implementation method:
A concrete implementation of the above process follows.
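A minimal hand-rolled sketch of the idea, with hypothetical toy data; note that it classifies by majority vote among the k nearest neighbours, the standard form of KNN:

```python
import numpy as np
from collections import Counter

def knn(k, train_x, train_y, sample):
    # Euclidean distance from the unknown sample to every known one.
    dists = np.sqrt(((np.array(train_x) - np.array(sample)) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the k closest
    labels = [train_y[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]

train_x = [[1, 9], [2, 8], [8, 2], [9, 1]]     # [price, comment], toy values
train_y = ["snack", "snack", "bag", "bag"]
print(knn(3, train_x, train_y, [8.5, 1.5]))    # -> "bag"
```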
Of course, the integrated version can also be called directly, which is simpler and more convenient; the drawback is that the user does not see the underlying principle:
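A sketch of the ready-made route via sklearn, on the same toy data:

```python
from sklearn.neighbors import KNeighborsClassifier

train_x = [[1, 9], [2, 8], [8, 2], [9, 1]]   # same toy data as above
train_y = ["snack", "snack", "bag", "bag"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_x, train_y)
print(knn.predict([[8.5, 1.5]]))             # -> ['bag']
```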
Using the KNN algorithm to solve the iris classification problem:
First, load the iris data. There are two loading schemes. One is to read directly from an iris data file: after setting the path, use the read_csv() method to read the data set and separate its features from its results, as sketched below:
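A sketch of the file-based route; the path iris.csv and the column layout (four feature columns followed by the class column) are assumptions:

```python
import pandas as pd

iris = pd.read_csv("iris.csv")
X = iris.iloc[:, 0:4].values   # petal/sepal measurements (features)
y = iris.iloc[:, 4].values     # class labels (results)
```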
The other way is to load through sklearn, whose datasets module ships with the iris data set. Calling load_iris() loads the data, after which the features and categories can likewise be obtained and the training and test data separated (cross-validation is common). Specifically, the train_test_split() method performs the split, where test_size is the test ratio and random_state is the random seed. The steps are sketched below:
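A sketch of the sklearn route; test_size=0.3 and random_state=42 are example values:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target               # features and categories
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # test ratio, random seed
```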
After loading, the KNN classifier described above can be called for classification:
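A minimal sketch of training and evaluating KNN on the iris split; k=5 is an assumed choice:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)         # X_train/y_train from the split above
print(knn.score(X_test, y_test))  # accuracy on the held-out data
```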
2. Bayesian algorithm
Introduction to Bayesian algorithm:
First, the simple Bayes formula: P(B|A) = P(A|B)P(B) / P(A). Suppose we have the course data shown in the table below, where price and class hours are the course's features and sales volume is the outcome. If a new course has a high price and many class hours, its sales volume can be predicted from the existing data.
| Price (A) | Class hours (B) | Sales volume (C) |
| --- | --- | --- |
| low | many | high |
| high | medium | high |
| low | few | high |
| low | medium | low |
| medium | medium | medium |
| high | many | high |
| low | few | medium |
This is obviously a classification problem. First the table is processed: features 1 and 2 are converted to numbers, with 0 representing low (or few), 1 representing medium, and 2 representing high (or many). After digitizing, the first three courses, stored as [[t1, t2], [t1, t2], [t1, t2]], become [[0, 2], [2, 1], [0, 0]]. This two-dimensional list is then transposed (to make subsequent counting easier) into [[t1, t1, t1], [t2, t2, t2]], i.e. [[0, 2, 0], [2, 1, 0]], where [0, 2, 0] holds the price of each course and [2, 1, 0] holds its number of class hours.
The original problem is then equivalent to finding the probability of high, medium, and low sales for the new course given a high price (A) and many class hours (B). By Bayes' formula and the naive independence assumption, P(C|AB) = P(AB|C)P(C) / P(AB) = P(A|C)P(B|C)P(C) / P(AB). Since P(AB) is the same for all classes, only P(A|C)P(B|C)P(C) needs to be compared, where C takes three values: c0 = high, c1 = medium, c2 = low. Counting from the table:

P(c0|AB) ∝ P(A|c0)P(B|c0)P(c0) = 2/4 × 2/4 × 4/7 = 1/7
P(c1|AB) ∝ P(A|c1)P(B|c1)P(c1) = 0, since no medium-sales course has a high price
P(c2|AB) ∝ P(A|c2)P(B|c2)P(c2) = 0, since no low-sales course has a high price

P(c0|AB) is clearly the largest, so the predicted sales volume of the new course is high.
Implementation method:
Like KNN, the Bayesian algorithm can be implemented in two ways. The first is a detailed, hand-rolled implementation:
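A sketch of the hand-rolled computation, encoding the digitized course table from above (price: 0=low, 1=medium, 2=high; class hours: 0=few, 1=medium, 2=many; sales likewise):

```python
import numpy as np

# Columns: price, class hours; y: sales volume.
X = np.array([[0, 2], [2, 1], [0, 0], [0, 1], [1, 1], [2, 2], [0, 0]])
y = np.array([2, 2, 2, 0, 1, 2, 1])

def predict(a, b):
    scores = {}
    for c in np.unique(y):
        rows = X[y == c]
        p_c = len(rows) / len(X)          # P(C)
        p_a = (rows[:, 0] == a).mean()    # P(A|C)
        p_b = (rows[:, 1] == b).mean()    # P(B|C)
        scores[c] = p_a * p_b * p_c       # numerator of Bayes' formula
    return max(scores, key=scores.get)

print(predict(2, 2))   # high price, many hours -> 2 (high sales)
```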
The other is the integrated implementation:
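A sketch of the integrated route via sklearn; GaussianNB is one reasonable choice for the continuous iris features, reusing the train/test split from earlier:

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)          # iris split from the previous section
print(nb.score(X_test, y_test))   # accuracy on the held-out data
```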
3. Decision tree algorithm
The decision tree algorithm is built on the theory of information entropy and proceeds in several steps:
A decision tree means that for data with multiple features, we first ask whether the first feature is considered (0 for no, 1 for yes), forming a binary split; the second feature is then considered in the same way, and so on, until all features have been considered and a decision tree is formed. An example decision tree is shown below:
The decision tree algorithm is implemented as follows: first extract the categories of the data, then convert the textual descriptions to numbers (for example, "yes" to 1 and "no" to 0); build a decision tree with sklearn's DecisionTreeClassifier and train it with the fit() method; after training, call predict() to obtain predictions; finally, visualize the tree with export_graphviz. A sketch of the process follows:
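A minimal sketch of this flow on the iris split; criterion="entropy" matches the information-entropy basis mentioned above, and tree.dot is a placeholder output path:

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

dtc = DecisionTreeClassifier(criterion="entropy")
dtc.fit(X_train, y_train)          # train on the iris training data
print(dtc.predict(X_test))         # predicted classes for the test data

with open("tree.dot", "w") as f:   # visualize the tree via Graphviz
    export_graphviz(dtc, out_file=f)
```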
4. Logistic regression algorithm
The logistic regression algorithm is built on the principle of linear regression. Suppose there is a linear regression function y = a1x1 + a2x2 + a3x3 + ... + anxn + b, where x1 through xn represent the features. Although this line can fit the data, its robustness as a classifier is poor because y has an unbounded range. To classify, the range of y must be squeezed into a bounded interval such as [0, 1], which can be done by substitution:
Let y = ln(p / (1 - p)). Then:

e^y = e^(ln(p / (1 - p)))  =>  e^y = p / (1 - p)
=>  e^y (1 - p) = p
=>  e^y - p·e^y = p
=>  e^y = p (1 + e^y)
=>  p = e^y / (1 + e^y)  =>  p ∈ [0, 1]
In this way, the range of y is mapped into [0, 1], accurate classification becomes possible, and logistic regression is realized.
A sketch of the corresponding implementation follows:
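A minimal sketch using sklearn's integrated implementation, again reusing the iris split; max_iter=200 is an assumption to ensure convergence:

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=200)
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))   # accuracy on the held-out data
```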
5. SVM algorithm
The SVM algorithm is an accurate classification algorithm, but its interpretability is not strong. It transforms a linearly inseparable problem in a lower-dimensional space into a linearly separable one in a higher-dimensional space. Using SVM is very simple: import SVC, train the model, and predict, as sketched below:
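A minimal sketch on the iris split; the kernel argument selects the kernel function discussed next:

```python
from sklearn.svm import SVC

svm = SVC(kernel="rbf")            # radial basis function kernel
svm.fit(X_train, y_train)
print(svm.predict(X_test))         # predicted classes for the test data
```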
Although using it is simple, the key to the algorithm is the choice of kernel function. Kernels fall into several categories, each suited to different situations:
For data that is not particularly complex, a linear or polynomial kernel can be used; for complex data, the radial basis function kernel is used. The decision boundaries drawn by each kernel are shown in the figure below:
6. Adaboost algorithm
Suppose there is a single-layer decision tree algorithm; it is a weak classifier (an algorithm with very low accuracy). To strengthen a weak classifier, the Boost idea can be applied, for example with the Adaboost algorithm: run multiple iterations, assign different weights in each round, compute the error rate and adjust the weights accordingly, and finally combine the rounds into an overall result.
Adaboost is generally not used alone but in combination with another algorithm, to strengthen algorithms that classify weakly:
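A sketch of boosting a weak learner: a depth-1 decision tree (a decision stump) strengthened with AdaBoost, again on the iris split; n_estimators=50 is an example value:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

weak = DecisionTreeClassifier(max_depth=1)       # the weak classifier
ada = AdaBoostClassifier(weak, n_estimators=50)  # 50 boosting rounds
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```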
5. Ideas and tips for choosing a classification algorithm
First, let’s see if it’s a dichotomous problem or a multi-classification problem. If it’s a dichotomous problem, generally these algorithms can be used; If it is a multi-classification problem, KNN and Bayesian algorithms can be used. Secondly, whether high interpretability is required, if high interpretability is required, then SVM algorithm cannot be used. Then look at the number of training samples and the number of training samples. If the number of training samples is too large, KNN algorithm is not suitable. Finally, see whether it is necessary to carry out weak-strong algorithm transformation, if necessary, use Adaboost algorithm, otherwise do not use Adaboost algorithm. If in doubt, select some data for validation and model evaluation (time-consuming and accurate).
The original article was published on April 5, 2018
Author: Wei Wei
This article comes from the cloud community partner "Datapai THU". For related information, follow the "Datapai THU" WeChat official account.