Preface

I have recently been doing research and work related to machine learning and data mining, and I am writing this series of articles to summarize and organize what I have learned. If you are not familiar with Python, you can refer to my other series on Python basics for data mining, or follow the public account QStack, which has all the articles and learning materials.

What is data mining

In my opinion, KDD sums up data mining well. KDD stands for Knowledge Discovery in Databases, that is, discovering knowledge from data. It is often said that we are drowning in a sea of data but starved for information, in other words, for knowledge. The goal of data mining is to use data mining techniques to find interesting “knowledge” and “patterns” in this ocean of data.

Data mining related technologies

As mentioned above, data mining means finding interesting knowledge and patterns in massive amounts of data. So how is that done? It involves a number of data mining concepts and techniques. Let’s look at some specific techniques and how they are applied.

Frequent patterns and correlation analysis

A frequent pattern is, literally, a pattern that occurs frequently, and correlation is the connection between things. The concept may seem abstract, but we encounter it all the time, most famously in the story of beer and diapers. At first glance beer and diapers have nothing to do with each other, but as mentioned above, data mining is all about finding interesting patterns, which is part of its appeal. By analyzing and mining a large amount of data, researchers found that customers who buy diapers often also buy beer, so the store placed beer and diapers together. As a result, sales of both increased substantially.

In this example, beer and diapers frequently appear together in people’s shopping baskets, which makes them a frequent pattern. For a frequent pattern we need to pay attention to two measures: support and confidence. Support here is the number of customers who buy both beer and diapers divided by the total number of customers. If the support is too low, the pattern may be a special case and has little mining value. Confidence here is the number of customers who buy both beer and diapers divided by the number of customers who buy diapers; it indicates how strongly the two are associated. The higher the confidence, the stronger the association and the more credible the pattern.
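The two measures can be sketched in a few lines of Python. The baskets below are made-up data, and the function names `support` and `confidence` are chosen here only for illustration; the confidence is computed for the rule "diapers, then beer", matching the definition above.

```python
# Hypothetical shopping baskets; each set is one customer's purchase.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

With this toy data, two of the five baskets contain both items and three contain diapers, so the support is 2/5 and the confidence is 2/3.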

This is a basic application of data mining in business. A single discovered pattern may be worth millions in profit, which is why companies pay more and more attention to data mining. There are many algorithms for frequent pattern mining; their ideas and implementations will be introduced in later articles.

Classification

Classification uses a training data set to find a model that describes and distinguishes the classes of data, and then uses this model to predict the class label of unknown data. The concept is abstract, so let’s use a decision tree, a commonly used classification method, to explain the process in simple terms. Take customer segmentation as an example: customers of the same type are likely to have similar interests. If a user buys a product, other users in the same class are very likely to be interested in that product too, so we can recommend it to the others in that class, which is much more efficient than untargeted marketing. To keep the example simple, we consider only two attributes, age and income. The classification process is shown in the following figure. Using these two attributes we can classify users into classes A, B, and C. When a new user arrives whose age is youth and whose income is high, we can assign him to class A and then act on that, for example by recommending products. The purpose of classification is to build such a model and then use it to predict the classes of unknown data.
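To make the prediction step concrete, here is a minimal sketch of a decision tree over the two attributes. The tree is written by hand rather than learned from training data, and apart from "youth with high income is class A", every split and label in it is an assumption made up for illustration.

```python
def classify(age, income):
    """Walk a hand-written decision tree and return a class label.

    The splits and labels are illustrative assumptions, not learned from data.
    """
    if age == "youth":
        return "A" if income == "high" else "B"
    if age == "middle-aged":
        return "B" if income == "high" else "C"
    return "C"  # every other age group

# A new user whose class is unknown.
new_user = {"age": "youth", "income": "high"}
label = classify(new_user["age"], new_user["income"])
print(label)
```

A real classifier would learn these splits from labeled training data; only the way a new record is routed through the tree is shown here.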

Clustering

In the classification described above, the classes of the training data are known. Classification has to find the conditions that best distinguish these classes and minimize the probability of misclassification. In clustering, there are no known class labels; rather, the purpose of clustering is to find them. As shown in the figure, we do not know the categories of these points before we start, but a clustering algorithm such as k-means can group them into three clusters, as shown by the circles. In real life, these points are users, and clustering lets us group users.
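As a rough sketch of how unlabeled points can be grouped into three clusters, the following uses k-means as the example algorithm. The points, the starting centroids, and the function name `kmeans` are all assumptions for illustration; fixed starting centroids are used so that the result is repeatable.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it in place if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical 2-D points with no class labels.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
clusters, centers = kmeans(points, centroids=[(0, 0), (10, 10), (0, 10)])
for cluster in clusters:
    print(cluster)
```

In practice the starting centroids are usually chosen at random and the number of clusters has to be decided in advance.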


Outlier detection

The clustering technique above lets us group similar users together: birds of a feather flock together. However, some users are very different from everyone else, like the red dot in the figure below. We call such a point an outlier. Outliers have been studied extensively. In credit card fraud, for example, the fraudulent spending behavior is very different from that of most users, which makes it an outlier. Outlier detection is therefore widely applied in risk control.
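One simple way to flag such points is the z-score, which measures how many standard deviations a value lies from the mean. The sketch below is only one of many possible approaches; the transaction amounts, the threshold of 2.0, and the function name `find_outliers` are assumptions for illustration.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds threshold."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 20, 22, 500]
outliers = find_outliers(amounts)
print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, this check is best seen as a first pass rather than a robust fraud detector.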

Finally

A like is the biggest support. For more articles and learning materials, follow the WeChat public account QStack.
