With the rapid development of database technology and the widespread use of database management systems, people are accumulating more and more data. A great deal of important information lies behind this explosion of data, and people want to analyze it at a higher level in order to make better use of it. Current database systems can efficiently handle data entry, querying, statistics and similar functions, but they cannot discover the relationships and rules within the data, nor can they predict future trends from existing data. The lack of means to mine the knowledge hidden behind data has led to the phenomenon of “data explosion but knowledge poverty”.
1. Concept of data mining
Data mining technology is the result of long-term research and development in database technology. At first, business data of all kinds was simply stored in computer databases; later, databases could be queried and accessed; later still, they could be traversed in real time. Data mining takes database technology to a more advanced stage: it can not only query and traverse past data, but also uncover potential relationships within that data, thereby promoting the transmission of information. Data mining is now ready for commercial use because the three basic technologies that support it have matured: massive data collection, powerful multiprocessor computers, and data mining algorithms.
From a technical point of view, data mining is the process of extracting hidden, previously unknown but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy and random real-world application data. This definition has several implications: the data source must be real, massive, and noisy; what is discovered is knowledge that the user is interested in; the knowledge discovered should be acceptable, understandable, and usable; and the knowledge need not be universal, only sufficient to support a specific problem.
There are many similar terms, such as knowledge discovery in databases (KDD), data analysis, data fusion, and decision support.
What is knowledge? In a broad sense, data and information are also forms of knowledge, but people usually regard concepts, rules, patterns, laws and constraints as knowledge. The raw data can be structured, such as data in a relational database; semi-structured, such as text, graphics, and image data; or even heterogeneous data distributed across a network. The methods of discovering knowledge can be mathematical or non-mathematical, deductive or inductive. The discovered knowledge can be used for information management, query optimization, decision support and process control, as well as for the maintenance of the data itself.
Data mining is therefore an interdisciplinary subject that elevates the use of data from simple low-level queries to mining knowledge from data and providing decision support. Driven by this demand, researchers from different fields, especially scholars and engineers in database technology, artificial intelligence, mathematical statistics, visualization and parallel computing, have come together to work on the emerging research field of data mining, making it a new technical hotspot.
From a business point of view, data mining is a new business information processing technology. Its main characteristic is that it extracts, transforms, analyzes and models the large volumes of business data held in business databases, and extracts the key data that assists business decisions.
In short, data mining is a method of in-depth data analysis. Data analysis itself has been around for many years, but in the past data was collected and analyzed mainly for scientific research. In addition, the limited computing power of the time severely restricted the use of complex methods on large amounts of data. Now, thanks to the automation of business across industries, the business world is generating vast amounts of data that is no longer collected for analytical purposes but produced incidentally, as a by-product of business operations. The analysis of this data is no longer purely for research; it is mainly intended to provide truly valuable information for business decisions, and thereby to generate profit. However, a problem common to all enterprises is that the volume of enterprise data is very large while the information of real value in it is very small. Obtaining information that benefits business operations and improves competitiveness through in-depth analysis of large amounts of data is therefore like panning for gold in ore, which is how data mining got its name.
Data mining can therefore be described as an advanced and effective method that, guided by the established business objectives of the enterprise, explores and analyzes large amounts of enterprise data, reveals hidden and unknown regularities or verifies known ones, and further models them.
The essential difference between data mining and traditional data analysis (such as queries, reports, and online analytical processing) is that data mining mines information and discovers knowledge without explicit hypotheses. The information obtained from data mining should be previously unknown, valid and practical.
Previously unknown information is information that was not anticipated in advance; that is, data mining aims to discover information or knowledge that cannot be found by intuition, or that is even counterintuitive. The more unexpected the mined information is, the more valuable it may be. The most typical business example is a chain store that found a surprising link between diapers and beer through data mining.
In particular, data mining has been application-oriented since its inception. It is not just simple retrieval and querying of a specific database; it also performs micro-, medium- and even macro-level statistics, analysis, synthesis and reasoning on the data to guide the solution of practical problems, trying to find correlations between events and even to use existing data to predict future activity. For example, the BC telephone company in Canada asked the knowledge discovery research group at Simon Fraser University to summarize and analyze more than 10 years of its customer data, propose new telephone charging and management methods, and formulate preferential policies that benefit both the company and its customers. In this way, the use of data is raised from low-level terminal query operations to providing decision support for decision makers at every level of the business. This demand is a more powerful driving force than database queries.
2. Data mining functions
Data mining supports proactive, knowledge-based decisions by predicting future trends and behaviors. Its goal is to find hidden and meaningful knowledge in the database, and it mainly provides the following five functions.
Here, proactive means forward-looking: taking a step ahead and bringing a situation under control, rather than being overwhelmed by it once it has become difficult.
(1) Automatic prediction of trends and behavior
Data mining automatically finds predictive information in large databases, so problems that previously required a great deal of manual analysis can now be answered quickly and directly from the data itself. A typical example is market forecasting, where data mining uses data on past promotions to find the users who will yield the highest return on future investment. Other predictable problems include forecasting bankruptcies and identifying the groups most likely to respond to a given event.
(2) Association analysis
Data association is an important kind of discoverable knowledge in databases. If there is some regularity between the values of two or more variables, it is called an association. Associations can be divided into simple associations, temporal associations and causal associations. The purpose of association analysis is to find the hidden associations in the database. Sometimes the association function of the data is not known, and even when it is known it is uncertain, so the rules generated by association analysis carry a measure of confidence.
(3) Clustering
Records in a database can be divided into a series of meaningful subsets, known as clusters. Clustering enhances people’s understanding of objective reality and is a prerequisite for concept description and deviation analysis. Clustering techniques mainly include traditional pattern recognition methods and mathematical taxonomy. In the early 1980s, Michalski proposed conceptual clustering, whose key point is that when dividing objects we should consider not only the distance between objects but also require each class to have a descriptive concept, so as to avoid some of the one-sidedness of traditional techniques.
(4) Concept description
Concept description describes the connotation of a class of objects and summarizes its relevant characteristics. It is divided into characteristic description and discriminant description: the former describes the features common to objects of a class, while the latter describes the differences between objects of different classes. Generating a characteristic description of a class involves only the commonalities of all objects in that class. There are many methods for generating a discriminant description, such as decision trees and genetic algorithms.
(5) Deviation detection
The data in a database often contains anomalous records, and it is worthwhile to detect these deviations. Deviations carry a lot of potential knowledge, such as abnormal instances in a classification, special cases that do not satisfy the rules, differences between observed results and the values predicted by a model, and changes in a quantity over time. The basic method of deviation detection is to look for meaningful differences between an observed result and a reference value.
3. Common techniques of data mining
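As a minimal sketch of the basic method just described, the following Python compares each observation with a reference value and flags the ones whose difference is large. The use of the sample mean as the reference value, the standard deviation as the scale, the threshold of 2.0, the helper name `find_deviations` and the sales figures are all illustrative assumptions; the text does not prescribe any of them.

```python
from statistics import mean, pstdev

def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs far from the reference value."""
    reference = mean(values)      # assumed reference value: the sample mean
    spread = pstdev(values)       # assumed scale: population standard deviation
    if spread == 0:
        return []                 # all values identical, so nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - reference) / spread > threshold]

# Invented daily sales figures with one obvious anomaly.
daily_sales = [100, 102, 98, 101, 99, 100, 180, 97]
print(find_deviations(daily_sales))
```

In practice the reference value could just as well be a model prediction or a historical baseline; only the comparison step would stay the same.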
The most common and widely used data mining methods are:
- Decision trees. Mutual information (information gain) from information theory is used to find the attribute that carries the most information in the database, and a node of the decision tree is created for it; branches are then built for the different values of that attribute, and the process of creating lower-level nodes and branches is repeated within the subset belonging to each branch. The earliest and most influential decision tree method is the ID3 method developed by Quinlan.
- Neural networks. A neural network simulates the structure of neurons in the human brain and performs functions such as discrimination, regression and clustering, much as statistical methods do. It is a nonlinear model. There are three main kinds of neural network model: feedforward networks, feedback networks and self-organizing networks. The biggest advantage of artificial neural networks is that they can learn automatically from data and form knowledge, some of which was previously undiscovered, so they have a strong capacity for novelty. The knowledge of a neural network is embodied in the weights of its connections, and its learning consists mainly of the step-by-step adjustment of those weights.
- Genetic algorithms. A genetic algorithm consists of three basic operations: reproduction (selection), crossover (recombination) and mutation. It produces progressively better offspring, and after several generations an offspring that meets the requirements is taken as the solution to the problem.
- Association rule mining. Association rules describe relationships among data items. Mining generally proceeds in two steps: first find the large (frequent) itemsets, then generate association rules from them.
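The attribute-selection step described for decision trees in the list above can be sketched in a few lines of Python. This is only the information-gain calculation for choosing one node, not a full ID3 implementation; the function names and the tiny weather-style data set are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Attribute with the largest information gain, i.e. the next tree node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Invented records: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
print(best_attribute(rows, ["outlook", "windy"], "play"))
```

A tree builder would call `best_attribute` on the full data, split the rows by the chosen attribute's values, and repeat on each subset.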
In addition to these commonly used methods, there are rough set methods, fuzzy set methods, nearest neighbor algorithms and others. Whichever method is used, the analysis methods of data mining can be divided into six kinds: association analysis, sequence analysis, classification, prediction, cluster analysis and time series analysis.
(1) Association analysis
Association analysis is mainly used to discover correlations between different events, that is, cases where one event often occurs when another does. Its focus is on quickly discovering associations that are of practical use. The main criterion is that the probability and conditional probability of the events should be statistically significant.
For structured data, taking customers’ purchasing habits as an example, association analysis can be used to find related purchasing needs. For example, a customer who opens a savings account is likely also to trade bonds and stocks, and a male customer who buys diapers often buys beer at the same time. This knowledge can be used in active marketing strategies that expand the range of products customers buy and attract more customers: for instance, adjusting the layout of goods so that items often bought together are easy to buy together, or reducing the price of one product to promote the sale of another.
For unstructured data, taking spatial data as an example, association analysis can be used to find associations between geographical locations, for example that 85% of large towns near highways are adjacent to water, or which objects are usually adjacent to golf courses.
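The probability and conditional probability that association analysis relies on are commonly expressed as the support and confidence of a rule. The sketch below computes both for the diapers-and-beer example; the basket contents and the function names are invented for illustration, and a real system would also apply minimum thresholds to both measures.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated conditional probability of `consequent` given `antecedent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0

# Invented shopping baskets, each one a set of purchased items.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]
print(support(baskets, {"diapers", "beer"}))       # share of baskets with both
print(confidence(baskets, {"diapers"}, {"beer"}))  # share of diaper baskets with beer
```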
(2) Sequence analysis
Sequence analysis is mainly used to discover events that occur one after another within a certain time interval. These events form a sequence, and the sequences discovered should be of general significance; the criterion includes time constraints in addition to statistical probability.
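One simple way to illustrate the idea is to count, across customers, how often one event is followed by another within a time window. The sketch below handles only sequences of length two; the function name, the `max_gap` parameter and the purchase histories are assumptions made for the example rather than a standard algorithm from the text.

```python
from collections import Counter

def sequential_pairs(histories, max_gap):
    """Count customers for whom item a is followed by item b within max_gap time units."""
    counts = Counter()
    for events in histories.values():
        events = sorted(events)            # (time, item) tuples ordered by time
        seen = set()
        for i, (t1, a) in enumerate(events):
            for t2, b in events[i + 1:]:
                if t2 - t1 > max_gap:
                    break                  # later events are even further away
                if a != b and t2 > t1:
                    seen.add((a, b))
        counts.update(seen)                # each customer counts once per pair
    return counts

# Invented purchase histories: customer id -> list of (day, item).
histories = {
    "c1": [(1, "printer"), (5, "ink"), (40, "paper")],
    "c2": [(2, "printer"), (9, "ink")],
    "c3": [(3, "ink"), (4, "paper")],
}
pairs = sequential_pairs(histories, max_gap=10)
print(pairs.most_common(1))
```

The count per pair plays the role of the statistical criterion, and `max_gap` plays the role of the time constraint.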
(3) Classification analysis
Classification analysis examines the characteristics of samples whose categories are known in order to obtain rules or methods for deciding which category a sample belongs to. These rules and methods should classify unknown samples with a certain accuracy. The main methods include the Bayesian statistical method, neural networks, decision trees and support vector machines.
Classification can be used to group customers by consumption level and basic characteristics, to identify the characteristics of the important customers who contribute most to the business, and to improve their loyalty through personalized service.
Classification can also be applied to large amounts of semi-structured text data, such as web pages and e-mails. Images can be classified too, for example by deriving rules that determine the type of an image from the features and categories of existing images. Spatial data can likewise be analyzed by classification; for example, a house can be assigned to a category according to its geographical location.
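To make the idea of learning from labelled samples concrete, here is a minimal sketch using the nearest neighbor method mentioned earlier among the other techniques, chosen only because it is short; it is not one of the main classification methods listed above. The customer features (annual spend, visits per month), the category names and the function name are invented.

```python
from math import dist

def classify(sample, labelled_samples):
    """Assign `sample` the category of its nearest labelled neighbor."""
    _, nearest_label = min(labelled_samples,
                           key=lambda pair: dist(pair[0], sample))
    return nearest_label

# Invented labelled customers: ((annual spend, visits per month), category).
customers = [
    ((200, 1), "occasional"),
    ((250, 2), "occasional"),
    ((1800, 12), "important"),
    ((2100, 15), "important"),
]
print(classify((1900, 10), customers))
```

The features here are left unscaled, so the spend figure dominates the distance; a practical version would normalize each feature first.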
(4) Cluster analysis
Cluster analysis is the process of gathering samples that have no categories into different groups, on the principle that birds of a feather flock together, and of describing each such group. The main criterion is that samples in the same group should be similar to each other, while samples in different groups should be sufficiently dissimilar.
Taking customer relationship management as an example, clustering can be used to segment customers according to their personal characteristics and consumption data. For example, one consumer group might be described as follows: 91% are female, 70% are childless and aged between 31 and 40, 64% are high-end consumers, 91% buy hosiery, 89% buy kitchen supplies and 79% buy gardening supplies. Different marketing and service methods can then be applied to different customer groups to improve customer satisfaction.
For spatial data, regions can be divided automatically according to geographical location and the presence of obstacles. For example, residents can be divided into regions according to the ATMs distributed across different locations. Based on this information, ATM installation can be planned effectively, avoiding waste while not missing business opportunities.
For text data, clustering can be used to group documents automatically according to their content, which facilitates text retrieval.
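The grouping principle described above (similar samples together, dissimilar samples apart) can be sketched with k-means, one widely used clustering algorithm that the text itself does not name. The choice of the first k points as starting centroids, the fixed iteration count and the sample points are simplifying assumptions made to keep the example deterministic.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly reassigning them to the nearest centroid."""
    centroids = list(points[:k])       # assumption: first k points as initial centroids
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:                # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Invented 2-D points forming two visibly separate groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

Each resulting centroid can serve as a rough description of its group, in the spirit of the customer-segment profile given above.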
(5) Prediction
Prediction is similar to classification, but prediction estimates the value of a continuous variable from the known characteristics of a sample, whereas classification only determines the discrete category to which the sample belongs. A common prediction technique is regression analysis.
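As an illustration of regression analysis in its simplest form, the sketch below fits a straight line by ordinary least squares and uses it to estimate a continuous value. The variable names and the advertising and sales figures are invented, and a single predictor is assumed.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data in which sales grow linearly with advertising spend.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]
slope, intercept = fit_line(ad_spend, sales)
print(slope * 6 + intercept)   # estimated sales for a spend of 6
```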
(6) Time series
Time series analysis studies a series of events that change over time, with the purpose of predicting future trends, finding similar development patterns, or discovering periodic regularities.
4. Data mining process
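Two of these purposes, predicting the next value and finding a period, can be sketched with very simple tools. The moving-average forecast and the lag comparison below are deliberately naive stand-ins for proper time series models; the function names and the sample series are assumptions made for the example.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    return sum(series[-window:]) / window

def best_period(series, max_lag):
    """Lag (1..max_lag) at which the series best matches a shifted copy of itself."""
    def mismatch(lag):
        pairs = zip(series, series[lag:])
        return sum(abs(a - b) for a, b in pairs) / (len(series) - lag)
    return min(range(1, max_lag + 1), key=mismatch)

# Invented series that repeats every three steps.
series = [10, 20, 30, 10, 20, 30, 10, 20, 30]
print(moving_average_forecast(series))
print(best_period(series, max_lag=4))
```

`max_lag` must be smaller than the length of the series, otherwise there is nothing left to compare.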
Data mining refers to the complete process of mining previously unknown, valid, and usable information from large databases and using this information to make decisions or enrich knowledge.
A schematic diagram of the data mining environment is shown in Figure 1.
(1) Problem definition
Before data mining begins, the first and most important requirement is to be familiar with the background knowledge and understand the needs of users. Without background knowledge, the problem to be solved cannot be clearly defined, good data cannot be prepared for mining, and the results cannot be properly interpreted. To get the most out of data mining, the goal must be clearly defined; that is, you must decide what you want to do.
(2) Establish the data mining database
To carry out data mining, the data resources to be mined must first be collected. It is generally recommended to gather all the data to be mined into a single dedicated database rather than using an existing database or data warehouse directly. This is because in most cases the data to be mined will need to be modified, and external data will sometimes be used as well; in addition, data mining requires a variety of complex statistical analyses, and the data warehouse may not support the data structures these require.
(3) Analyze data
Analyzing data is the process of investigating the data in depth. It looks for regularities and trends in the data set and uses cluster analysis to distinguish categories; the ultimate goal is to clarify the complicated relationships among multiple factors and find the correlations between them.
(4) Data adjustment
After the steps above, the state and trends of the data are better understood, and the requirements of the problem can be further clarified and quantified. Data is added or removed according to the requirements of the problem, and variables are combined or new ones generated according to the new understanding gained during the mining process, so that the data describes the state of affairs effectively.
(5) Modeling
Once the problem has been further clarified and the structure and content of the data further adjusted, a model for forming knowledge can be built. This step is the core of data mining; the model is generally built with methods such as neural networks, decision trees, mathematical statistics and time series analysis.
(6) Evaluation and explanation
The model obtained above may have no practical significance or value, may not accurately reflect the real meaning of the data, or may in some cases even contradict the facts. It is therefore necessary to evaluate the models and determine which are effective and useful. One method of evaluation is to test directly on the data from the mining database built earlier; another is to find a separate batch of data and test on that; a third is to test on fresh data taken from the actual operating environment.
The data mining process is carried out step by step, and different steps require people with different expertise, who can be broadly divided into three categories.
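The difference between the first two evaluation methods can be shown with a small, deliberately extreme sketch: a model that merely memorizes its training records scores perfectly on the data it was built from and much worse on a separate batch. The `memorising_model` helper, the records and the labels are all invented for illustration.

```python
def accuracy(model, samples):
    """Fraction of samples for which the model predicts the recorded label."""
    correct = sum(model(features) == label for features, label in samples)
    return correct / len(samples)

def memorising_model(training):
    """Hypothetical worst-case model: looks up records it has already seen."""
    table = dict(training)
    return lambda features: table.get(features, "unknown")

# Invented records: (features, label).
training = [((1, 0), "yes"), ((0, 1), "no"), ((1, 1), "yes")]
held_out = [((0, 0), "no"), ((1, 0), "yes")]

model = memorising_model(training)
print(accuracy(model, training))   # tested on the data the model was built from
print(accuracy(model, held_out))   # tested on a separate batch of data
```

The gap between the two figures is one reason a separate batch, or fresh data from the operating environment, tends to give a more trustworthy assessment.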
- Business analyst. Must be proficient in the business, able to interpret business objects and, for each business object, identify the business requirements that determine data definition and the choice of mining algorithms.
- Data analyst. Proficient in data analysis techniques and statistics, able to translate business requirements into the individual steps of data mining and to select appropriate techniques for each step.
- Data manager. Proficient in data management techniques and responsible for collecting data from databases or data warehouses.
As can be seen from the above, data mining is a process of cooperation among various experts, and also one that demands a high investment of capital and technology. The process should be repeated; with each iteration it comes closer to the essence of the problem and further optimizes the solution.