Here are some knowledge points about big data mining; let's go through them together.

1. Data, information and knowledge are different forms of generalized data representation.

2. The main types of knowledge models are generalized knowledge, relational knowledge, class knowledge, predictive knowledge, and special knowledge.

3. The main branches of Web mining research are Web structure mining, Web usage mining, and Web content mining.

4. Generally speaking, KDD is a multi-step process, usually divided into basic stages such as problem definition, data extraction, data preprocessing, data mining, and pattern evaluation.

5. Knowledge discovery process models for databases include: the stepwise process model, the spiral process model, the user-centered process model, the online KDD model, and KDD process models that support multiple data sources and multiple knowledge modes.

6. Roughly speaking, the development of knowledge discovery software and tools has gone through three main stages: standalone knowledge discovery software, horizontal knowledge discovery tool sets, and vertical knowledge discovery solutions; the latter two reflect the two main current development directions of knowledge discovery software.

7. The establishment of decision tree classification model is usually divided into two steps: decision tree generation and decision tree pruning.

8. In terms of the main techniques used, classification methods can be grouped into four types:

Distance-based classification methods

Decision tree classification methods

Bayesian classification methods

Rule induction methods

9. Association rule mining can be divided into two sub-problems:

Discover frequent itemsets: find all frequent itemsets (or the maximal frequent itemsets) whose support is no less than the user-specified minimum support (minsupport).

Generate association rules: from the frequent itemsets, derive the rules whose confidence is no less than the user-specified minimum confidence (minconfidence). A small sketch of both sub-problems follows.
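
Below is a minimal Python sketch of the two sub-problems on a toy in-memory transaction set. The transactions and the minsupport/minconfidence values are invented for illustration, and the level-wise enumeration is naive rather than a full Apriori implementation.

```python
from itertools import combinations

# Toy transaction database; minsupport/minconfidence are illustrative values.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
minsupport = 0.4     # minimum fraction of transactions containing the itemset
minconfidence = 0.7  # minimum confidence for a rule X -> Y

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Sub-problem 1: discover all frequent itemsets (naive level-wise enumeration).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    level = {}
    for c in combinations(items, k):
        s = support(set(c))
        if s >= minsupport:
            level[frozenset(c)] = s
    if not level:
        break
    frequent.update(level)

# Sub-problem 2: generate rules X -> Y with confidence >= minconfidence.
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = sup / frequent[lhs]
            if conf >= minconfidence:
                rhs = itemset - lhs
                print(f"{set(lhs)} -> {set(rhs)}  support={sup:.2f} confidence={conf:.2f}")
```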

10. Data mining was proposed and has developed on the basis of the maturing of related disciplines. The main related technologies are as follows:

The development of information technologies such as databases

In-depth application of statistics

Research and application of artificial intelligence technology

11. The effectiveness of association rule mining results should be measured from several complementary perspectives:

Accuracy: Mined rules must reflect the reality of the data.

Utility: Mined rules must be concise and usable.

Novelty: Mined association rules can provide users with new and valuable information.

12. Common types of constraints are:

Monotone constraints;

Antimonotone constraints;

Convertible constraints;

Succinct constraints.

13. According to the levels involved in the rules, multi-level association rules can be divided into:

Same-level association rules: An association rule is a same-level association rule if the items corresponding to it are of the same granularity.

Cross-level association rules: if the items in a rule are at different levels of granularity, the rule is a cross-level association rule.

14. According to the main ideas of clustering analysis algorithm, clustering methods can be summarized as follows.

Partitioning methods: divide the data into groups according to certain criteria.

Clustering methods in this category include: K-means, K-modes, K-prototypes, K-medoids, PAM, CLARA, CLARANS, etc.

Hierarchical methods: perform a hierarchical decomposition of the given set of data objects.

Density-based methods: grow clusters according to the connectivity and density of the data objects.

Grid-based methods: quantize the data space into a finite number of cells that form a grid structure, and perform clustering on that grid.

Model-based methods: hypothesize a model for each cluster and look for the data that best fit that model.

15. The main measures of distance between classes are:

Shortest distance method: Defines the distance between the two closest elements of two classes as the interclass distance.

Longest distance method: Defines the distance between the two farthest elements of two classes as the interclass distance.

Center method: Define the distance between two centers of two classes as the interclass distance.

Class average method: the distances between all pairs of elements drawn from the two classes are computed and combined (averaged) to give the inter-class distance. A related criterion is the sum of squared deviations (Ward's method). A small sketch of these measures follows.
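
As a rough illustration, here is a small NumPy sketch that computes the shortest-distance, longest-distance, center, and class-average measures between two toy clusters; the point coordinates are made up.

```python
import numpy as np

# Two toy clusters (rows are points); the values are illustrative only.
A = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
B = np.array([[4.0, 4.0], [5.0, 4.5]])

# Pairwise distances between every element of A and every element of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

d_min = pairwise.min()                                        # shortest distance method
d_max = pairwise.max()                                        # longest distance method
d_centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # center method
d_avg = pairwise.mean()                                       # class average method

print(f"single={d_min:.3f} complete={d_max:.3f} "
      f"centroid={d_centroid:.3f} average={d_avg:.3f}")
```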

16. Hierarchical clustering methods can be specifically divided into:

Agglomerative hierarchical clustering: a bottom-up strategy that first treats each object as its own cluster and then merges these atomic clusters into larger and larger clusters until some termination condition is met.

Divisive hierarchical clustering: a top-down strategy that first places all objects in one cluster and then subdivides it into smaller and smaller clusters until some termination condition is reached.

Agglomerative hierarchical clustering is represented by the AGNES algorithm; divisive hierarchical clustering is represented by the DIANA algorithm. A small AGNES-style sketch follows.
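
The following is a minimal AGNES-style merge loop using the shortest-distance (single-link) measure from the previous point; the points and the termination condition (stop at two clusters) are illustrative, and no attempt is made at efficiency.

```python
import numpy as np

# Toy data set; each point starts as its own cluster (AGNES-style, bottom-up).
points = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9], [4.5, 4.2]])
clusters = [[i] for i in range(len(points))]
target_k = 2  # termination condition: stop when this many clusters remain

def single_link(c1, c2):
    """Shortest distance between any pair of points drawn from the two clusters."""
    return min(np.linalg.norm(points[i] - points[j]) for i in c1 for j in c2)

while len(clusters) > target_k:
    # Find and merge the pair of clusters with the smallest inter-class distance.
    pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
    a, b = min(pairs, key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
    clusters[a].extend(clusters[b])
    del clusters[b]

print(clusters)  # with these points: [[0, 1], [2, 3, 4]]
```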

17. The methods and objectives of text mining (TM) are diverse; its basic levels are as follows:

Keyword search: The simplest way, it is similar to traditional search techniques.

Mining item associations: focuses on mining association information between items of page information (including keywords).

Information classification and clustering: use data mining classification and clustering techniques to classify pages and organize them at a higher level of abstraction.

Natural language processing: use natural language processing techniques to uncover the semantics of natural language, enabling more precise processing of Web content.

18. Commonly used techniques in Web access mining:

Path analysis

The most common application of path analysis is to determine the most frequently accessed path in a Web site, and such knowledge is very important for an e-commerce site or information security assessment.

Association rule discovery

Association rule discovery can be applied to sets of Web access transactions to find general association knowledge.

Sequential pattern discovery

In time-stamped transaction sets, sequential pattern discovery means finding internal transaction patterns such as “some items follow another item”.

Classification

Discovered classification rules give a description of the common attributes that identify a particular group of users; this description can be used to classify new items.

Clustering

Customers with similar characteristics can be aggregated from Web Usage data. In a Web transaction log, clustering customer information or data items facilitates the development and execution of future marketing strategies.

19. Data mining languages can be divided into three types, depending on their function and focus:

Data mining query languages: aim to use a database-style query language, similar to SQL, to express and carry out data mining tasks.

Data mining modeling languages: used to describe and define data mining models. A standard data mining modeling language is designed so that data mining systems can follow a common standard when defining and describing models.

General-purpose data mining languages: combine the characteristics of the two kinds above. They can define models and can also serve as query languages for interactive mining with a data mining system. Standardizing a general-purpose data mining language is an attractive research direction for resolving current problems in the data mining industry.

20. There are four strategies for rule induction: subtraction, addition, addition and subtraction, and subtraction and addition.

Subtraction strategy: take specific examples as the starting point and generalize them, i.e. remove conditions (attribute values) or conjunctive terms (for convenience, adding disjunctive terms is not considered), so that the generalized examples or rules still cover no counterexamples.

Addition strategy: the initial hypothesis is a rule with an empty condition part (an always-true rule); if the rule covers counterexamples, conditions or conjunctive terms are added to it until it no longer covers any counterexample. A toy sketch of this strategy follows.

Add-then-subtract strategy: because attributes are correlated, adding a condition may make an earlier condition useless, so the earlier condition then needs to be deleted.

Subtract-then-add strategy: handles correlation between attributes on the same principle.
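
Here is a toy sketch of the addition strategy: start with the empty (always-true) rule and greedily add attribute tests until no counterexample is covered. The attribute names, data, and the particular greedy scoring are all invented for illustration.

```python
# Toy training data for the target class "buys" = yes; attributes are invented.
examples = [
    ({"age": "young",  "income": "high", "student": "no"},  "no"),
    ({"age": "young",  "income": "high", "student": "yes"}, "yes"),
    ({"age": "middle", "income": "low",  "student": "yes"}, "yes"),
    ({"age": "old",    "income": "low",  "student": "no"},  "no"),
    ({"age": "middle", "income": "high", "student": "no"},  "yes"),
]

def covers(rule, x):
    """A rule (dict of attribute tests) covers x if every test matches."""
    return all(x.get(a) == v for a, v in rule.items())

# Addition strategy: begin with the empty rule (covers everything) and keep adding
# the condition that best separates positives from counterexamples.
rule = {}
while any(covers(rule, x) and y == "no" for x, y in examples):
    candidates = {(a, v) for x, _ in examples for a, v in x.items() if a not in rule}

    def score(cond):
        a, v = cond
        trial = dict(rule, **{a: v})
        pos = sum(covers(trial, x) and y == "yes" for x, y in examples)
        neg = sum(covers(trial, x) and y == "no" for x, y in examples)
        return (pos - neg, pos)

    rule.update([max(candidates, key=score)])

print("learned rule:", rule, "-> yes")
```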

21. Data mining has broad and narrow definitions.

In a broad sense, data mining is the process of mining knowledge that is hidden in large data sets (which may be incomplete, noisy, uncertain, and stored in various forms), is not known in advance, and is useful for decision making.

In a narrow sense, data mining can be defined as the process of extracting knowledge from data sets of a particular form.

22. The meaning of Web mining: applying data mining techniques to all kinds of Web data, including page content, page structure, user access information, and e-commerce data, to help people extract knowledge from the Internet and to provide decision support for visitors, site operators, and Internet-based business activities, including e-commerce.

23. Definition of K-nearest neighbors (KNN): compute the distance between each training tuple and the tuple to be classified, take the K training tuples nearest to the tuple to be classified, and assign the tuple to the category that holds the majority among those K tuples. A small sketch follows.
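
A minimal NumPy sketch of this rule, with invented 2-D points and class labels:

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=3):
    """Assign x to the majority class among its k nearest training tuples."""
    distances = np.linalg.norm(train_X - x, axis=1)  # distance to every training tuple
    nearest = np.argsort(distances)[:k]              # indices of the k closest tuples
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Toy data: two classes in 2-D; values are illustrative only.
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_X, train_y, np.array([1.1, 1.0])))  # -> "A"
print(knn_classify(train_X, train_y, np.array([5.1, 5.0])))  # -> "B"
```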

24. Performance analysis of k-means algorithm:

Main advantages:

It is a classical algorithm to solve the clustering problem, simple and fast.

The algorithm is relatively scalable and efficient for processing large data sets.

It works better when the resulting cluster is dense.

Main drawbacks:

It can only be used when the mean of a cluster is defined, so it may be unsuitable for some applications (for example, data with categorical attributes).

K (the number of clusters to be generated) must be given in advance and is sensitive to initial values, which may result in different results.

It is not suitable for finding clusters of non-convex shape or clusters of very different sizes. Moreover, it is sensitive to “noisy” and outlier data.
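
The following is a plain NumPy sketch of k-means that makes these points concrete: K is given in advance, the result depends on the random initial centers, and no handling of empty clusters or outliers is attempted. The toy data are two dense blobs.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: k must be given in advance and the result depends on the
    initial centers, which is exactly the sensitivity noted above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=-1), axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data with two dense blobs; values are illustrative only.
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(5, 0.3, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```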

25. Performance analysis of ID3 algorithm:

The hypothesis space of the ID3 algorithm contains all decision trees; it is a complete space of finite discrete-valued functions over the available attributes. ID3 therefore avoids one of the main risks of searching an incomplete hypothesis space: that the space might not contain the target function.

The ID3 algorithm uses all of the current training samples at each step of the search, which greatly reduces its sensitivity to errors in individual training samples. By modifying the termination criterion, it can therefore easily be extended to handle noisy training data.

The ID3 algorithm does not backtrack during its search. As a result, it is subject to the usual risk of hill-climbing search without backtracking: converging to a local rather than a global optimum. A sketch of the information-gain computation that drives this greedy search follows.
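
A small sketch of the entropy and information-gain computation ID3 uses to choose the split attribute at each node; the attribute names and labels are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute."""
    base = entropy(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# Tiny illustrative training set (attribute names are made up).
rows = [
    {"outlook": "sunny", "windy": "false"},
    {"outlook": "sunny", "windy": "true"},
    {"outlook": "rain",  "windy": "false"},
    {"outlook": "rain",  "windy": "true"},
]
labels = ["no", "no", "yes", "yes"]

# ID3 greedily picks the attribute with the highest gain, without backtracking.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best, information_gain(rows, labels, best))  # -> outlook 1.0
```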

26. Apriori algorithm has two fatal performance bottlenecks:

Scanning a transactional database multiple times requires a large I/O load

In each iteration k, every candidate in Ck must be checked against the database with a full scan to decide whether it joins Lk; if a frequent itemset contains 10 items, the transaction database must be scanned at least 10 times.

It may generate huge candidate sets

The candidate set Ck generated from Lk-1 grows very quickly; for example, 10^4 frequent 1-itemsets can generate a candidate 2-itemset collection with nearly 10^7 elements. Candidate sets of this size are a challenge for both time and main memory.

27. The main improvement methods to improve the adaptability and efficiency of Apriori algorithm include:

Partition-based methods: the basic principle is that "a k-itemset that is not frequent (relative to the minimum support) in any partition cannot be globally frequent".

Hashing based approach: The basic principle is that “k-itemsets with less than minimum support in a Hash bucket cannot be globally frequent”.

Sampling-based methods: The basic principle is to “evaluate the sampled subsets through sampling techniques, and in turn estimate the global frequency of k-item sets”.

Others: for example, dynamically deleting useless transactions: "a transaction that does not contain any itemset in Lk has no effect on subsequent scans and can be deleted". A sketch of the partition idea follows.
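
Below is a toy sketch of the partition idea: locally frequent itemsets are collected from each partition, and only those candidates are verified globally, since an itemset frequent in no partition cannot be globally frequent. The transactions and threshold are invented.

```python
from itertools import combinations

# Toy transactions split into two partitions; values are illustrative only.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}],
    [{"b", "c"}, {"a", "b", "c"}, {"c", "d"}],
]
min_support = 0.5  # minimum support as a fraction of transactions

def local_frequent(partition, k):
    """k-itemsets that meet min_support inside one partition."""
    items = sorted(set().union(*partition))
    out = set()
    for cand in map(frozenset, combinations(items, k)):
        if sum(cand <= t for t in partition) / len(partition) >= min_support:
            out.add(cand)
    return out

# An itemset that is not frequent in any partition cannot be globally frequent,
# so the union of locally frequent itemsets is a superset of the global ones.
candidates = set().union(*(local_frequent(p, 2) for p in partitions))

all_tx = [t for p in partitions for t in p]
global_frequent = {c for c in candidates
                   if sum(c <= t for t in all_tx) / len(all_tx) >= min_support}
print(global_frequent)
```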

28. Web-based data mining is much more complex than data mining for databases and data warehouses:

Heterogeneous data source environment: information on Web sites is heterogeneous; each site's information and organization are different; there is a great deal of unstructured text and complex multimedia information; and sites differ in usage, security, and privacy requirements.

Data is complex: some of it is unstructured (e.g. Web pages), often expressing document-like information in long sentences or phrases; some is semi-structured (e.g. email, HTML pages); and some is well structured (e.g. spreadsheets). Uncovering the general descriptive features hidden in these composite objects is an unavoidable responsibility of data mining.

Dynamic changing application environment:

Information on the Web is constantly changing, and information such as news and stocks is updated in real time.

This high variation is also reflected in dynamic linking and random access of pages.

Users on the Web are unpredictable.

The data environment on the Web is noisy.

29. Briefly describe the i-MIN process model for managing a knowledge discovery project.

The i-MIN process model divides the KDD process into steps IM1, IM2, and so on. In each step, a set of questions is addressed and the execution of the project is controlled against certain quality standards.

Task and purpose of IM1: the planning stage of a KDD project. It determines the enterprise's mining goals, selects the knowledge discovery patterns, and compiles the metadata obtained from those patterns. Its purpose is to embed the enterprise's mining goals into the corresponding knowledge patterns.

Task and purpose of IM2: the preprocessing stage of KDD; IM2a, IM2b, and IM2c correspond to the data cleaning, data selection, and data transformation stages, respectively. The purpose is to produce high-quality target data.

IM3 task and purpose: It is the mining preparation stage of KDD. Data mining engineers conduct mining experiments to repeatedly test and verify the validity of the model. The aim is to produce a Knowledge Concentrate through experimentation and training that provides a usable model for the end user.

IM4 task and purpose: It is the data mining stage of KDD. Users can obtain the corresponding knowledge by specifying the data mining algorithm.

IM5 task and purpose: it is the knowledge representation stage of KDD, forming normalized knowledge according to specified requirements.

IM6 task and purpose: it is the knowledge interpretation and use stage of KDD, and its purpose is to output knowledge intuitively according to user requirements or integrate it into the knowledge base of the enterprise.

30. What are the two steps of data classification?

Build a model that describes a predetermined set of data classes or concepts

A data tuple is also called a sample, instance, or object.

The data tuples analyzed to build the model form the training data set.

A single tuple in the training data set is called a training sample. Because the class label of each training sample is provided, this step is also known as supervised learning.

The classification model is constructed by analyzing the training data set, which can be provided in the form of classification rules, decision trees or mathematical formulas.

Classify using models

First assess the predictive accuracy of the model (the classifier). A brief sketch of both steps appears after this list.

If the accuracy of the model is acceptable, it can be used to classify data tuples or objects with unknown class labels.
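
As a rough illustration of the two steps, the sketch below learns from labelled tuples, estimates accuracy on held-out tuples, and only then classifies an unlabelled tuple. The 1-nearest-neighbour rule is used only as a stand-in classifier, and the data and the 0.8 acceptance threshold are invented.

```python
import numpy as np

# Labelled tuples (samples); the values and class labels are illustrative.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.15, 0.25], [0.95, 1.05]])
y = np.array(["low", "low", "high", "high", "low", "high"])

# Step 1: build a model from the training set (here the "model" of a
# 1-nearest-neighbour classifier is simply the stored training tuples).
train_X, train_y = X[:4], y[:4]
test_X, test_y = X[4:], y[4:]

def predict(x):
    return train_y[np.argmin(np.linalg.norm(train_X - x, axis=1))]

# Step 2a: assess predictive accuracy on tuples not used for training.
accuracy = np.mean([predict(x) == t for x, t in zip(test_X, test_y)])
print("estimated accuracy:", accuracy)

# Step 2b: if the accuracy is acceptable, classify tuples with unknown labels.
if accuracy >= 0.8:
    print(predict(np.array([0.12, 0.18])))  # -> "low"
```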

31. Features of Web access information mining:

Web access data has large capacity, wide distribution, rich connotation and diverse forms

A medium-sized website can record several megabytes of visitor information per day.

Widely distributed throughout the world.

Access information varies in form.

Access to information is rich in content.

Web access data contains information that can be used for decision making

The access characteristics of each user can be used to characterize that user and the visits to the site.

The access of the same type of users represents the personality of the same type of users.

Access data over a period of time represents the behavior and commonality of group users.

Web access information data is the bridge between Web designers and visitors.

Web access to information data is a good object for data mining research.

Features of Web access information mining objects

The elements of an access transaction are Web pages, and there is rich structural information between transaction elements.

The elements of an access transaction reflect each visitor's browsing order, so there is rich sequential information between transaction elements.

The content of each page can be abstracted into different concepts, which are partly determined by the order and number of visits.

Users have different access duration to the page, and the access duration represents the user’s interest in accessing the page.

32. Text information mining in Web pages:

The goal of mining is to summarize and categorize pages.

Page summary: applying the traditional text summary method to each page can get the corresponding summary information.

Page classification: The classifier inputs a set of Web pages (training set), then conducts supervised learning based on the text information content of the page, and then the learned classifier can be used to classify each newly entered page.

A common method in text learning is the TF-IDF vector representation, a bag-of-words representation of a document in which all words are extracted from the document regardless of word order or text structure. The two-dimensional table is constructed as follows:

Each column corresponds to one word, and the column set (feature set) consists of all the distinguishing words in the dictionary, so the whole column set may contain hundreds of thousands of columns.

Each row stores the word information of one page, with all words in the page mapped onto the column set (feature set). A column (word) takes the value 0 if the word does not appear in the page and k if it appears k times; words in the page that are not in the column set are discarded. This representation captures the frequency of words within the page.

For Chinese pages, word segmentation must be performed first before carrying out the two steps above.

The constructed two-dimensional table represents the word statistics of the Web page set and can finally be classified and mined with methods such as Naive Bayes or K-nearest neighbors.

Before mining, a feature subset is usually selected to reduce the dimensionality. A small sketch of building such a table follows.
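
A small sketch of the word-count table described above, followed by TF-IDF weighting; the toy pages are English strings (for Chinese pages a word segmenter would be applied first, as noted), and the weighting formula used here is one common variant rather than the only one.

```python
import math
from collections import Counter

# Toy "pages"; real Web pages would be tokenized (and segmented, for Chinese) first.
pages = [
    "data mining discovers knowledge from data",
    "web mining applies data mining to web data",
    "clustering groups similar objects",
]
tokenized = [p.split() for p in pages]

# Column set (feature set): every distinct word across the page collection.
vocabulary = sorted(set(w for doc in tokenized for w in doc))

# Each row is one page: the value in a column is the count of that word in the page.
counts = [[Counter(doc)[w] for w in vocabulary] for doc in tokenized]

# TF-IDF weighting: term frequency scaled down by how many pages contain the word.
n_docs = len(tokenized)
df = {w: sum(w in doc for doc in tokenized) for w in vocabulary}
tfidf = [[tf * math.log(n_docs / df[w]) for tf, w in zip(row, vocabulary)]
         for row in counts]

print(vocabulary)
print(counts[0])
print([round(v, 2) for v in tfidf[0]])
```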