This is the fourth day of my participation in the November Gwen Challenge. Check out the details: The last Gwen Challenge 2021

What is machine learning?

Machine learning gives machines a human-like ability to learn: it uses statistics, probability theory, and algorithms to analyze data so that the machine behaves as expected in a particular scenario.

Data is the most important ingredient in machine learning. A large amount of data is needed to train the machine so that it can make decisions based on that data.

When people learn to recognize something, they usually begin by associating its characteristics with the thing itself, and machines do the same. For example, for a machine to recognize an elephant, it must first be told the elephant's characteristics, such as a long trunk, big ears, and a large body. In machine learning these characteristics are represented by numerical values, called features (or feature quantities). Combined into a tuple such as (100, 8, 70), they form a feature vector.
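As a minimal sketch of the idea above, the snippet below packs three numeric features into one feature vector. The class name, the field names, and the mapping of each number to a feature are assumptions made only for illustration; the text gives just the values (100, 8, 70).

```python
from dataclasses import dataclass, astuple


@dataclass
class Elephant:
    # Hypothetical feature names; the units are not specified in the text.
    nose_length: float
    ear_size: float
    body_size: float


def to_feature_vector(animal: Elephant) -> tuple:
    """Combine the individual feature values into one feature vector."""
    return astuple(animal)


vector = to_feature_vector(Elephant(100, 8, 70))
print(vector)
```

Keeping the features in a fixed order matters: every sample must place the same characteristic in the same position so that vectors can be compared.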

Difference between data analysis and data mining

Data can be either analyzed or mined. The differences between the two approaches are as follows.

Aspect         | Data analysis                                          | Data mining
-------------- | ------------------------------------------------------ | -----------------------------------------------------------------------------------------
Methods        | Uses statistics to derive results                      | Uses statistics plus machine learning to derive results
Focus          | Business                                               | Technology
Implementation | Uses Excel for numerical calculation and visualization | Uses programming and machine learning techniques for numerical calculation and visualization
Results        | Presents data                                          | Uses data to make predictions

The boundary between data analysis and data mining is becoming increasingly blurred: data analysts are gradually adopting techniques such as machine learning to process larger data sets and extract more value from data.

Machine learning process

The process of machine learning is: data source -> data preprocessing -> feature engineering -> data modeling -> data validation.

Data source

When users use software, they generate a series of behaviors, such as clicking, commenting, and dwelling on a page. The front end sends this behavior data to the server, and the server saves it to a database or file store, such as MySQL, HBase, Hive, or HDFS. Data can then be read from these sources for preprocessing, analysis, modeling, and validation.

Data preprocessing

Once you have a data source, the next step is data preprocessing, that is, cleaning the data to keep only what is needed. Dirty data must be handled appropriately. The following table describes common types of dirty data and how to handle them.

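Dirty data    | Handling method
------------- | ----------------------------------------------------------------------------------------------
id            | Can usually be removed
Missing value | Discard or fill, depending on the business situation
Outlier       | For example, an invalid age; a special value may be assigned, depending on the business situation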
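The handling methods in the table can be sketched as follows. The records, the field names, the valid age range, and the special value -1 are all assumptions made for illustration; whether to discard or fill a missing value depends on the business situation, so this sketch arbitrarily discards rows with a missing age and fills missing click counts.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_records = [
    {"id": 1, "age": 25, "clicks": 10},
    {"id": 2, "age": None, "clicks": 3},   # missing value
    {"id": 3, "age": -5, "clicks": 7},     # outlier: invalid age
    {"id": 4, "age": 40, "clicks": None},  # missing value
]

INVALID_AGE = -1  # assumed special value that marks an invalid age


def preprocess(records, default_clicks=0):
    """Drop the id, discard or fill missing values, and flag invalid ages."""
    cleaned = []
    for record in records:
        # The id carries no information for modeling, so it is removed.
        row = {key: value for key, value in record.items() if key != "id"}
        if row["age"] is None:
            continue  # discard rows with a missing age
        if not 0 <= row["age"] <= 120:
            row["age"] = INVALID_AGE  # assign a special value to the outlier
        if row["clicks"] is None:
            row["clicks"] = default_clicks  # fill the missing value
        cleaned.append(row)
    return cleaned


cleaned = preprocess(raw_records)
for row in cleaned:
    print(row)
```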

Feature engineering

Feature engineering is essentially the statistical analysis phase: once the data is clean, you analyze it statistically and visualize it, and then build a mathematical model from it.
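As a small illustration of the statistical analysis step, the sketch below summarizes one cleaned column with Python's standard `statistics` module. The `ages` values are hypothetical, and visualization is left out to keep the example short.

```python
import statistics

# Hypothetical cleaned column; the values are made up for illustration.
ages = [25, 31, 40, 22, 31]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```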

Data modeling

Suppose that after statistical analysis only two kinds of data remain, one denoted x and the other y. If we can find a function f such that y = f(x) for any x, the process of finding it is data modeling. For each future value of x, the fixed f(x) then gives y, which is a simple form of data prediction.
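One way to find such an f is sketched below, under the assumption that x and y are roughly linearly related; the text does not say what form f takes. The sketch fits a straight line by ordinary least squares, and the sample data and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def f(x):
        # The fixed model: once fitted, it maps any x to a predicted y.
        return slope * x + intercept

    return f


# Hypothetical training data that happens to lie on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

f = fit_line(xs, ys)
print(f(5))  # prediction for a value of x not seen during fitting
```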

Data validation

The final stage is data validation, which uses test data to verify the accuracy of the model.
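A minimal sketch of this step, assuming a numeric model, is to compare predictions against held-out test data and report the mean absolute error. The text does not name a specific metric, so this is only one possible choice; `model` and `test_data` are hypothetical.

```python
def model(x):
    # Hypothetical model obtained in the modeling stage.
    return 2 * x + 1


# Hypothetical test data as (x, actual y) pairs, not used to build the model.
test_data = [(5, 11.0), (6, 13.5), (7, 14.5)]


def mean_absolute_error(predict, data):
    """Average absolute difference between predicted and actual values."""
    return sum(abs(predict(x) - y) for x, y in data) / len(data)


mae = mean_absolute_error(model, test_data)
print(f"mean absolute error: {mae:.3f}")
```

A smaller error means the model's predictions are closer to the observed values in the test data.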

Analysis model

Data analysis has a set of standard analysis models that help us extract more value from data.

LRFMC is a widely used analysis model. Each letter stands for one data analysis indicator.

  • L: Relationship length. The time interval from when the user was created to the point at which the data is used.
  • R: Consumption recency. The time interval from the user's last use of the service to the point at which the data is used.
  • F: Consumption frequency. The number of times the user has used the service over a period of time.
  • M: Consumption time. The total length of time the user has used the service.
  • C: Average discount coefficient. The average discount rate of the user's spending.

After the data has been cleaned, it can be extracted according to these five indicators: find the fields that correspond to each indicator, then compute the five indicators from them.
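The calculation might look like the sketch below. The input layout (a registration date plus a list of usage sessions), the function name `lrfmc`, and the sample values are assumptions for illustration; real field names depend on the data source.

```python
from datetime import date


def lrfmc(registered_on, sessions, observed_on):
    """Compute the five indicators for one user.

    sessions is a non-empty list of (day, minutes_used, discount_rate).
    """
    last_use = max(day for day, _, _ in sessions)
    return {
        "L": (observed_on - registered_on).days,           # relationship length
        "R": (observed_on - last_use).days,                # days since last use
        "F": len(sessions),                                # number of uses
        "M": sum(minutes for _, minutes, _ in sessions),   # total usage time
        "C": sum(rate for _, _, rate in sessions) / len(sessions),  # avg discount
    }


# Hypothetical user record.
indicators = lrfmc(
    registered_on=date(2021, 1, 1),
    sessions=[
        (date(2021, 10, 1), 30, 0.9),
        (date(2021, 10, 20), 45, 0.8),
    ],
    observed_on=date(2021, 11, 1),
)
print(indicators)
```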

An important use of the LRFMC model is to classify users so that targeted strategies can be adopted for each type. For example, the model can identify long-term, stable, heavy users, who can be given priority service, while users who spend little and rarely use the service can be given less attention.
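A very simple rule-based version of this classification is sketched below. The thresholds, the labels, and the function name `classify_user` are hypothetical; in practice the groups may come from a clustering algorithm rather than fixed cut-offs.

```python
def classify_user(indicators, min_days=365, min_uses=10, min_minutes=300):
    """Assign a coarse user type from L, F and M (thresholds are assumptions)."""
    long_term = indicators["L"] >= min_days
    frequent = indicators["F"] >= min_uses
    heavy = indicators["M"] >= min_minutes
    if long_term and frequent and heavy:
        return "priority"        # long-term, stable, heavy users
    if not frequent and not heavy:
        return "low_attention"   # little use of the service
    return "regular"


# Hypothetical users described by their L, F and M indicators.
users = {
    "alice": {"L": 800, "F": 40, "M": 1200},
    "bob": {"L": 30, "F": 1, "M": 10},
    "carol": {"L": 100, "F": 20, "M": 500},
}

labels = {name: classify_user(ind) for name, ind in users.items()}
print(labels)
```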