Although the value of data is now widely accepted, to most people the actual application of data remains mysterious; even some data practitioners find it hard to grasp. As a result, many companies talk about being data-driven but never really demonstrate the value of their data.

Data, business, and algorithms form a closed loop. Only by embedding data thinking firmly into the business can we uncover the trends and patterns hidden in the data.

In this article, I want to walk through, step by step, how data is applied to a business, and help you understand the workflow of a data expert.

Define the problem

Not every problem can be solved with data. Many problems look like data problems when in reality data can play little role, while many problems that don’t seem solvable with data turn out, once properly abstracted, to be ones where data creates real value.

Many content recommendation platforms want to find “good” content. However, judging quality from the content data alone, without user actions, often falls short, because the notion of “good” is too broad and has no clear standard.

We can abstract the problem by turning “find good content” into “find content with a high completion rate”. The problem now has a clear, quantifiable goal, which matches how a machine handles problems.
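To make this concrete, here is a minimal sketch of that abstraction in Python; the watch-log schema (content_id, watch_seconds, duration_seconds) is hypothetical, invented purely for illustration.

```python
import pandas as pd

# Hypothetical watch log: one row per view.
events = pd.DataFrame({
    "content_id":       ["a", "a", "b", "b", "b"],
    "watch_seconds":    [55, 60, 10, 20, 15],
    "duration_seconds": [60, 60, 120, 120, 120],
})

# Completion rate per view, capped at 1.0 to handle replays.
events["completion"] = (events["watch_seconds"]
                        / events["duration_seconds"]).clip(upper=1.0)

# Average completion rate per item: a clear, quantifiable target,
# unlike the vague notion of "good" content.
ranking = (events.groupby("content_id")["completion"]
           .mean()
           .sort_values(ascending=False))
print(ranking)
```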

Machines think differently from people. People think in networks and can approach problems divergently, while machines are linear: every decision needs a clear, quantifiable goal.

In order for machines to learn from data, we need to translate human problems into a form that machines can understand.

At the same time, goals that look identical to a human can, from the machine’s point of view, lead to completely different results. Therefore, when defining the goal of a problem, you need to be very precise in order to achieve the desired effect.

For example, if a platform wants to increase user retention but sets click-through rate as the goal, the two may look similar on the surface, yet the end results are very different.
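A rough sketch of how the two metrics diverge, computed from the same hypothetical activity log (all column names invented): a user can click heavily on day one and never come back.

```python
import pandas as pd

# Hypothetical activity log: one row per user per active day.
log = pd.DataFrame({
    "user_id":     [1, 1, 2, 3, 3, 4],
    "day":         [1, 2, 1, 1, 2, 1],
    "clicks":      [9, 1, 10, 0, 2, 8],
    "impressions": [10, 10, 10, 10, 10, 10],
})

# Goal 1: click-through rate across all impressions.
ctr = log["clicks"].sum() / log["impressions"].sum()

# Goal 2: day-2 retention -- the share of day-1 users who returned on day 2.
day1_users = set(log.loc[log["day"] == 1, "user_id"])
day2_users = set(log.loc[log["day"] == 2, "user_id"])
retention = len(day1_users & day2_users) / len(day1_users)

# Users 2 and 4 clicked a lot but never came back: high CTR, weak retention.
print(f"CTR: {ctr:.2f}, day-2 retention: {retention:.2f}")
```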

Prepare the data

In practice, more than 80% of the time is spent preparing data, and it is the most important technical step in the whole process. As the saying goes, “even a clever housewife cannot cook a meal without rice.”

But what kind of data does the machine actually need, and what counts as high-quality data?

You might think that more data is better, but in fact more comprehensive data is better. No matter how much mineral water you drink, you will never learn the taste of cola.

When facing a specific problem, we need to determine whether the data describing it is comprehensive enough to cover the problem’s different aspects. At the same time, we need to adjust the proportions of different samples in the data so that the machine can learn adequately.

In statistics, inferences are only valid if the sample is representative.
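A common adjustment is rebalancing class proportions. Below is a minimal sketch using naive upsampling on a hypothetical imbalanced label column; downsampling or reweighting are equally valid alternatives.

```python
import pandas as pd

# Hypothetical labeled dataset: 95 negatives, only 5 positives.
df = pd.DataFrame({
    "feature": range(100),
    "label":   [1] * 5 + [0] * 95,
})
counts = df["label"].value_counts()
print(counts.to_dict())  # {0: 95, 1: 5} -- the machine would mostly see negatives

# Naive rebalancing: upsample the minority class with replacement
# so both classes are equally represented during training.
minority = df[df["label"] == 1]
upsampled = minority.sample(n=counts.max(), replace=True, random_state=0)
balanced = pd.concat([df[df["label"] == 0], upsampled])
print(balanced["label"].value_counts().to_dict())  # {0: 95, 1: 95}
```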

Therefore, manually annotated data is often needed to strengthen what the machine can learn. Besides serving user-side needs, product features such as “like” and “favorite” also double as data-annotation mechanisms.

Feature engineering

Humans can easily process unstructured data, but machines can only process structured data. If the data describing a problem cannot be turned into features, the machine cannot learn any rules from it.

In feature engineering, the judgment of a data specialist comes from a lot of practical experience and understanding of the business.

Therefore, a good data expert must be deeply familiar with the business and able to build a technical pipeline from raw data to feature data, one that fully covers the original business experience, or even goes beyond its limitations.

It is not only necessary to clean, join, and organize the data, but also to capture the deeper data hidden behind the raw data.

There are three ways to extract this deeper data:

  1. Time: extract new data from how the data changes across different time dimensions.
  2. Scenario: refine new data by comparing how the data changes across different scenarios.
  3. Cross: extract new data by cross-comparing different pieces of data.

Often, when many pieces of weak data are combined and associated with one another to generate new data, this composite data becomes the key to solving the problem.
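A minimal sketch of all three extraction patterns on a hypothetical purchase log (schema and column names invented for illustration):

```python
import pandas as pd

# Hypothetical raw purchase log.
raw = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date":    pd.to_datetime(["2024-01-01", "2024-01-15", "2024-02-01",
                               "2024-01-03", "2024-02-20"]),
    "channel": ["app", "web", "app", "web", "web"],
    "amount":  [10.0, 25.0, 5.0, 80.0, 15.0],
})

per_user = raw.groupby("user_id").agg(total=("amount", "sum"),
                                      orders=("amount", "count"))

# 1. Time: how spend changed from the first month to the last.
raw["month"] = raw["date"].dt.to_period("M")
monthly = raw.pivot_table(index="user_id", columns="month",
                          values="amount", aggfunc="sum", fill_value=0.0)
per_user["spend_trend"] = monthly.iloc[:, -1] - monthly.iloc[:, 0]

# 2. Scenario: the same spend broken down by channel (app vs. web).
app_spend = (raw[raw["channel"] == "app"].groupby("user_id")["amount"].sum()
             .reindex(per_user.index, fill_value=0.0))
per_user["app_share"] = app_spend / per_user["total"]

# 3. Cross: combine two existing features into a new composite one.
per_user["avg_order_value"] = per_user["total"] / per_user["orders"]

print(per_user)
```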

Algorithm tuning

Real data is often extremely complex, and it takes simple, robust algorithms to conquer it. A good algorithm is one that does not waste data and maximizes its value, drawing on different data structures to bring that value out.

Data is the material for solving the problem; the algorithm is the tool. Algorithms fall into three main camps:

  1. Rule-based learning: makes no assumptions about the data, but extracts a set of decision rules directly from real data and assumes those rules apply to all new data.
  2. Frequentism: assumes the data to be learned follows some ideal statistical distribution, and uses mathematical techniques to infer rules from that idealized data.
  3. Bayesianism: does not reason from an idealized distribution, but instead finds correlations between different cases.
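As a loose illustration (the mapping of library models onto these camps is approximate, not a strict taxonomy), here is a sketch running one representative of each on the same toy data with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier      # rule-based: explicit decision rules
from sklearn.linear_model import LogisticRegression  # frequentist-flavored: fits an assumed model
from sklearn.naive_bayes import GaussianNB           # Bayesian-flavored: probabilistic reasoning

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Evaluate each camp's representative with 5-fold cross-validation.
for model in (DecisionTreeClassifier(random_state=0),
              LogisticRegression(max_iter=1000),
              GaussianNB()):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{type(model).__name__}: {score:.3f}")
```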

To decide which algorithm works best, set up automatic algorithm selection and automatic parameter tuning driven by experimental results, so that the machine picks the most suitable algorithm and settings for the data at hand.
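A minimal sketch of that idea using scikit-learn’s GridSearchCV: cross-validated search chooses both the algorithm and its parameters (the candidate models and grids here are arbitrary examples).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate algorithms, each with its own parameter grid.
candidates = [
    (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}),
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
]

# Cross-validated search: the machine selects both the algorithm
# and its parameters based on experimental results.
best = max(
    (GridSearchCV(model, grid, cv=5).fit(X, y) for model, grid in candidates),
    key=lambda search: search.best_score_,
)
print(type(best.best_estimator_).__name__,
      best.best_params_, round(best.best_score_, 3))
```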

Just as understanding how an engine works doesn’t help you get better at driving, it takes practice to get the best results.

If machine learning algorithms are a powerful engine, that engine still needs an easy-to-use steering wheel, and visualization is the steering wheel of machine learning.

The algorithm can only be tuned effectively once the various metrics are visualized.
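For example (a sketch, with matplotlib as one possible tool), plotting training accuracy against validation accuracy over a parameter range makes overfitting and underfitting visible at a glance:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Score a decision tree at increasing depths with 5-fold cross-validation.
depths = np.arange(1, 15)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

# Where the two curves separate is where the model starts to overfit.
plt.plot(depths, train_scores.mean(axis=1), label="train")
plt.plot(depths, val_scores.mean(axis=1), label="validation")
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```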

Conclusion

Experienced data experts can rely on business experience, intuition, and logical reasoning to extract a large number of predictive features and quickly find the algorithm that solves the problem.

That is why data engineers are worth more as they get older.

Finally, we have a book, “Understanding NLP Chinese Word Segmentation: From Principle to Practice”, which will help you master Chinese word segmentation from scratch and open the door to NLP.

If this article helped you, please like, comment, and share. Thank you!