One, foreword

“This is the fifth day of my participation in the First Challenge 2022. For details: First Challenge 2022.”

Hello, I am Wang Laoshi. Careful readers may have noticed that I changed my name. The reason? After a year I have grown up a bit, and "DarkKing" felt a little too adolescent, so I switched to something more mature. (Am I really going to admit that I simply struggle with naming things?)

As the Internet matures and the Internet of Things rises, Internet companies are no longer content with the real world: with the birth of the metaverse concept, they have begun moving into the virtual one. Even Shanghai has listed the metaverse as a key construction direction in its planning. Fellow "leeks" (retail users ripe for harvesting), are you trembling? After being harvested in the real world, the harvest continues in the virtual world. There is nowhere left to hide.

We won't comment on whether the metaverse is good or bad. But the more people interact through the cloud, the more data will be generated, and big data storage has already evolved from the data warehouse to the data lake. Mastering some big data knowledge is therefore very necessary for future development and for understanding trends. A new column will follow, detailing the process of building a data architecture from zero to one, along with some insights into solving the problems encountered along the way. Interested friends are welcome to discuss together. Today, let's talk about data quality.

Two, what to do when the data is constantly questioned as inaccurate?

Every day, the first thing business leaders do when they arrive at the company is turn on the computer and look at yesterday's business data, see how the overall performance was, and then analyze the data to formulate the next operation strategy. One morning, when a colleague on the business line opened the data to do analysis and plan operations, he found that all of yesterday's numbers were 0, and an incident was immediately reported to the data department.

After receiving the complaint about abnormal data, the data department stepped in to locate the cause. The investigation started from the downstream ADS application-layer tables. However, a data indicator may be produced from several or even dozens of tables, so checking them one by one is obviously very time-consuming. Working upward layer by layer, it was finally found that the ODS collection layer had not collected any data: the collection module had failed because insufficient physical resources made the collection function unavailable. The team then had to expand capacity, re-collect, and re-run the jobs to restore the data.

The investigation took nearly a whole morning, and reprocessing the data took roughly another half day, so the data was unavailable for close to a day. Business progress in other divisions was also affected.

After an incident like this, can the business side really be satisfied with the data department? Ensuring data quality is therefore a problem that every data team must overcome. So what goals do we need to achieve to guarantee data quality?

  1. How can data anomalies be discovered before the business side notices? Don't wait for a complaint from the business side to find the problem.
  2. How can the problem be located quickly, instead of checking table by table?
  3. How can data problems be fixed quickly? If a failure occurs at the data collection layer, all downstream synchronization, processing, calculation, and output have to be re-run, so the repair cost is extremely high.

Three, the root causes of data quality problems

Based on several years of data development experience, data quality problems mainly come from the following sources.

1. Business system changes

Because the data side mainly depends on the business side, when the business side releases a version iteration without notifying the data side of changes to the tables and logs that feed the data, data anomalies can result. There are generally the following four situations.

  • The business system switches writes to a new table and stops writing to the old one, so the data team's data is no longer updated.
  • The structure of a business system table changes, causing data synchronization exceptions.
  • The business system environment changes, causing data collection exceptions.
  • The format of the business system logs changes, causing collection exceptions.

2. Insufficient system resources

In a big data system, computation generally runs on a shared cluster, with resources managed through YARN.

Resources are scarce. Improper allocation of computing resources or poorly optimized SQL can exhaust them and cause calculations to fail. The main causes are as follows.

  • Insufficient memory for data computation causes task exceptions.
  • Insufficient disk space for data storage causes task exceptions.
  • Tasks are scheduled too densely in the same time window and preempt each other's resources.
  • Computing tasks are added on short notice and system resources are not expanded in time.
  • Slow SQL queries in the system delay calculations and overload computing tasks.

3. Unstable infrastructure

Such anomalies are rare, but when they do occur, the effects are global and deadly.

  • Computer room power outages
  • Physical servers going down
  • Network instability
  • Bugs in the open source components themselves

4. System code bugs

This kind of problem occurs most frequently, especially bugs in the big data business code itself or mistakes in task configuration, resulting in calculation errors or task failures.

  • The presentation-layer code has bugs
  • The data development code has bugs
  • A data task is released to production incorrectly
  • A data task is configured incorrectly

Four, how to improve data quality

How do we improve data quality and increase the business side's satisfaction with the data team? We need to provide targeted solutions based on our goals and the root causes identified above.

  1. Identify data problems early
  2. Quickly locate the cause of a problem
  3. Improve the data recovery speed

1. How to find data problems in advance

Infrastructure resource alarms

Add resource alarms for the infrastructure to ensure sufficient computing resources. For example, when computing resource usage reaches 90%, a phone alert should be triggered.
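As a hedged illustration, here is a minimal Python sketch of such an alarm, assuming the standard YARN ResourceManager metrics endpoint; the host name and the send_phone_alert() function are placeholders for whatever alerting channel you actually use.

```python
# Minimal sketch: poll the YARN ResourceManager cluster metrics and fire an
# alert when memory usage crosses a threshold. Host/port and the alert
# function are hypothetical placeholders.
import requests

RM_METRICS_URL = "http://yarn-rm.example.com:8088/ws/v1/cluster/metrics"  # placeholder host
THRESHOLD = 0.90


def send_phone_alert(message: str) -> None:
    # Placeholder: integrate with your on-call / phone alerting system here.
    print(f"[ALERT] {message}")


def check_cluster_memory() -> None:
    metrics = requests.get(RM_METRICS_URL, timeout=5).json()["clusterMetrics"]
    used_ratio = metrics["allocatedMB"] / metrics["totalMB"]
    if used_ratio >= THRESHOLD:
        send_phone_alert(f"YARN memory usage at {used_ratio:.0%}, threshold {THRESHOLD:.0%}")


if __name__ == "__main__":
    check_cluster_memory()
```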

Add data audit rules

Design a set of business-oriented verification rules for the data produced by each task, such as whether the row counts before and after processing are consistent, whether key fields stay within their enumerated values, and whether the maximum and minimum values of fields are within the expected range. This is the most effective way to improve data quality and ensure data accuracy. Field-level verification rules can also backtrack data quality problems to the business side and surface unreasonable values in the business databases for governance. Because audit rules are tightly coupled to the business, there are no ready-made open source components in the industry; they are basically developed in-house. The implementation, however, is relatively simple. The general practice is as follows (a minimal sketch of this flow follows the list):

  1. Add corresponding audit rules for each data task, and run the checks after the synchronization or computing task completes.
  2. If the audit passes, downstream tasks execute normally.
  3. If the audit fails, an alarm is sent, and developers judge whether the task needs to be re-executed or investigated for anomalies.
  4. If the computing task depends on a strong rule, stop the subsequent computing tasks.
  5. If the rule is not a strong one, subsequent computing tasks are not affected and can continue.
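Below is a minimal Python sketch of this audit flow (not any particular company's implementation); the rule definitions, the dummy checks, and the alert() function are illustrative placeholders.

```python
# After a task finishes, run its audit rules; on failure, alert, and block
# downstream tasks only for "strong" (blocking) rules.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AuditRule:
    name: str
    check: Callable[[], bool]   # returns True when the data passes
    blocking: bool              # strong rule: failure stops downstream tasks


def alert(message: str) -> None:
    print(f"[AUDIT ALERT] {message}")  # placeholder for the real alert channel


def run_audits(task_name: str, rules: List[AuditRule]) -> bool:
    """Return True if downstream tasks may continue."""
    can_continue = True
    for rule in rules:
        if rule.check():
            continue
        alert(f"{task_name}: rule '{rule.name}' failed")
        if rule.blocking:
            can_continue = False
    return can_continue


# Example usage with dummy checks standing in for real row-count / enum checks.
rules = [
    AuditRule("row_count_matches_source", check=lambda: True, blocking=True),
    AuditRule("status_in_enum_range", check=lambda: False, blocking=False),
]
if run_audits("ods_orders_sync", rules):
    print("downstream tasks may run")
else:
    print("downstream tasks blocked")
```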

Common data audit rules should cover the following dimensions:

  1. Data integrity

Data integrity, as the name implies, means ensuring that records are complete and not lost: at the table level, whether row counts are consistent, whether primary keys are unique, and how much the data volume fluctuates during collection; at the field level, whether key fields are non-null and non-zero and whether values fall within the expected enumeration range.

  2. Data consistency

Data consistency mainly means that the same data should agree across different models. For example, suppose the total number of users yesterday was 20,000 and the number of active users was 2,000, yet the active-user ratio is reported as 20%. These three metrics are inconsistent, because the ratio should be active users / cumulative users = 10%; the 20% figure was probably computed from the registration count in another model, producing the inconsistency. Data consistency therefore demands that a piece of data have only one source, with every indicator that depends on it derived from that single table. This is a very important standard in data modeling.

  3. Data accuracy

Data accuracy mainly means ensuring that individual records are correct. For example, a user cannot place an order earlier than the product's release date, and the number of active users cannot exceed the number of registered users. (Illustrative check queries for these three dimensions follow.)
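For illustration only, here are a few check queries that could back the three dimensions above, written against hypothetical tables (ods.users, dw.users, dw.orders, dw.products, dw.daily_summary); adapt the names and thresholds to your own models.

```python
# Illustrative audit-rule SQL, grouped by the three dimensions above.
# All table and column names are hypothetical.
AUDIT_RULES = {
    "integrity": {
        # row counts before and after synchronization should match (diff = 0)
        "row_count_diff": """
            SELECT (SELECT COUNT(*) FROM ods.users)
                 - (SELECT COUNT(*) FROM dw.users) AS diff
        """,
        # the primary key must be unique (query should return no rows)
        "duplicate_user_id": """
            SELECT user_id FROM dw.users GROUP BY user_id HAVING COUNT(*) > 1
        """,
    },
    "consistency": {
        # the active-user ratio must equal active_users / total_users
        # (query should return no rows)
        "active_ratio_matches": """
            SELECT 1 FROM dw.daily_summary
            WHERE ABS(active_ratio - active_users * 1.0 / total_users) > 0.001
        """,
    },
    "accuracy": {
        # no order may be placed before its product's release date
        # (query should return no rows)
        "order_before_release": """
            SELECT o.order_id
            FROM dw.orders o JOIN dw.products p ON o.product_id = p.product_id
            WHERE o.order_time < p.release_date
        """,
    },
}
```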

The inspection tasks are as follows:

The computing task is linked with the audit task to ensure the accuracy of the business data as it is produced.

For those interested, NetEase's template audit rule configuration model can be used as a reference:

Visualization of audit tasks

By visualizing audit tasks, developers can see task execution more clearly and analyze the cause of problems more quickly.

Are more audit checks always better?

Audit tasks themselves consume resources, so we need to distinguish the importance levels of business indicators: audit tasks must be added for the most important indicators, while ordinary ones get them on demand. Use resources wisely and keep costs down.

2. How to quickly locate the cause of the problem

Data lineage

Data warehouses are usually designed in layers. With a good model, data reuse is high and an intermediate result may be consumed by multiple data models, which makes the processing links long and problems time-consuming to locate. Establishing full-link monitoring and data lineage is therefore very necessary.

As can be seen from the figure below, lineage is generally built by starting from the business data source, recording the big data processing steps, the binding of indicators, and the application of indicators, thereby establishing the full processing path of the data. When a data indicator goes wrong, the failing node can then be quickly located along the data link and resolved promptly.
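As a hedged sketch of the idea, the following Python snippet models lineage as a simple adjacency map and walks upstream from a broken indicator; the table names are illustrative, and a production system would persist this graph in a metadata store.

```python
# Each node lists its upstream parents, so when an indicator is wrong we can
# walk back along the processing path. Node names are illustrative.
from typing import Dict, List

# downstream table -> list of upstream dependencies
LINEAGE: Dict[str, List[str]] = {
    "ads_daily_revenue": ["dws_order_summary"],
    "dws_order_summary": ["dwd_orders"],
    "dwd_orders": ["ods_orders"],
    "ods_orders": [],  # collection-layer table, no upstream in the warehouse
}


def upstream_path(node: str) -> List[str]:
    """Return every table the given node depends on, nearest first."""
    path, queue = [], list(LINEAGE.get(node, []))
    while queue:
        current = queue.pop(0)
        if current not in path:
            path.append(current)
            queue.extend(LINEAGE.get(current, []))
    return path


# When ads_daily_revenue is wrong, check its upstream tables in order.
print(upstream_path("ads_daily_revenue"))
# ['dws_order_summary', 'dwd_orders', 'ods_orders']
```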

Implementation of data lineage

Apache Atlas is Apache's open source metadata governance system. It provides core metadata governance capabilities for Hadoop clusters, including data classification, a centralized policy engine, data lineage, security, and lifecycle management. By configuring hooks in Hadoop ecosystem components, task execution information and data metadata are automatically dumped into Atlas. I have previously written a blog post about installing Atlas; interested readers can take a look: [Atlas installation and use](blog.csdn.net/b379685397/…).

Atlas table metadata information

Atlas data lineage

If a component is not part of the Hadoop ecosystem, you may need to implement the corresponding hook yourself.

Because there are so many big data and business storage components, it is difficult to build a complete lineage system covering every storage component and synchronization task, so in practice lineage components are mostly developed in-house around the business. Lineage can be associated in two ways: manually and automatically.

Manual lineage

Manual association requires visualizing the entire data development process. When collection tasks, computing tasks, and indicator associations are configured, the lineage is written into the lineage system through that configuration and finally displayed. This approach places high demands on the data platform's visualization; if a developer runs a script by hand, the lineage may not be recorded, leaving the lineage incomplete and hard to maintain.

Automatic lineage

Automatic lineage records data flow during task execution through SQL parsing or logs and writes it into the lineage system automatically. Common methods include: 1. instrumentation inside the data engine; 2. SQL proxy logs. Different methods suit different storage engines. For SQL engines, SQL parsing can record where the data comes from and where it goes; for ES, lineage can be recorded through engine instrumentation.
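Here is a deliberately simplistic, hedged sketch of the SQL-parsing approach in Python: it extracts the target table from an INSERT statement and the source tables from FROM/JOIN clauses with regular expressions. A real implementation would use a proper SQL parser, since regexes miss CTEs, subqueries, and quoting.

```python
# Toy lineage extraction from a single INSERT ... SELECT statement.
import re


def extract_lineage(sql: str):
    # target table: INSERT INTO ... or INSERT OVERWRITE TABLE ...
    target = re.search(r"insert\s+(?:overwrite\s+table|into)\s+([\w.]+)", sql, re.I)
    # source tables: anything following FROM or JOIN
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return {
        "target": target.group(1) if target else None,
        "sources": sorted(set(sources)),
    }


sql = """
INSERT OVERWRITE TABLE dws.order_summary
SELECT o.dt, COUNT(*) FROM dwd.orders o JOIN dim.users u ON o.user_id = u.id
GROUP BY o.dt
"""
print(extract_lineage(sql))
# {'target': 'dws.order_summary', 'sources': ['dim.users', 'dwd.orders']}
```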

Display of data lineage

Xiaomi's data lineage display

NetEase's data lineage display

3. How can I speed up data recovery?

With the first two steps in place, data problems can be found and located in time, so the next step is recovery. Recovery from data anomalies is generally designed around the problems that occur most often, by strengthening the robustness and fault tolerance of the applications. For example, log collection tasks should be able to backfill offline data, and data synchronization should be able to resume after an interruption or update incrementally. More importantly, data should be classified by importance and restored in order of priority.
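As a rough illustration of priority-driven recovery, the sketch below simply re-runs pending backfill tasks in order of business importance; the task names, priorities, and rerun() function are placeholders for a real scheduler integration.

```python
# Re-run the most business-critical backfills first.
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveryTask:
    name: str
    priority: int      # lower number = more important to the business
    date: str          # partition to backfill


def rerun(task: RecoveryTask) -> None:
    # Placeholder: submit the backfill job to the scheduler here.
    print(f"re-running {task.name} for {task.date}")


def recover(tasks: List[RecoveryTask]) -> None:
    for task in sorted(tasks, key=lambda t: t.priority):
        rerun(task)


recover([
    RecoveryTask("ads_daily_revenue", priority=0, date="2022-01-05"),
    RecoveryTask("ads_user_profile", priority=2, date="2022-01-05"),
    RecoveryTask("dws_order_summary", priority=1, date="2022-01-05"),
])
```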

Five, standardization of the management system

Even powerful technology cannot overcome chaotic management. Even with all the capabilities described above, if no rules are established, or the rules are incomplete, and the alarms that are raised go unhandled, we still cannot achieve early detection and early handling. Establishing a complete data development process and system is therefore an important guarantee of data quality. OK, that's all for today; we will continue to talk about data governance later.