1. Related concepts

1.1 Data Quality

Data quality is the extent to which a set of inherent properties of the data meets the requirements of data consumers.

1) Inherent attributes of data

  • Authenticity: the data is a true reflection of the objective world
  • Timeliness: the data keeps pace with changes in the real world
  • Relevance: the data is what data consumers care about and need

2) Requirements that high-quality data meets (the consumer's perspective)

  • Available: data is accessible to data consumers when they need it.
  • Timely: data is delivered and updated as needed.
  • Complete: data is complete, with nothing omitted.
  • Secure: data is protected from unauthorized access and tampering.
  • Comprehensible: data is understandable and interpretable.
  • True: data is a faithful reflection of the real world.

1.2 Data quality management

Data quality management refers to a series of management activities that identify, measure, monitor, and warn of the data quality problems that may arise at each stage of the data life cycle, from planning, acquisition, and storage through sharing, maintenance, and application, and that further improve data quality by raising the organization's management maturity.

2. Evaluation dimensions

Any improvement can only be implemented after an assessment has located the problems. Data quality and its management are generally measured along the following common dimensions:

1) Completeness

Completeness refers to whether the data is complete, with nothing missing. Missing data can be an entire record or a particular field within a record. Record completeness is generally audited with statistics such as record counts and unique-value counts, while missing fields within records can be audited with counts of NULL values. The proportion of null values in a field is usually fairly stable, so the null count can be used to compute a null ratio; if the ratio rises significantly, information is very likely missing from that field. In short, completeness can be measured with indicators such as record count, mean, unique-value count, and null ratio.
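
As a minimal sketch of these indicators, assuming tabular data in a pandas DataFrame (the column names below are hypothetical), a completeness audit might look like this:

```python
# Minimal completeness audit: record count, unique values, null ratio.
import pandas as pd

def completeness_report(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: record count, unique-value count, null ratio."""
    return pd.DataFrame({
        "records": len(df),              # total record count
        "unique_values": df.nunique(),   # distinct non-null values
        "null_count": df.isna().sum(),   # missing entries per field
        "null_ratio": df.isna().mean(),  # watch for sudden increases
    })

df = pd.DataFrame({"user_id": [1, 2, 3, None], "city": ["NY", None, None, "LA"]})
print(completeness_report(df))  # city has a 50% null ratio, worth a look
```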

2) Standardization

Standardization refers to whether records conform to specifications and are stored in the specified format (for example, according to standard coding rules). Auditing standardization is an important and complicated part of a data quality audit. It mainly tests whether the data is consistent with its definition, and can therefore be measured by the proportion of conforming records: for example, for an attribute whose value range is an enumerated set, the proportion of records whose actual value falls outside that set, or, for an attribute governed by a specific encoding rule, the proportion of records that violate the rule.
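
For example, a minimal sketch of such a conformance-rate check, assuming a hypothetical phone field with a simple illustrative encoding rule:

```python
# Minimal standardization check: share of records matching an encoding rule.
import pandas as pd

PHONE_RULE = r"^\d{3}-\d{4}$"  # assumed rule, purely for illustration

def conformance_rate(series: pd.Series, pattern: str) -> float:
    """Proportion of non-null values that match the rule's regex."""
    values = series.dropna().astype(str)
    if values.empty:
        return 1.0
    return float(values.str.match(pattern).mean())

df = pd.DataFrame({"phone": ["555-1234", "5551234", None, "123-4567"]})
print(conformance_rate(df["phone"], PHONE_RULE))  # 2 of 3 non-null conform
```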

3) Consistency

Consistency refers to whether the data is logically coherent. Logical relationships exist within single records and across multiple records. A consistency check verifies attributes that are logically related, for example that when attribute A takes a certain value, the value of attribute B falls within a specific range, and it can be measured by a compliance rate.
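
A minimal sketch of such a rule, using hypothetical status/close_date fields:

```python
# Minimal consistency check: when status == "closed", close_date must exist.
import pandas as pd

def consistency_rate(df: pd.DataFrame) -> float:
    """Compliance rate among the records the rule applies to."""
    applicable = df[df["status"] == "closed"]
    if applicable.empty:
        return 1.0
    return float(applicable["close_date"].notna().mean())

df = pd.DataFrame({
    "status": ["open", "closed", "closed"],
    "close_date": [None, "2021-04-01", None],
})
print(consistency_rate(df))  # 0.5: one closed record has no close_date
```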

4) Accuracy

Accuracy measures which data and information are incorrect or out of date. Inaccuracy can exist in individual records as well as across an entire data set. The difference between accuracy and standardization is that standardization focuses on conformance and uniformity, while accuracy focuses on errors in the data itself. A value can therefore be stored in the correct representation and still be wrong: if the actual value falls outside the defined range, and the defined range is accurate, then the value is meaningless and counts as a data error.

Errors that affect an entire field of the data set are easy to spot with summary statistics such as the mean and median. Individual outliers can be audited with maximum and minimum statistics, or made obvious with a boxplot.

Several other accuracy problems, such as garbled characters or truncated strings, can be found by examining value distributions: data records usually follow a normal or near-normal distribution, so values that occur in abnormally small proportions are likely problematic. The hardest case is a wrong value that shows no obvious anomaly and stays close to normal values; such errors can generally be found only by comparison against other sources or against independent statistical results.
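
As a minimal sketch of the boxplot audit mentioned above, using the standard IQR whisker rule on a hypothetical numeric column:

```python
# Minimal outlier audit using the boxplot (IQR) rule.
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], the boxplot whisker bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

amounts = pd.Series([10, 12, 11, 13, 12, 900])  # 900 looks suspicious
print(iqr_outliers(amounts))  # flags 900 for manual review
```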

5) Timeliness

Timeliness is the time interval between when data is generated and when it becomes viewable, also known as data latency. Some real-time analyses and decisions require hour-level or even minute-level data; such requirements demand highly timely data, so timeliness is also a component of data quality. A typical check is to define the expected latest data date for a table and verify that the data actually reaches it.
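
A minimal latency check might compare the newest record's timestamp against an allowed delay (the event_time column and the one-hour threshold are assumptions):

```python
# Minimal timeliness check: is the newest record within the latency budget?
from datetime import datetime, timedelta
import pandas as pd

def is_timely(df: pd.DataFrame, max_latency: timedelta) -> bool:
    """True if generation-to-visibility lag is within the threshold."""
    latest = pd.to_datetime(df["event_time"]).max()
    return (datetime.now() - latest) <= max_latency

df = pd.DataFrame({"event_time": ["2024-01-01 08:00", "2024-01-01 09:30"]})
print(is_timely(df, max_latency=timedelta(hours=1)))  # False for stale data
```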

6) Uniqueness

Uniqueness measures which data, or which attributes of the data, are duplicated. It gauges unwanted duplication of a particular field, record, or data set within or across systems.
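
A minimal sketch, measuring duplication on a hypothetical business key:

```python
# Minimal uniqueness check: share of rows whose key value is duplicated.
import pandas as pd

def duplicate_rate(df: pd.DataFrame, key: str) -> float:
    """keep=False marks every row of a duplicated key, not just repeats."""
    return float(df.duplicated(subset=[key], keep=False).mean())

df = pd.DataFrame({"order_id": [1, 2, 2, 3], "amount": [10, 20, 20, 30]})
print(duplicate_rate(df, "order_id"))  # 0.5: two of four rows share a key
```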

7) Rationality

Rationality judges whether the data is correct from the perspective of business logic. Its evaluation can follow the same practices as standardization and consistency.

8) Redundancy

Redundancy refers to whether unnecessary duplicate data exists across data layers.

9) Accessibility

Accessibility refers to whether data is easy to obtain, understand, and use.

3. Influencing factors

The factors affecting data quality come mainly from four areas: information factors, technical factors, process factors, and management factors.

1) Information factors

The main causes of data quality problems in this category are errors in describing and understanding metadata, data properties whose measurement cannot be guaranteed (for example, data source specifications that are not unified), and an inappropriate frequency of change.

2) Technical factors

These mainly refer to data quality problems caused by failures in the individual technical links of data processing. The problems chiefly arise in data creation, data acquisition, data transmission, data loading, data use, and data maintenance.

3) Process factors

These refer to data quality problems caused by improperly designed system processes and manual operating processes, arising mainly in the creation, transfer, loading, use, maintenance, and audit processes of system data.

4) Management factors

These refer to data quality problems caused by personnel competence and by the management mechanism, for example improper management or management gaps in personnel training, personnel administration, and reward and accountability measures.

4. Methods to solve quality problems

The following ten-step process can be followed (excerpted from the omidogo publication).

4.1 Define business requirements and methods

Find out which parts of the business are affected by data quality problems, or which would gain better business benefits from a data quality improvement; evaluate these business needs and rank them by importance to set the target and scope of the improvement effort. Only by clarifying the business requirements and approach can we ensure that the data quality problems being solved are tied to business needs and thus genuinely solve business problems.

4.2 Analyze the information environment

Refine the defined business requirements; identify the information related to them together with the data, data specifications, processes, organizations, and technologies (such as systems and software) involved; define the information life cycle; and determine the data sources and their scope. Analyzing the information environment not only supports the subsequent cause analysis, but also gives a more comprehensive and intuitive picture of the data problems and the current state.

4.3 Evaluate data quality

Extract data from the relevant data sources, design evaluation dimensions around the defined business requirements, and complete the assessment with suitable tools. Express the results accurately in charts or reports, so that the relevant leaders and business staff gain a clear, intuitive understanding of the actual state of data quality; this keeps the data problems tied to business needs and secures the attention and support of those leaders and business staff.

4.4 Assess business impact

Understand how poor-quality data affects the business, why that matters, and what business value an improvement would bring. More complex assessment methods take longer but are not necessarily proportionally more insightful, so choose the methodology carefully. Also, document business impact assessments promptly, so that problems can still be traced later even if their urgency fades.

4.5 Determine the root cause

Before correcting a data problem, determine its root cause, of which there may be several. Some problems are only surface symptoms rather than the root cause of the incorrect data, so during analysis keep tracing the data to locate where the problem first arose, or ask "why" several times until the root cause emerges. Only then can the problem be solved effectively, treating the cause as well as the symptoms.

4.6 Develop improvement plans

Building on the detailed problem analysis and cause determination of the previous steps, this step formulates reasonable data quality improvement plans, including remediation suggestions for known data problems and measures to prevent similar erroneous data in the future.

4.7 Prevent future data errors

Prevent future incorrect data according to the solution design.

4.8 Correct current data errors

Resolve existing data problems according to the solution design. This step is largely "dirty work," but it is critical to reaching the ultimate quality goal.

4.9 Control and monitor

Implement continuous monitoring to determine whether the desired results have been achieved.

4.10 Communicate actions and results

Communicate the results and progress to all stakeholders to keep the overall project moving forward continuously.

5. Data quality product design

5.1 Data product value

  • A complete method for organizing check standards, with indicator rule templates.
  • Automatic check execution and problem notification, enabling unattended operation.
  • A comprehensive data analysis mechanism that speeds up problem resolution.
  • A standardized problem management process and system that manages each stage of a problem precisely.
  • A sound mechanism for resolving and sharing quality problems, closing the loop of data governance.

5.2 Troubleshooting process

  • Define rules: data quality indicators
  • Find problems: data quality checks
  • Raise problems: quality alarms
  • Solve problems: quality problem analysis
  • Generalize problems: the problem management process

5.3 Main function modules

1) Quality assessment

Provide comprehensive data quality assessment capabilities, covering for example duplication, relevance, correctness, completeness, consistency, and compliance, and give the data a "physical examination" to identify and understand quality issues. With the evaluation system as a reference, data must be collected, analyzed, and monitored to provide comprehensive and reliable information about data quality. Collection points are set at key nodes of the data flow, and collection rules are configured at each point according to that system's data quality requirements; statistical analysis of the quality data gathered at a collection point yields the data analysis report for that point.
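
As a minimal sketch of a collection point, assuming the checks are configured as named functions over a pandas DataFrame (the rules shown are illustrative):

```python
# Minimal collection point: run configured checks, emit a per-point report.
from typing import Callable, Dict
import pandas as pd

Check = Callable[[pd.DataFrame], float]  # each check scores the data in [0, 1]

def run_collection_point(df: pd.DataFrame, checks: Dict[str, Check]) -> Dict[str, float]:
    """Apply every configured rule and gather the scores into one report."""
    return {name: check(df) for name, check in checks.items()}

checks: Dict[str, Check] = {
    "completeness": lambda df: float(df.notna().all(axis=1).mean()),
    "uniqueness": lambda df: 1.0 - float(df.duplicated().mean()),
}
df = pd.DataFrame({"id": [1, 2, 2], "city": ["NY", None, "LA"]})
print(run_collection_point(df, checks))
```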

2) Check execution

Generates check scripts from the configured measurement rules and check methods, and supports both scheduled execution of those scripts and integration with third-party scheduling tools.
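
A minimal sketch of in-process scheduled execution, using only the standard library; in practice this role is usually delegated to cron or a workflow scheduler, and run_all_checks is a hypothetical entry point:

```python
# Minimal scheduled execution of check scripts with the standard library.
import sched
import time

def run_all_checks() -> None:
    """Hypothetical entry point that executes every configured check."""
    print("running data quality checks...")

def schedule_checks(interval_seconds: int) -> None:
    """Re-run the checks at a fixed interval, a stand-in for cron."""
    scheduler = sched.scheduler(time.time, time.sleep)

    def tick() -> None:
        run_all_checks()
        scheduler.enter(interval_seconds, 1, tick)  # reschedule the next run

    scheduler.enter(interval_seconds, 1, tick)
    scheduler.run()  # blocks; runs the checks every interval_seconds

# schedule_checks(3600)  # e.g. execute the check scripts hourly
```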

3) Quality control

The system provides an alarm mechanism: thresholds are set on check rules or methods, and rules that exceed their thresholds generate alarms and notifications at different severity levels.
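
A minimal sketch of the threshold-and-alarm idea; the metric, levels, and notify stand-in are assumptions, not a specific product's API:

```python
# Minimal threshold alarm: map a metric value to an alarm level and notify.
from typing import Optional

def evaluate_threshold(value: float, warn: float, critical: float) -> Optional[str]:
    """Higher metric value means worse quality (e.g. a null ratio)."""
    if value >= critical:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return None

def notify(metric: str, level: str, value: float) -> None:
    print(f"[{level}] {metric} = {value:.2%}")  # stand-in for mail/IM/webhook

level = evaluate_threshold(0.12, warn=0.05, critical=0.10)
if level:
    notify("null_ratio", level, 0.12)  # -> [CRITICAL] null_ratio = 12.00%
```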

4) Problem management

Support process-based handling of data problems, standardize the mechanism and steps of problem handling, strengthen accountability for problems, and improve data quality. The quality evaluation system and the quality data collection system surface problems; the organization then needs to react to them promptly, trace their causes and formation mechanisms, take improvement measures suited to each type of problem, and continuously track and verify the effect of those improvements, forming a positive feedback loop of continuous data quality improvement.

Establish data standards or access standards at the source, standardize data definitions, and build processes and systems that monitor the quality of data transformations as data flows downstream. Solve problems where they are found, and do not let problem data reach the back end.

5) Quality report

In addition to the built-in common quality reports, the system provides a rich API for custom development of data quality reports.

6) Quality analysis

Provide a variety of problem analysis capabilities, including lineage analysis, impact analysis, and whole-chain analysis, to locate the root cause of a problem.
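
As a minimal sketch of impact analysis over a lineage graph (the table names and edges are hypothetical), a breadth-first traversal finds every downstream asset fed by a problematic table:

```python
# Minimal impact analysis: walk the lineage graph downstream from a problem.
from collections import deque
from typing import Dict, List, Set

def impact_analysis(lineage: Dict[str, List[str]], source: str) -> Set[str]:
    """BFS over lineage (table -> direct consumers) from the problem table."""
    impacted: Set[str] = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for downstream in lineage.get(node, []):
            if downstream not in impacted:
                impacted.add(downstream)
                queue.append(downstream)
    return impacted

lineage = {
    "ods_orders": ["dwd_orders"],
    "dwd_orders": ["dws_sales_daily", "dws_user_stats"],
    "dws_sales_daily": ["ads_sales_report"],
}
print(impact_analysis(lineage, "ods_orders"))  # every asset fed by ods_orders
```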

Author: Han Feng

First published on the author's personal WeChat official account, "Han Feng Channel".

Source: Creditease Institute of Technology