1. Operations on the Apache Griffin user interface
Apache Griffin is an open source data quality solution for distributed data systems of any size, in both streaming and batch contexts.
It also provides an Angular interface that makes it easier to manually set source data, target data, metrics, and results.
2. Process
After logging in to the system, perform the following operations:
- First, create a new metric.
- Then, create a job to process the metric on a regular basis.
- Finally, view the metric's data charts on the heatmap and dashboard.
2.1 Data sources
Click “DataAssets” in the upper right corner to view the data assets.
Here you can view all the data sources.
2.2 Creating measures
Click Measures, then select Create Measures. You can use a measure to process the data and get the desired results. There are four main types of measure to choose from; the two described here are:
- Select Accuracy if you want to measure the degree of match between the source and the target.
- Select Profiling if you want to check the data for a specific value (for example, the number of empty columns).
Currently, only Accuracy measures can be created in the UI.
2.2.1 Accuracy
Definition: measures how well the source data matches the target data.
Steps:
1. Select source data
Select the source database and fields to be compared.
2. Select a target
Select the target database and fields to compare.
3. Map the source and target
- Step 1: “Map To”: select the rule used to match source and target data. There are 6 options to choose from:
  i. = : The data in the two columns should match exactly.
  ii. != : The data in the two columns should be different.
  iii. > : The target column data must be larger than the source column data.
  iv. >= : The target column data must be greater than or equal to the source column data.
  v. < : The target column data must be smaller than the source column data.
  vi. <= : The target column data must be smaller than or equal to the source column data.
- Step 2: “Source fields”: select the source column to be compared with the target column.
4. Configure partitions
Set the partition configuration for the source and target datasets. The partition size is the minimum unit of data in the Hive database and is used to split the data you want to calculate. The done file setting is the format of the done file path.
5. Configuration
Set the information required for the measure. Organization is the group the measure belongs to; you can later manage the measure dashboard by group.
6. Measure information
After creating a new accuracy measure, you can examine it by selecting it in the list on the measures page. For example:
Suppose source table A has 1000 records and target table B has only 999 records that match A exactly in the selected fields. Then accuracy = 999 / 1000 * 100% = 99.9%.
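The mapping rules and the accuracy figure can be illustrated with a small Python sketch. This is not how Griffin computes the measure internally; the names `RULES`, `matches`, and `accuracy`, the sample tables, and the brute-force pairwise comparison are all assumptions made for illustration.

```python
import operator

# The six "Map To" rules. Each one compares a target value against a
# source value, e.g. ">" means the target must be larger than the source.
RULES = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(source_row, target_row, mappings):
    """Return True if target_row satisfies every mapping rule against source_row.

    mappings is a list of (source_field, rule, target_field) tuples.
    """
    return all(
        RULES[rule](target_row[target_field], source_row[source_field])
        for source_field, rule, target_field in mappings
    )


def accuracy(source_rows, target_rows, mappings):
    """Percentage of source rows that have at least one matching target row."""
    if not source_rows:
        raise ValueError("source has no records")
    matched = sum(
        1
        for s in source_rows
        if any(matches(s, t, mappings) for t in target_rows)
    )
    return matched / len(source_rows) * 100


# Hypothetical data: source table A has 1000 records, target table B
# is missing one of them.
table_a = [{"id": i, "amount": i * 10} for i in range(1000)]
table_b = [{"id": i, "amount": i * 10} for i in range(999)]
mappings = [("id", "=", "id"), ("amount", "=", "amount")]

print(f"accuracy = {accuracy(table_a, table_b, mappings):.1f}%")
```

With both fields mapped using `=`, 999 of the 1000 source records find a match, so the script prints `accuracy = 99.9%`, the same figure as the worked example.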
2.3 Creating a job
Click Jobs and choose Create Job. You can submit jobs to run measures periodically.
Currently, the UI only supports simple periodic measure jobs.
Fill in the job configuration block:
- Job name: the name to give the job you are submitting.
- Measure name: the name of the measure to schedule. Select it from the list of measures you created earlier.
- Cron Expression: the cron expression used by the scheduler, in Quartz format with a leading seconds field. For example, 0 0/4 * * * ? runs the job every four minutes.
- Start: the start time of the data segment, relative to the trigger time.
- End: the end time of the data segment, relative to the trigger time.
Once the job is submitted, Apache Griffin schedules the job in the background, and when the calculation is complete, the results can be viewed on a monitoring dashboard.
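To make the Start and End fields more concrete, the sketch below works out which data segment a single job run would cover for a given trigger time. It is an illustration only: the function name `data_segment`, the use of `timedelta` offsets, and the sample times are assumptions, not Griffin's actual configuration format.

```python
from datetime import datetime, timedelta


def data_segment(trigger_time, start_offset, end_offset):
    """Return the (start, end) of the data segment relative to trigger_time.

    Offsets are timedeltas; a negative offset points before the trigger time.
    """
    start = trigger_time + start_offset
    end = trigger_time + end_offset
    if start >= end:
        raise ValueError("segment start must be earlier than segment end")
    return start, end


# Hypothetical run triggered at 12:04, covering the hour before the trigger.
trigger = datetime(2024, 1, 1, 12, 4)
start, end = data_segment(trigger, timedelta(hours=-1), timedelta(0))
print(start.isoformat(), end.isoformat())
```

Each time the scheduler fires, the same offsets are applied to the new trigger time, so successive runs process successive windows of data.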
3. Metrics dashboard
Once the processing is complete, there are three ways to view the data charts.
1. Click Health. The heatmap of metrics is displayed.
2. Click “DQ Metrics” to view the metric icons. Click a chart to get a larger view of it and to see the metrics for the selected time window.
3. The metrics are displayed on the right of the page. Click a metric to get a chart and detailed information about its results.