The data set
This case applies to data quality monitoring of batch data sources such as Hive tables on HDFS.
Suppose we have a dataset (demo_src) partitioned by hour, and we want to know what the data looks like for each hour.
For simplicity, assume the dataset has the following schema:
id bigint
age int
desc string
dt string
hour string
Both dt and hour are partition columns: each day gets a daily partition dt (e.g. 20180912), and within each day there are 24 hourly partitions (00, 01, 02, ..., 23).
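On HDFS this partitioning maps to one directory per hour. A minimal sketch of how to list them, assuming the table LOCATION used in the data-preparation step below (the dt value is illustrative):
hdfs dfs -ls hdfs:///griffin/data/batch/demo_src/dt=20180912/
# .../demo_src/dt=20180912/hour=00
# .../demo_src/dt=20180912/hour=01
# ...
# .../demo_src/dt=20180912/hour=23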
Environment preparation
Prepare the environment for the Apache Griffin measurement module, including the following components:
- JDK (1.8+)
- Hadoop (2.6.0+)
- Spark (2.2.1+)
- Hive (2.2.0)
For detailed configuration procedures for these components, see griffin/griffin-doc/deploy. This article assumes they have already been configured. For version-matching information, see github.com/apache/grif…
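A quick way to confirm the installed versions before proceeding (a sketch; adjust to however the tools are installed on your cluster):
# Each command prints the version of the corresponding component.
java -version
hadoop version
spark-submit --version
hive --version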
Build the Apache Griffin measurement module
1. Download the Apache Griffin source package here.
2. Decompress the source package:
unzip griffin-0.4.0-source-release.zip
cd griffin-0.4.0-source-release
3. Build the Apache Griffin jar:
mvn clean install
4. Move the built Apache Griffin jar to your work path:
mv measure/target/measure-0.4.0.jar <work path>/griffin-measure.jar
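To sanity-check the build, you can verify the jar contains the entry class that spark-submit will use later (a hypothetical check, not part of the official steps):
jar tf <work path>/griffin-measure.jar | grep org/apache/griffin/measure/Application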
Data preparation
To get started quickly, we create the Hive table demo_src.
-- create hive tables here (HQL script)
-- Note: replace the HDFS location with your own path
CREATE EXTERNAL TABLE `demo_src`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY ( `dt` string, `hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_src';
The data format looks like this:
1|18|student
2|23|engineer
3|42|cook
...
You can download the demo data and run ./gen_demo_data.sh to obtain the data source files. We then load the data into the Hive table hour by hour, as sketched below.
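A minimal loading sketch for one hour of data, assuming gen_demo_data.sh produced a delimited file named demo_src_data (the file name and the dt/hour values here are illustrative):
# Upload the hour's data under the external table's partition directory.
dt=20180912
hour=09
hdfs dfs -mkdir -p hdfs:///griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hdfs dfs -put demo_src_data hdfs:///griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/
# Register the partition so Hive (and Griffin) can see it.
hive -e "ALTER TABLE demo_src ADD IF NOT EXISTS PARTITION (dt='${dt}', hour='${hour}')"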
Define data quality metrics
Apache Griffin environment configuration file: env.json
{
  "spark": {
    "log.level": "WARN"
  },
  "sinks": [
    {
      "type": "console"
    },
    {
      "type": "hdfs",
      "config": {
        "path": "hdfs:///griffin/persist"
      }
    },
    {
      "type": "elasticsearch",
      "config": {
        "method": "post",
        "api": "http://es:9200/griffin/accuracy"
      }
    }
  ]
}
Define the Griffin data quality (DQ) configuration file: dq.json
{
  "name": "batch_prof",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
            "database": "default",
            "table.name": "demo_src"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "profiling",
        "out.dataframe.name": "prof",
        "rule": "src.id.count() AS id_count, src.age.max() AS age_max, src.desc.length().max() AS desc_length_max",
        "out": [
          {
            "type": "metric",
            "name": "prof"
          }
        ]
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS"]
}
Measure data quality
Submit the measurement job to Spark with the configuration file path as a parameter.
spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
--driver-memory 1g --executor-memory 1g --num-executors 2 \
<path>/griffin-measure.jar \
<path>/env.json <path>/dq.json
Report data quality metrics
The compute log is available in the console, and when the job completes the resulting metrics are printed there. The metrics are also stored in HDFS under hdfs:///griffin/persist/<job name>/<timestamp>/_METRICS.
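A quick way to inspect the persisted metrics from the command line (the job name batch_prof comes from dq.json; the timestamp directory is generated per run):
# List the runs, then dump the metric records of each run.
hdfs dfs -ls hdfs:///griffin/persist/batch_prof/
hdfs dfs -cat 'hdfs:///griffin/persist/batch_prof/*/_METRICS'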
Optimize data quality reporting
Data quality measures can be further refined based on the results and on actual business needs.
For details about the configuration parameters, see griffin/griffin-doc/measure.