The data set
This case applies to data quality monitoring of batch data sources such as Hive tables on HDFS.
Suppose we have a dataset (demo_src) partitioned by hour, and we want to know what the data looks like for each hour.
For simplicity, assume the dataset has the following schema:
id bigint
age int
desc string
dt string
hour string
Both dt and hour are partition columns: each day gets a daily partition dt (e.g. 20180912), and within each day there are 24 hourly partitions (00, 01, 02, ..., 23).
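On HDFS this partitioning maps to one directory per hour. A minimal sketch of how to list them, assuming the table LOCATION used in the data-preparation step below (the dt value is illustrative):
hdfs dfs -ls hdfs:///griffin/data/batch/demo_src/dt=20180912/
# .../demo_src/dt=20180912/hour=00
# .../demo_src/dt=20180912/hour=01
# ...
# .../demo_src/dt=20180912/hour=23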
Environment preparation
Prepare the environment for the Apache Griffin measurement module, including the following components:
- JDK (1.8+)
- Hadoop (2.6.0+)
- Spark (2.2.1+)
- Hive (2.2.0)
For detailed configuration procedures for these components, see griffin/griffin-doc/deploy. This article assumes they have already been configured. For version-matching information, see github.com/apache/grif…
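A quick way to confirm the installed versions before proceeding (a sketch; adjust to however the tools are installed on your cluster):
# Each command prints the version of the corresponding component.
java -version
hadoop version
spark-submit --version
hive --version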
Build the Apache Griffin measurement module
1. Download the Apache Griffin source package here.
2. Decompress the source package:
unzip griffin-0.4.0-source-release.zip
cd griffin-0.4.0-source-release
3. Build the Apache Griffin jar:
mvn clean install
4. Move the built Apache Griffin jar to your work path:
mv measure/target/measure-0.4.0.jar <work path>/griffin-measure.jar
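To sanity-check the build, you can verify the jar contains the entry class that spark-submit will use later (a hypothetical check, not part of the official steps):
jar tf <work path>/griffin-measure.jar | grep org/apache/griffin/measure/Application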
Data preparation
To get started quickly, we create the Hive table demo_src.
-- create hive tables here (HQL script)
-- Note: replace the HDFS location with your own path
CREATE EXTERNAL TABLE `demo_src`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY ( `dt` string, `hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_src';
The data format looks like this:
1|18|student
2|23|engineer
3|42|cook
...
You can download the demo data and run ./gen_demo_data.sh to obtain the data source files. We then load the data into the Hive table hour by hour, as sketched below.
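A minimal loading sketch for one hour of data, assuming gen_demo_data.sh produced a delimited file named demo_src_data (the file name and the dt/hour values here are illustrative):
# Upload the hour's data under the external table's partition directory.
dt=20180912
hour=09
hdfs dfs -mkdir -p hdfs:///griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hdfs dfs -put demo_src_data hdfs:///griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/
# Register the partition so Hive (and Griffin) can see it.
hive -e "ALTER TABLE demo_src ADD IF NOT EXISTS PARTITION (dt='${dt}', hour='${hour}')"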
Define data quality metrics
Apache Griffin environment configuration file: env.json
{
  "spark": {
    "log.level": "WARN"
  },
  "sinks": [
    {
      "type": "console"
    },
    {
      "type": "hdfs",
      "config": {
        "path": "hdfs:///griffin/persist"
      }
    },
    {
      "type": "elasticsearch",
      "config": {
        "method": "post",
        "api": "http://es:9200/griffin/accuracy"
      }
    }
  ]
}
Define the Griffin data quality (DQ) configuration file: dq.json
{
  "name": "batch_prof",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
            "database": "default",
            "table.name": "demo_src"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "profiling",
        "out.dataframe.name": "prof",
        "rule": "src.id.count() AS id_count, src.age.max() AS age_max, src.desc.length().max() AS desc_length_max",
        "out": [
          {
            "type": "metric",
            "name": "prof"
          }
        ]
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS"]
}
Measure data quality
Submit the measurement job to Spark with the configuration file path as a parameter.
spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
--driver-memory 1g --executor-memory 1g --num-executors 2 \
<path>/griffin-measure.jar \
<path>/env.json <path>/dq.json
Report data quality metrics
The compute log is available in the console, and when the job completes the resulting metrics are printed there. The metrics are also stored in HDFS under hdfs:///griffin/persist/<job name>/<timestamp>/_METRICS.
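A quick way to inspect the persisted metrics from the command line (the job name batch_prof comes from dq.json; the timestamp directory is generated per run):
# List the runs, then dump the metric records of each run.
hdfs dfs -ls hdfs:///griffin/persist/batch_prof/
hdfs dfs -cat 'hdfs:///griffin/persist/batch_prof/*/_METRICS'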
Optimize data quality reporting
Data quality measures can be further refined based on the results and on actual business needs.
For details about the configuration parameters, see griffin/griffin-doc/measure.