Apache Griffin is a data quality service platform (DQSP) built on Apache Hadoop and Apache Spark. It provides a comprehensive framework to handle diverse tasks, such as defining data quality models, performing data quality measurements, automating data analysis and validation, and unifying data quality visualization across multiple data systems. It aims to address the challenges in the field of data quality in big data applications.
1. Background
Measuring data quality is an unavoidable problem in big data applications. In response, different teams have built custom tools to detect and analyze data quality issues in their own domains. Apache Griffin is intended to provide shared infrastructure and common capabilities that address the common pain points of data quality and help build trusted data assets.
Verifying data quality is difficult and costly when large volumes of related data flow across multiple platforms, both streaming and batch. eBay’s real-time personalization platform, for example, validates the data quality of about 600 million records a day. At this scale and complexity, data quality problems are hard to avoid.
eBay encountered the following data quality problems:
- Lack of an end-to-end, unified view of data quality from the many data sources to the target applications, one that takes data lineage into account (that is, the full lifecycle of data: where it comes from and how it moves over time). As a result, identifying and fixing data quality issues takes a long time.
- Lack of a unified, self-service system for measuring the data quality of streaming data: a set of components with which teams can register data sets, define data quality models, visualize and monitor data quality with simple tools, and be alerted when problems are detected.
- Lack of a shared platform and exposed API services. Teams should not have to reinvent the wheel, nor should each of them have to procure and manage its own hardware and software infrastructure to solve this common problem.
Apache Griffin, a data quality service, was created to address these shortcomings.
2. Architecture
Apache Griffin includes:
Data quality modeling engine: Apache Griffin is a model-driven solution that lets users validate data quality by selecting from a variety of data quality dimensions against a chosen source or target dataset. It has corresponding library support on the back end and currently covers the following dimensions and capabilities:
- Accuracy – the degree to which the data reflects real-world objects or a verifiable source (a computation sketch follows this list)
- Completeness – whether all necessary data is present
- Validity – whether all data values fall within the data domains specified by the business
- Timeliness – whether data is available when it is needed
- Anomaly detection – pre-built algorithms that identify items, events, or observations that do not conform to an expected pattern or to the other items in the data set
- Data profiling – statistical analysis and assessment of the data values in a data set for consistency, uniqueness, and logic
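To make the accuracy dimension concrete, here is a minimal sketch of the underlying idea, not Apache Griffin’s internal implementation: with Spark SQL, count the source records that find a matching record in the target table and divide by the total source count. The table and column names (`source_tbl`, `target_tbl`, `id`, `amount`) are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object AccuracySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("accuracy-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder Hive tables: source_tbl is the system of record,
    // target_tbl is the downstream copy whose quality is being measured.
    val source = spark.table("source_tbl")
    val target = spark.table("target_tbl")

    // A source record counts as "accurate" if a record with the same
    // id and amount exists in the target.
    val matched = source.join(target, Seq("id", "amount"), "left_semi").count()
    val total   = source.count()

    val accuracy = if (total == 0) 1.0 else matched.toDouble / total
    println(f"matched=$matched total=$total accuracy=$accuracy%.4f")

    spark.stop()
  }
}
```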
Data Collection layer:
Apache Griffin supports two kinds of data sources: batch data and real-time (streaming) data.
For batch mode, data can be collected from the Hadoop platform through various data connectors.
For real-time mode, you can connect to a messaging system like Kafka for near-real-time analysis.
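As an illustration of the two collection modes, the sketch below reads a Hive table for batch analysis and subscribes to a Kafka topic for near-real-time analysis using Spark Structured Streaming. The database, table, broker, and topic names are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object CollectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collection-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Batch mode: pull a snapshot of a Hive table from the Hadoop platform.
    val batchDf = spark.table("dq_demo.orders")

    // Near-real-time mode: subscribe to a Kafka topic
    // (placeholder broker and topic names).
    val streamDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "orders_topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // In a real job the stream would be started with writeStream;
    // here we only show that both modes produce DataFrames.
    batchDf.printSchema()
    streamDf.printSchema()
  }
}
```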
Data processing and storage layer:
For batch analysis, the data quality model computes the data quality metrics on the Spark cluster from the data sources in Hadoop.
For near-real-time analysis, data is consumed from the messaging system, and the data quality model then computes real-time data quality metrics on the Spark cluster. For metric storage, Elasticsearch is used on the back end to serve front-end requests.
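To picture the storage side, the following sketch posts a computed metric as a JSON document to an Elasticsearch index over Elasticsearch’s standard REST API. The host, index name, and document fields are illustrative placeholders, not Griffin’s actual metric schema.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object MetricSinkSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder metric document; Griffin's real metric schema may differ.
    val doc =
      """{"measure":"orders_accuracy","value":0.9987,"timestamp":1700000000000}"""

    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:9200/griffin_metrics/_doc")) // placeholder host and index
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(doc))
      .build()

    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(s"Elasticsearch responded: ${response.statusCode()} ${response.body()}")
  }
}
```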
Apache Griffin services:
The project provides RESTful services that expose all of Apache Griffin’s functions, such as exploring data sets, creating data quality metrics, publishing metrics, retrieving metrics, adding subscriptions, and more. Developers can therefore build their own user interfaces on top of these web services.
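As an example of building on these services, a client can list the registered measures with a plain HTTP GET. The host and port below are placeholders, and the `/api/v1/measures` path follows the Griffin service API guide but should be verified against the version you deploy.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object GriffinClientSketch {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()

    // Placeholder host and port; the path is taken from the Griffin API guide
    // and may differ between releases.
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8080/api/v1/measures"))
      .GET()
      .build()

    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(s"HTTP ${response.statusCode()}: ${response.body()}")
  }
}
```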
3. List of supported functions
The current version supports the following DQ functions:
- Data asset discovery: after the corresponding configuration is enabled in the service module, Apache Griffin can use the Hive metastore service to discover Hive table metadata.
- Metrics management: through the UI, users can create, delete, and update three types of metrics: accuracy, profiling, and publish metrics. By invoking the service API, users can create, delete, and update six categories of metrics: accuracy, profiling, timeliness, uniqueness, completeness, and publish metrics.
- Job management: users can create and delete jobs that schedule batch calculations of metrics, specifying the data range of each calculation and additional trigger conditions such as “done” files on HDFS (see the sketch after this list).
- Metric calculation on the Spark cluster: the service module submits calculation jobs to the Spark cluster via Apache Livy, and the measure module computes the metrics and, by default, persists them to Elasticsearch.
- Metrics visualization: through the service API, the metrics of each job can be retrieved from Elasticsearch. On the UI, accuracy metrics are presented as charts and profiling metrics as tables.
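The “done file” trigger mentioned above can be pictured as a simple existence check on HDFS before a scheduled calculation runs. The following is a minimal sketch using the Hadoop FileSystem API; the done-file path is a placeholder.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DoneFileCheckSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration() // picks up core-site.xml / hdfs-site.xml from the classpath
    val fs   = FileSystem.get(conf)

    // Placeholder done-file path written by the upstream pipeline.
    val doneFile = new Path("/data/orders/dt=20240101/_DONE")

    if (fs.exists(doneFile)) {
      println(s"$doneFile found; the scheduled DQ calculation can start.")
    } else {
      println(s"$doneFile not found; skip this run and retry later.")
    }
  }
}
```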
Features planned for future versions:
- More data source types: currently, Apache Griffin supports only Hive tables and Avro files on HDFS as batch data sources, and Kafka as the streaming data source. The project plans to support more data source types such as RDBMS and Elasticsearch.
- More data quality dimensions: Apache Griffin plans to support additional dimensions such as consistency and validity.
- Anomaly detection: Apache Griffin plans to support anomaly detection by analyzing the calculated metrics stored in Elasticsearch.