Warehouse metadata management

1. Metadata definition

Meta Data in a narrow sense can be understood as Data describing Data, and in a broad sense it refers to business Data except for direct read and write processing of business logic. All the information and Data required to maintain the operation of the whole system can be called metadata.

What to make of the above sentence? Let me tell you something about my son. If you haven’t seen the cartoon, you certainly can’t understand it. But how about the following description?

Xiao Tou son is a 150cm tall boy, weighing 100 catties. He was born in 2008 and is now 13 years old. His father’s name is Xiao Tou Dad, now he is in middle school.

In the description of the small head son, so how to describe? From height, weight, sex, family relationship, age, etc. This described information can be understood as metadata.

Similarly, if I tell you the number 175, you can be sure that it has some measure significance, but you still have a lot of questions, such as what is measured, who is measured and so on, but what if you measure according to the following information?

The data type The data content
The data values 175
unit cm
indicators Average height
Statistical time In 2020,
area The national
Scope of the crowd Adult male
The scope of 80-260.
Database type Mysql
Database link JDBC: mysql: / / 127.0.0.1:3306 / info
The name of the table human_stat
field height_avg
founder Li si
Creation time 2020-10-01
Modify the time 2020-10-01
Data access public
Security level security
Quality grade Very high

Does that make it clear? 175 means: In 2020 statistics of the adult male average height, the value of a reasonable range is 80-260 cm, the existing MySQL data, access is JDBC: MySQL: / / 127.0.0.1:3306 / info, by the National Bureau of Statistics of the bill was founded in October 1, 2020, The data is currently open, very safe, and the quality has been confirmed multiple times.

What descriptive terms are used in the table above to describe the data 175? Unit, indicator, statistical time, statistical range, reasonable range, database, table, field, creator, creation time, data permission, quality level, etc. These are all describing the number 175. We call the other data that describes the number 175 “metadata.”

Metadata types

Metadata can be divided into technical metadata, business metadata, and management process metadata.

  1. Technical metadata

IT describes data related to data warehouse development, management and maintenance, including data source information, data transformation description, data warehouse model, data cleaning and updating rules, data mapping and access permissions, etc. Such as table structure, file path/format.

  1. Business metadata

It serves management and business analysts to describe data from a business perspective, including business terms, what data is in the data warehouse, the location of data and the availability of data, and helps business people better understand what data is available in the data warehouse and how to use it. Such as the responsible person, the business of attribution, blood relationship.

  1. Manage process metadata

Refers to the data describing concepts, relationships and rules related to the management field, mainly including management process, personnel organization, role responsibilities and other information; For example, the number of rows, size and update time of the table every day.

For the types of metadata that are easier to understand, the metadata types in the above examples are classified and divided:

Business metadata

The data type The data content
The data values 175
unit cm
indicators Average height
Statistical time In 2020,
area The national
Scope of the crowd Adult male
The scope of 80-260.

Technical metadata

The data type The data content
Database type Mysql
Database link JDBC: mysql: / / 127.0.0.1:3306 / info
The name of the table human_stat
field height_avg

Manage process metadata

The data type The data content
founder Li si
Creation time 2020-10-01
Modify the time 2020-10-01
Data access public
Security level security
Quality grade Very high

Metadata function

  1. Blood analysis

The data warehouse is built on Hive. The original data of Hive comes from production systems and may export the calculation results to external storage. Therefore, Hive tables, mysql tables, hbase tables, and BI reports are considered tables. The relationship between these “tables” is a DAG, or blood relationship.

With kinship, optimized visualization allows users to see the upstream and downstream of a table, making it easier to find tables. There are many things that can be done based on kinship, for example:

  • Combined with the update time of the table, the critical path of scheduling DAG can also be found to help locate performance bottlenecks.
  • When the table changes, you can notify the downstream responsible person, and automatically do SQL static check for downstream tasks;
  • Assist life cycle management by finding tables/fields that are not being used;
  • Assist in maintaining consistency of fields, such as annotation and validation rule reuse.
  1. The impact analysis

Trace down the impact of metadata objects downstream. Impact analysis makes it easy to deal with the possible impact of changes, automatically identify dependencies and potential impacts, track all objects and their dependencies, and finally provide a visual display of the full life cycle of the data. For example, if you are going to change “Sales” from including taxes to excluding taxes in one of your information systems, the SE-DWA will automatically display all the “Sales Amount” fields used so that you can determine what needs to be done and recommend that you do it before making the change.

  1. Sync check

Check whether the data structure from the source table to the target table has changed.

  1. Index consistency analysis

Periodically analyze whether the definition of indicators is consistent with the actual situation.

  1. Entity associated query

The fact table is automatically associated with the proxy key of the dimension table

4. Metadata application

  1. ETL Automated management: Use metadata information to automatically generate physical models, ETL program scripts, task dependencies, and schedulers.
  2. Data quality management: Data quality measurement using data quality rule metadata. Data quality Helps you filter out problematic data and intelligently analyze data quality defects based on set rules.
  3. Data security management: Security goes a step further by limiting the access of certain group members to certain data in a table.
  4. Data standards management: Use metadata information to generate standard dimensional models.
  5. Data interface management: Manages interfaces in a unified manner using metadata information. Multiple data source access, and provide a variety of plug-ins to connect to the most popular source systems. It should be easy to get the data.
  6. Project document management: Using metadata to automatically and easily generate robust and comprehensive project documents to help you meet various requirements for data compliance. Read the metadata model and generate a description file in PDF format. Generate documentation You view the name, Settings, description, and code for each object.
  7. Data semantic management: Challenges faced by business users in self-service analysis they do not understand the data warehouse and therefore cannot interpret the data properly. Metadata can be used to model the semantic layer and transform the data using descriptions that are easy for business users to understand.

Five, the summary

It can be seen from the above that Meta Data not only records the definition of the model in the Data warehouse, the mapping relationship between different levels, the Data state of the Data warehouse and the task running state of ETL. Metadata connects the loose components of a data warehouse system to form an overall data warehouse solution.

One of the main steps in building a data warehouse is ETL. Metadata will play an important role in defining the mapping from the source data system to the data warehouse, the rules of data transformation, the logical structure of the data warehouse, the rules of data update, the history of data import, and the loading cycle. Metadata is how data extraction and transformation specialists and data warehouse administrators efficiently build data warehouses.

When using data warehouse, users can access data through metadata, define the meaning of data items and customize reports. The size and complexity of a data warehouse depends on proper metadata management, including adding or removing external data sources, changing data cleansing methods, controlling errant queries, and scheduling backups.

Data warehouse upstream and downstream and upstream and downstream engagements

Based on the characteristics and positioning of data warehouse, data warehouse strongly depends on the upstream business system, and the downstream report and visualization platform strongly depends on data warehouse.

1. Upstream contract of counting warehouse

The most important data source of data warehouse mainly comes from business system, so data needs to be continuously imported into data warehouse from business system. Changes in the upstream business system will have a very direct impact on the downstream. So it’s extremely important to have strict upstream conventions.

Common conventions are as follows:

  • Table structure change
  • Enumeration values of table fields define changes
  • The create_time and update_time fields are used to synchronize service data with the data store system and trigger new and modified tongues
  • Is_delete and IS_Valid definition specifications

Second, the downstream agreement of the number warehouse

For logarithmic warehouse, the downstream system is generally used for interfaces, reports, visual platforms and so on. Therefore, the optimization and change of the warehouse platform need to be timely communicated and synchronized.

If you think the author wrote well! You can follow the author’s public account “White Programmer’s self-study room” to get more content. Writing is not easy, you can also like, follow, comment to give an encouragement, hahaha.