What is a data warehouse

Literally, a data warehouse is a warehouse for storing data: it holds all kinds of data, organized and stored according to certain structures and rules. That raises an obvious question: is a database the same thing as a data warehouse?

Database vs. data warehouse

By "database" we usually mean a relational database (MySQL, Oracle, PostgreSQL…), though there are plenty of non-relational databases as well. Databases mainly store business data; so what kind of data does a data warehouse hold? When talking about the difference, we usually refer to OLTP and OLAP.

  • OLTP: Online Transaction Processing, used to process business transactions; it must handle high concurrency and take transactions into account

  • OLAP: Online Analytical Processing, which focuses on analysis; it produces a large number of queries and generally involves few inserts, updates, or deletes

Their differences come up in interviews too; there are a few main angles to discuss them from.

A data warehouse is really a system, not a single technology: it integrates many existing technologies to organize and manage data better. Traditional warehouses were built on relational databases; then came distributed databases such as Greenplum, and a number of vendors offer complete hardware-based appliance solutions. Traditional warehouses evolved under tight hardware constraints, which imposed many strict requirements. With hardware prices falling, the cloud widely adopted, and big data technology mature, many data warehouse scenarios have changed: some rules no longer need to be strictly followed, which also saves a lot of cost.

A few years ago the data warehouse felt a little mysterious and lofty. Now, at least at Internet companies, everyone knows about data warehouses and data platforms and can say a word or two about them. A few interview questions always come up:

  • What is a data warehouse?

  • What are the characteristics of a warehouse?

  • What is OLAP? What is OLTP? What’s the difference?

  • What is a zipper table? How do you implement one?

  • What are the methods of synchronization?

  • Why do incremental loads? How do you implement them?

  • What is ETL?

Data warehouse roles at Internet companies these days seem to emphasize business understanding plus modeling ideas, and I'm not very good at interviewing on those. When recruiting last year, I mostly asked some basic questions, talked through the candidate's past work, and asked SQL questions. If you really want to learn modeling, finding a good book helps a lot.

Traditional warehouses vs. Internet warehouses

Behind a traditional data platform there is usually a complete data warehouse team serving the business side, which passively waits to have its needs met. The middle and lower data layers are basically not open to the business side, so whatever modeling method the data model adopts, it mainly has to satisfy the data architecture planning of the time.

With the rapid development of Internet businesses, the focus has shifted from operations and analysis to data-driven fine-grained operations. When resources run short, users complain, and some business teams roll up their sleeves and join in sorting, processing, and analyzing data themselves.

At this point, the roles on the original data platform (data development, model design) may shift toward supporting these non-specialist data users: training, consulting, helping solutions land, writing schemes better suited to the company's current data applications, and developing data products.

When an Internet data platform becomes free and open and everyone who uses data participates in building the data system, lack of professionalism basically leads to data quality problems: storage and compute wasted on duplicated data, inconsistent calibers, inconsistent encodings, naming problems, and so on. Data quality becomes a particularly salient issue.

Data warehouse architectures

When people talk about data warehouses now, they increasingly mean the data platform or infrastructure; the warehouse has been absorbed into the construction of the whole infrastructure. Rather than discussing how the various Hadoop components collaborate, let's talk about the layered architecture of the warehouse.

When it comes to warehouse modeling, we have to mention two sets of classical theories:

Normal-form modeling

Inmon proposed the top-down, EDW-centered (EDW-DM) data warehouse architecture.

Dimensional modeling

Kimball proposed the bus-based, bottom-up (DM-DW) data warehouse architecture.

In fact, whether you model or layer a data warehouse, the goal is to better organize, manage, and maintain the data, and in real development the two methods get blended. There are also other approaches, such as the Data Vault model and the Anchor model, which we haven't applied for now.

Dimensional modeling usually means the star schema and the snowflake schema; the star schema is very convenient for OLAP analysis.

Data warehouse layering

At its simplest, ODS + DM is enough: synchronize all the data, then directly develop some application-layer reports. When the DM layer grows and you want to reuse logic, you split out a common layer and end up with a three-layer architecture. I recently read Alibaba's book, The Road to Big Data, which has a lot of good data warehouse material; after referring to it, the layering we currently use is as follows:

Under this layering, our development focus is the DWD layer, the detail data layer: mainly wide tables storing detailed data. In the DWS layer we aggregate data along different dimensions. Strictly speaking, the DWS layer is the data mart layer, generally organized by subject and belonging to dimensional modeling. ADS is the application layer that outputs the various reports.

Indicator dictionary

As we said earlier, a data warehouse is a system and a construction process that integrates many methodologies; it is not a new technology. Here let's talk about the indicator system in the data warehouse. Indicators are not unique to data warehouses or data platforms; the concept appears in many scenarios.

The indicators we are talking about here are essentially KPIs (Key Performance Indicators).

A KPI (Key Performance Indicator) is a goal-based, quantitative management metric that measures process performance by setting, sampling, calculating, and analyzing key parameters at the input and output ends of an organization's internal processes. It is a tool for decomposing an enterprise's strategic objectives into operational work targets and is the basis of enterprise performance management. KPIs let a department head clarify the department's main responsibilities and, on that basis, define performance metrics for department staff.

The role of the data platform is to support analysis and decision-making and to keep an eye on how the business is running at all times. So how do we see how the company is doing? At the company level there are KPIs the company cares about most, such as daily active users, GMV, and order volume; different departments focus on different KPIs, such as new users or repeat purchasers. With KPIs we can evaluate a department's results, that is, its performance. This is also digital transformation: all management and performance expressed in numbers.

As far as the data platform is concerned, indicators are a kind of metadata, and maintaining and managing them is routine work. Below is a simple way to manage indicators: the indicator dictionary.

Indicator dictionary

An indicator dictionary is really just indicator management. Once there are many indicators, in order to share them and modify and maintain them in one place, we keep all indicators in an Excel sheet. Of course, Excel is not great for sharing and version control; if possible, build a simple indicator management system and combine it with data lineage, which makes tracking data flow much easier.

Indicator coding

To make searching and management easier, we define a set of codes for indicators.

Indicator types

  • Basic indicators: indicators that cannot be decomposed further and can be calculated directly, such as "order count" and "transaction volume"

  • Derived indicators: indicators calculated from a basic indicator along a particular dimension, such as "WeChat order count" and "Alipay order count"

  • Calculated indicators: indicators computed from several basic indicators that cannot be decomposed from a business perspective, such as "sell-out rate" and "repurchase rate"

Statistical caliber

The caliber is the most important part of an indicator: it defines exactly how the indicator is calculated. Once the caliber is unified, there is no ambiguity.

Indicator template

In addition to the points above, there are some basic attributes, such as the indicator name and its calculation formula, which together form the indicator template.
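As a sketch, one entry in such a dictionary might look like the following. All the field names and values here are illustrative assumptions, not a fixed standard:

```python
# One hypothetical indicator-dictionary entry. Field names and values
# are illustrative assumptions, not a standard schema.
indicator = {
    "code": "IDX_0001",                # indicator code, for search and reference
    "name": "wechat_order_count",      # indicator name, built from word roots
    "type": "derived",                 # basic / derived / calculated
    "formula": "count of paid orders where pay_channel = 'wechat'",
    "caliber": "paid orders only; refunds excluded",  # the statistical caliber
    "owner_dept": "trading",           # department that owns this KPI
}

def render_row(ind: dict) -> str:
    """Render one entry as a single line, e.g. for an Excel/markdown export."""
    return "| {code} | {name} | {type} | {owner_dept} |".format(**ind)

print(render_row(indicator))
```

Keeping entries in a structure like this (rather than free-form text) is what makes the later step of generating a shared dictionary, or loading it into a small management system, cheap.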

We also record a responsible department: which department maintains the indicator, and which department owns and is accountable for the KPI. And where there are metrics, there are dimensions; we'll tell the story of dimensions later.

Sorting out and managing indicators

At the beginning, sorting out indicators is a pain: to unify calibers you have to communicate and coordinate with different departments. There may also be redundant indicators, so you need to judge whether an indicator is really needed or can be replaced by another, and the relationships between indicators also need to be clarified.

And after the first version of the indicators is sorted out, you have to promote and maintain it, iterating and pushing continuously, so that all departments of the company look at problems from a unified perspective.

The most important dimension: the date dimension

The date dimension is the most commonly used dimension, and probably the first one initialized on a new platform. Here is a brief introduction to it.

What is the date dimension

In daily life, the generation of data is tied to dates: data is produced every minute and every second, and data analysis is inseparable from dates.

The date dimension is a fixed calendar, 365 days a year, every day of it; just open the calendar on your computer:

Some of these attributes can be solidified in advance: day of week, lunar date, year, month, day, holidays. Once solidified, we use them whenever we analyze.

Structure of the date dimension

The date dimension can contain as much date detail as possible so that it can be used directly during analysis, combined with the company's specific circumstances, such as a particular display format for dates.

  • Basic year / quarter / month / week information

  • Extended information

In addition to the basic date attributes above, there is some extra information that is often used.

There may also be lunar-calendar information (the lunar year), the start and end dates of the company's custom week, and so on; any date-related content can be added and maintained.

  • Dimension initialization

To initialize the data we can use Java, Python, or SQL; common date functions basically cover what we need. Initializing with SQL requires loop control statements; with MySQL, PostgreSQL, or Hive, combine it with a Shell or Python script.

Generally there is no need to initialize many years of data, as long as it covers the company's business data; holiday information needs to be maintained each year based on the schedule released by the State Council.
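As a rough sketch of such an initialization in Python (the field names are my own choices, and holiday flags would be joined in afterwards from the officially published calendar):

```python
from datetime import date, timedelta

def build_date_dim(start: date, end: date):
    """Generate one row per calendar day with commonly used attributes.
    Holiday flags are maintained separately and merged in later."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": d.strftime("%Y%m%d"),
            "year": d.year,
            "quarter": (d.month - 1) // 3 + 1,
            "month": d.month,
            "day": d.day,
            "day_of_week": d.isoweekday(),   # 1 = Monday … 7 = Sunday
            "is_weekend": d.isoweekday() >= 6,
        })
        d += timedelta(days=1)
    return rows

# initialize one year; in practice, cover the company's business history
dim = build_date_dim(date(2023, 1, 1), date(2023, 12, 31))
```

The output rows would then be bulk-loaded into the date dimension table once; it rarely needs regeneration.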

  • About the hours

We also analyze hourly data, but we usually don't put hours in the date table; instead we keep a separate hour dimension table and join the two when needed.

Naming conventions

Nothing can be accomplished without rules. When building the data platform, the data team must establish its various conventions first, the earlier the better, and keep checking that everyone follows them. Once everyone is allowed to improvise freely, unifying or refactoring later wastes a huge amount of labor and time. Remember: these are pitfalls.

Here I'll share some experience from my current company.

About projects

Generally speaking, a data warehouse is developed according to its layered model; some teams instead split by business line first and then layer within each line, developing separately. We use Alibaba Cloud's MaxCompute, the data platform Alibaba provides with a complete development environment; it is very convenient and saves the trouble of building our own. MaxCompute has the concept of a project. We originally intended to create projects directly from the layered-model design, but for various reasons switched to creating projects by business line. Think project naming through carefully: whatever you base it on, get it completely clear first, because once fixed it must not, and often cannot, be changed.

About word roots

I forget whether it is actually called a "word root", so I'll write it down first and confirm it in a book later. Word roots belong to data warehouse construction standards, under metadata management. Well, nowadays all of this counts as data governance.

Normally, complete data warehouse construction includes data governance, but these days "data warehouse" tends to mean data modeling, while "data governance" covers data standards and data management.

Back to our main character: word roots.

We all met word roots when learning English: the simplest, most fine-grained words. Here we mainly use them to standardize the mapping between Chinese and English. Part of our business involves shelf racks; the English word is rack, so rack becomes a root, and wherever it appears in table names, field names, and so on, we always write rack and nothing else. That is what roots are for: unifying names that express the same meaning. The indicator system has many "rate" indicators, which decompose into XXX + rate; rate is the root, so all such indicators are named XXX_rate. Roots can unify table names, field names, subject names, and more.

Table names

A table's name should tell you what it is: from the name you should be able to tell which business domain it belongs to, what it is for, and at what granularity its data is.

  • Regular tables

Regular tables are the tables we solidify and use formally, maintained and improved over a period of time. Convention: layer prefix [dwd|dws|ads|bi] _ business domain _ subject domain _ XXX _ granularity. Business domains and subject domains can be enumerated clearly with word roots and refined over time; granularity is mainly time granularity (day, week, month, year), also defined with roots.
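The convention can even be checked mechanically. A minimal sketch; the exact pattern is an assumption, so adapt it to your own root list and granularities:

```python
import re

# Illustrative pattern for the convention above:
# <layer>_<business domain>_<subject domain>_<description>_<granularity>
TABLE_NAME_RE = re.compile(
    r"^(dwd|dws|ads|bi)_[a-z0-9]+_[a-z0-9]+_[a-z0-9_]+_(day|week|month|year)$"
)

def check_table_name(name: str) -> bool:
    """Return True if the table name follows the naming convention."""
    return TABLE_NAME_RE.match(name) is not None
```

A check like this can run in code review or as a scheduled scan of the metadata, catching drift before it spreads.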

  • Intermediate tables

Intermediate tables generally appear inside a job: they hold a job's intermediate data, and their scope is limited to the current job execution. Once the job completes, the intermediate table's mission is done and it can be deleted (choose freely based on your company's scenario; some companies keep a few days of intermediate data for troubleshooting). Convention: mid_table_name_[0~9|dim], where table_name is the name of the job's target table; a job usually has only one target table. Including the target table name prevents naming conflicts when people improvise. For the suffix you can either pick meaningful names or simply use numbers; each has pros and cons, so choose carefully. Tables that fill in missing dimensions come up often, and I like to give those the dim suffix.

When creating an intermediate table, add a date or timestamp to the name if you want to keep historical intermediate data.

drop table if exists table_name;
create table table_name as xxx;
  • Temporary tables

Temporary tables are for ad-hoc testing: they hold data you just want to look at temporarily, are generally not used afterwards, and can be deleted at any time. Convention: tmp_xxx; anything after the tmp_ prefix is up to you. Note that tmp_ tables must not be used in production tasks; they are only for test verification.

  • Dimension tables

Dimension tables are descriptive tables abstracted from the underlying data. They can be abstracted automatically from base tables or maintained manually. Convention: dim_xxx; dimension tables start with dim_, followed by a description of the dimension, which is up to you.

  • Manual tables

Manual tables are maintained by hand: after manual initialization they generally do not change automatically, and subsequent changes are also made by hand. Manually maintained data is generally fairly fine-grained, so for now we put these tables uniformly in the DWD layer; if there are target values or other kinds of manual data, we layer them according to the actual situation. Convention: dwd_businessdomain_manual_xxx, with a special subject-domain root, manual, indicating a manually maintained table.

Indicator names

Indicator naming also refers to the word roots, to avoid ten people inventing ten different names for the same indicator.

Adapt the specifics to your company's situation, and put the conventions in place as early as possible.

Data governance

Broadly defined, data warehouse construction covers many things, including data governance, a long-term effort that runs through the whole project. Nowadays many people narrow "data warehouse" down to just data modeling.

Data governance involves a lot, and I haven't done it myself, so here is some material gathered from the Internet.

Why do data governance

As data volumes grow and data becomes an asset, we need to manage that data better and let its value show, which requires data governance. In fact, a series of problems we run into when building the data platform can be solved through data governance:

  • The data quality is getting worse and worse, and there is a serious lag in problem detection

  • Lack of data standards; different departments' standards are inconsistent

  • The impact of data changes on the downstream is unclear and the scope of impact cannot be determined

Data governance is a set of continuous-improvement management mechanisms, usually covering data architecture and organization, data models, policy and system formulation, technical tools, data standards, data quality, impact analysis, operational processes, and supervision and assessment processes.

In short, there are many processes and standards involved, including metadata management, master data management, and data quality.

We use data governance to solve the problems we encounter in the process of using data.

Here you can refer to: So-called Data Governance

About increments

Many beginners, or students who have never done ETL, misunderstand increments; especially when talking with business developers, their understanding of increments is often wrong too.

Let's start with what they think an increment is: "An increment? Just take the data in slices of time. Incremental sync means you just give me the data since last time instead of always doing a full sync." In principle this is right, but it isn't rigorous and will go wrong; let's look at it step by step.

1. What is an increment

Incremental is relative to full. Both arise in data synchronization scenarios, for example syncing a business system's data into the warehouse, or syncing warehouse data back to a business system. This is synchronization at the development level; it can also be done at the database level, which we won't cover here.

Full synchronization means syncing all the data every time: 100 rows, sync 100; 10,000 rows, sync 10,000; 100 million rows, sync 100 million. The problem is obvious: with small volumes, full sync is simple, convenient, and easy to run, but with large volumes, especially when historical data rarely changes, full synchronization wastes a lot of resources and time and seriously hurts efficiency.

When doing a full load, clear the target table first, either delete then insert, or use an overwrite insert:

delete from tmp_a;
insert into tmp_a xxx;

insert overwrite table tmp_a xxx;

The exact SQL syntax varies by database, but that's the gist, haha.

Remember to delete first or use an overwrite insert, otherwise the data will just keep piling up.

Select a few scenarios for incremental synchronization:

  • There’s a lot of data, and historical data doesn’t change very often

  • You just need incremental data

Incremental synchronization places some requirements on the table: it needs create_time and update_time fields, where create_time is when the record was created and update_time is when it was last updated. You take only the changed data (using update_time). Note that you also need a primary key, which is used to overwrite old versions of the data.

It also depends on the business scenario. Some records are never updated after creation, like event or flow data; such data can be taken incrementally as-is, with no deletions. But some data does get updated, and when a record that has already been synced changes, the warehouse side needs to pick up the change too.
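A minimal sketch of the extraction rule in Python (in practice this is just a WHERE clause on update_time in the sync tool; the records below are made up):

```python
from datetime import datetime

def extract_increment(rows, last_sync: datetime):
    """Pick rows created or updated since the last sync.
    Each row is assumed to carry create_time / update_time and a primary key."""
    return [r for r in rows if r["update_time"] > last_sync]

source = [
    {"id": 1, "create_time": datetime(2023, 3, 1), "update_time": datetime(2023, 3, 1)},
    {"id": 2, "create_time": datetime(2023, 3, 1), "update_time": datetime(2023, 3, 2)},  # updated on the 2nd
    {"id": 3, "create_time": datetime(2023, 3, 2), "update_time": datetime(2023, 3, 2)},  # new on the 2nd
]
delta = extract_increment(source, datetime(2023, 3, 1, 23, 59, 59))
```

Note that filtering on create_time alone would miss record 2: that is exactly the mistake described above.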

2. How do you do increments?

Incremental synchronization still needs an initialization, which is a full load.

Suppose we have a table like this:

create table tmp_a(
    id bigint,
    create_time datetime,
    update_time datetime
);

In offline scenarios, data synchronization runs when traffic is lowest, usually in the small hours. So most synchronization starts after midnight and syncs yesterday's data, the so-called T+1.

If the following four records are created on March 1, the warehouse syncs them in the early hours of March 2.

On the 2nd, one new record is added and one existing record is updated; by the increment rule, we get two records.

Once we have the delta data, we need to merge the delta data into our warehouse tables

New data can be inserted directly. For updated data, however, we need to update the original record, or delete it first and then insert. Before merging, we mark each record's state: "update" if it was updated, "insert" if it is new. By now it should be clear why we need a primary key: without one, how would you know whether a record has changed?

To use increments, you generally need two sets of tables: one storing the incremental data and one storing the complete, full data.
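The merge itself is an upsert keyed on the primary key. A sketch of the semantics, assuming the incoming version of a key should always win:

```python
def merge_increment(full_rows, delta_rows, key="id"):
    """Upsert: a delta row whose primary key already exists replaces the old
    version (the 'update' case); unseen keys are appended (the 'insert' case)."""
    merged = {r[key]: r for r in full_rows}
    for r in delta_rows:
        merged[r[key]] = r
    return list(merged.values())

# March 1 full data, plus the two-record increment from March 2
full = [{"id": 1, "status": "created"}, {"id": 2, "status": "created"}]
delta = [{"id": 2, "status": "paid"}, {"id": 3, "status": "created"}]
merged = merge_increment(full, delta)
```

Without a primary key there is nothing to dict on, which is the point made above.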

3. etl_insert_time

Whether incremental or full, I like to add a timestamp field indicating when the record was inserted into the warehouse. It is especially useful when comparing incremental data and troubleshooting data problems.

4. The synchronization mechanism of our company

As for us, a start-up, the data volume isn't large. We use Alibaba Cloud's tools, and at the beginning, for convenience, we synced everything in full. I just checked: the volume is a bit over 10 TB, much of it historical data.

Although we do full loads, in order to capture changes to records we use partitions (pt): each day is a full snapshot. With storage this cheap, it's a simple, crude, workable approach. When I first arrived I proposed doing increments, but it was rejected, and later no one picked it up, because there were too many tables and changing them would cost too much.

5. Hive-based increments

Hive is standard now. The incremental solutions above are still relational-database-minded; in Hive, with much more compute power, you can worry less about data volume, and several approaches have emerged (note that Hive does not support delete operations).

  • Sort (row_number)

We still fetch incremental data every day and insert it into a daily partition, so each partition holds that day's increment. As data changes, a record with the same primary key will of course appear in multiple partitions. To get the latest full version of the data, use row_number to sort by primary key and time and take the newest version of each key.
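In Python terms the idea looks like this (a sketch of the semantics, not the Hive SQL itself):

```python
def latest_version(rows, key="id", order_by="update_time"):
    """Emulates ROW_NUMBER() OVER (PARTITION BY key ORDER BY order_by DESC)
    followed by keeping row_number = 1: one newest row per primary key."""
    best = {}
    for r in rows:
        cur = best.get(r[key])
        if cur is None or r[order_by] > cur[order_by]:
            best[r[key]] = r
    return sorted(best.values(), key=lambda r: r[key])

# two daily partitions' worth of increments; id 1 appears in both
rows = [
    {"id": 1, "update_time": "2023-03-01", "status": "created"},
    {"id": 1, "update_time": "2023-03-02", "status": "paid"},    # newer version
    {"id": 2, "update_time": "2023-03-02", "status": "created"},
]
current = latest_version(rows)
```

In Hive the same thing is one window function over all partitions, which is why the extra compute makes this pattern cheap.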

  • full join

A full join associates the incremental data with the historical full data, and the latest full version is extracted from the result.

  • left join + union all

This is similar to the full join approach, but more elegant and rigorous. We also used this approach to do increments on Greenplum in the past.

6. Zip tables

Speaking of increments, zipper tables deserve a mention. They used to be common, but Internet companies rarely use them now; partitioning mostly covers the need. A zipper table records every change to the data, which makes it a little troublesome to process. I think I've written about this before, so I'll find it and paste it in.
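As a sketch of the mechanics (dates as strings, one tracked value per key; closing the old version on the change date is a convention choice here, as some teams close it on the previous day instead):

```python
OPEN_END = "9999-12-31"  # conventional end_date for the current (open) version

def apply_day(chain, day_rows, day):
    """One day's refresh of a zipper (SCD2) table.
    Each chain row: {"id", "value", "start_date", "end_date"}."""
    incoming = {r["id"]: r["value"] for r in day_rows}
    result, still_current = [], {}
    for row in chain:
        if row["end_date"] == OPEN_END:
            if row["id"] in incoming and incoming[row["id"]] != row["value"]:
                # value changed today: close the old version
                result.append({**row, "end_date": day})
            else:
                result.append(row)
                still_current[row["id"]] = row["value"]
        else:
            result.append(row)  # closed history never changes
    for rid, value in incoming.items():
        if still_current.get(rid) != value:
            # open a new version for new or changed records
            result.append({"id": rid, "value": value,
                           "start_date": day, "end_date": OPEN_END})
    return result

chain = [{"id": 1, "value": "a", "start_date": "2023-03-01", "end_date": OPEN_END}]
chain = apply_day(chain, [{"id": 1, "value": "b"}, {"id": 2, "value": "c"}],
                  "2023-03-02")
```

Querying "the data as of date D" is then just filtering start_date <= D < end_date, which is the whole appeal of the zipper table.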

Upstream and downstream agreements

Because of the nature and positioning of a data warehouse, it depends heavily on upstream business systems, and it has downstream systems of its own, so setting upstream and downstream standards and a change-notification mechanism is essential.

I feel like I've written about upstream and downstream before, but I can't find it, so I'll write it again.

The upstream

What follows is mainly based on small companies, like the start-up where I work now. At large, mature companies, all kinds of processes and fault-tolerance monitoring are already very complete, and what I say may not apply.

For a data warehouse, the most important thing is the data, and the main source of that data is the business systems: the company's various business data. So the warehouse has to keep syncing business-system data onto its own platform, which means that once an upstream business system changes, the warehouse has to change too; otherwise synchronization is likely to fail.

  • Table structure change

Upstream table structures change often: fields get added, modified, and deleted (unless a field is truly unused, it is usually just marked deprecated). Keep the table structures clearly documented: table names, field names, field types, and field comments all sorted out, with unused or deleted fields annotated. When the business changes or iterates frequently, it is easy to spend half a day writing code only to discover you used the wrong table or the wrong field, which is embarrassing.

Such changes can be handled manually: add or modify the fields in the warehouse's corresponding table, then modify the synchronization task. Better is to automate it: automatically monitor upstream table structure changes and, when they happen, automatically modify the warehouse table structure and the synchronization task.
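A toy sketch of such monitoring (a real implementation would read column metadata from information_schema or the platform's metadata API; the dict shapes here are assumptions):

```python
def diff_schema(upstream: dict, warehouse: dict):
    """Compare two {column: type} mappings and report what changed upstream."""
    added   = sorted(c for c in upstream if c not in warehouse)
    removed = sorted(c for c in warehouse if c not in upstream)
    retyped = sorted(c for c in upstream.keys() & warehouse.keys()
                     if upstream[c] != warehouse[c])
    return added, removed, retyped

upstream  = {"id": "bigint", "amount": "decimal(18,2)", "remark": "string"}
warehouse = {"id": "bigint", "amount": "bigint"}
added, removed, retyped = diff_schema(upstream, warehouse)
```

Run daily, a diff like this can raise an alert, or feed a script that issues the ALTER TABLE and updates the sync task.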

  • Enumerated values

Business systems have many constants used to identify states or types, and new values are added all the time. The warehouse processes these values, for example turning them into dimensions and translating them into the corresponding Chinese labels, but we don't actually know the mapping; only the business developers do. So it's best to have them maintain an enumeration table and let us sync that table.

  • create_time & update_time

Normally create_time is set when the record is inserted and never changes, but sometimes, ha, a developer will update it. And update_time should change whenever the record changes, but some developers don't update it……

So when doing incremental loads, be sure to agree with the developers on the definition and usage scenarios of these two fields.

  • is_delete & is_valid

In some scenarios data needs to be deleted. Generally it isn't deleted physically; instead a field marks it as logically deleted. Communicate well with the developers, use a fixed field, and make sure both sides understand it the same way, otherwise there will be many pitfalls later.

The downstream

Having covered upstream, let's talk about downstream. For the data warehouse, emails, reports, and visualization platforms are generally downstream, so when we refactor or optimize things in the warehouse, we need to notify them too.

The main thing is to maintain the warehouse model well: how each table is used, field descriptions, and so on.

Whatever we demand from our upstream, we should also deliver ourselves, because we are someone else's upstream.

Task comments

Now for comments: comments always inspire both love and hate.

Without comments, who knows what your code is for? From the code it looks like you are doing A, when what you really want is B. Code with comments may not let you rest easy either: the comment may describe the original version of the requirement, and after a few revisions the code has changed while the comment has not, so they no longer match, and who knows which one to trust?

Our data warehouse is built on Alibaba Cloud, using DataWorks as the offline tool; all the code lives there. So here is a brief description of our comment conventions for tasks on Alibaba Cloud.

-- @name p_dwd_rack_machine
-- @description Rack wide table
-- @target rack.dwd_rack_machine
-- @source owo_ods.kylin__machine_release_his
-- @source owo_ods.kylin__machine_device_his
-- @author yuguiyang 2017-12-25
-- @modify

@name: the task name; ours is generally p_<target table name>. A later DataWorks upgrade recommended making the task name the same as the table name.

@description: a description of the task and its main content.

@target: the target table name; a task generally outputs only one target table.

@source: the source tables, the underlying tables used in the task. This one could arguably be omitted, since lineage shows it directly and it is easy to forget to keep updated, but it is intuitive to have it here.

@author: the creator and creation date.

@modify: the change history: who changed it, when, and why. This is also available in version control, but it is a bit more intuitive here.
