Introduction: How to design a storage policy

This article on storage planning is mainly about how to design a storage policy. In the era of relational databases, storage was expensive and treated as precious. In the era of big data, everyone suddenly becomes the U.S. government printing dollars: it feels like it can hoard as much as it wants, yet it keeps begging Congress to raise the debt ceiling. The result is the feeling that our lives are actually worse than in the relational-database days, or at least no less anxious and miserable. How is it that big data products, which count storage in PB, live worse than relational databases that live on GB?

Public cloud cost calculation

On the public cloud, in order to attract new users to learn and use the product, MaxCompute effectively lets you learn MaxCompute and DataWorks for 1 yuan; the cost of learning and trying it out is very low. Even a 10-yuan top-up on your account lets you enjoy these two products for a year. So let's take a look at what real user fees look like.

Package resources

Link: Package billing (package) – MaxCompute – Aliyun

The plan contains computing and storage resources, as shown in the following table.

Package price

Storage cost calculation

Let's make a simple calculation based on the table above. Take the package with 160 CU of compute and 150 TB of storage: if usage exceeds the package by 100 TB, the overage costs 0.019 yuan/GB/day x 1,000 GB/TB x 100 TB = 1,900 yuan per day, or about 57,000 yuan per month, which is more than the 35,000 yuan package price itself.
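As a sanity check, the same arithmetic can be run as a query (the prices are the package figures quoted above; this assumes a MaxCompute version that supports SELECT without FROM):

SELECT 0.019 * 1000 * 100 AS overage_yuan_per_day, -- 0.019 yuan/GB/day over 100 TB
       0.019 * 1000 * 100 * 30 AS overage_yuan_per_month; -- 57,000 yuan > 35,000 yuan package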

Causes of storage explosion

As the example above shows, even with sufficient computing resources, the growth of storage costs is unaffordable. Moreover, computing is strongly tied to business requirements: if there are no new business requirements, the amount of data processed each day does not change significantly, yet the batch system still generates a new day's worth of data every day.

Suppose we use MaxCompute to integrate the 100 GB database of a transactional system, with data processed at T+1. Two factors cause data storage to bloat.

1. Snapshot tables. A snapshot table means that each partition of the MaxCompute table is a full mirror of the corresponding table in the business system (see the sketch after this list). At a T+1 integration frequency, the 100 GB business system writes one snapshot into MaxCompute every day, so the storage requirement for one year is 365 x 100 GB = 36.5 TB.

2. Layered architecture. A typical data warehouse hierarchy has at least three layers: the mirror layer, the intermediate layer, and the application layer. Data is copied each time it is promoted to a new layer after landing; in the application layer especially, different applications copy the data again for their own personalized processing. Taking a rather conservative 1:1:1 ratio, the annual storage requirement grows to 36.5 TB x 3, roughly 100 TB.
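To make factor 1 concrete, here is a minimal T+1 snapshot load. The table names (src_orders as the source mirror, ods_orders_snapshot as the daily snapshot table) and the ${bizdate} scheduling parameter are hypothetical illustrations, not taken from the article:

CREATE TABLE IF NOT EXISTS ods_orders_snapshot (id BIGINT, amount DOUBLE) PARTITIONED BY (ds STRING);

-- Every daily run writes a FULL mirror of the source into a new partition,
-- so a row that never changes is still copied 365 times a year.
INSERT OVERWRITE TABLE ods_orders_snapshot PARTITION (ds='${bizdate}')
SELECT id, amount FROM src_orders;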

From the above calculation we get a very simple conversion (reasonable, though probably not precise): if MaxCompute integrates a 100 GB business system and computes at T+1 frequency, we need roughly 1,000 times the source system's storage per year (100 GB becomes 100 TB).

Storage design

From the above analysis, we know that the snapshot-table storage pattern is the biggest reason data exists in so many copies. A row that has not been updated in a year is still copied 365 times under snapshot storage. So why do data warehouse ETL processes make such heavy use of snapshot tables? In general, because the structure is simple and easy to use, which suits ETL processing.

Since snapshot tables cannot be eliminated, we need to think about how to design storage. Storage optimization has two directions: the first is compression, the second is cleanup.

Data cleanup

Data cleanup is relatively simple; the key is to remove unnecessary data in time. Cleanup methods include:

1. Decommissioned services. When a service goes offline, check whether its application-layer data tables are no longer in use and can be cleared.

2. Automatic recycling. Use the lifecycle setting to give partitions an expiration time; the system then recycles expired partitions automatically in the background.

ALTER TABLE table_name SET LIFECYCLE days;

ALTER TABLE table_name [PARTITION (partition_spec)] ENABLE|DISABLE LIFECYCLE;
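For example (the table name and dates are hypothetical; the partition-level DISABLE LIFECYCLE form follows the syntax above):

-- Recycle daily snapshot partitions 30 days after their last modification.
ALTER TABLE ods_orders_snapshot SET LIFECYCLE 30;

-- Exempt a month-end partition that the retention policy wants to keep.
ALTER TABLE ods_orders_snapshot PARTITION (ds='20170531') DISABLE LIFECYCLE;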

3. Proactive management. This mode is used to control the production environment, with different layers adopting different policies, for example storing the latest N days plus point-in-time snapshots at the end of each month. Proactive management uses custom scripts to periodically clear unneeded data partitions (see the sketch after this list). It can also set storage caps on the development environment or on particular project spaces, guiding developers to design and manage storage, and requiring that temporary tables be released as soon as they have served their purpose.

4. Duplicate storage. Some data tables are only intermediate result sets and do not need long-term storage. Developers can be advised to replace them with views or temporary tables, or to shorten unnecessary retention periods for historical data.
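As a sketch of item 3, a scheduled script could drop the daily partitions that fall outside the "month-end + latest N days" window (hypothetical table name and date):

-- 20170415 is neither a month-end snapshot nor within the latest N days, so drop it.
ALTER TABLE ods_orders_snapshot DROP IF EXISTS PARTITION (ds='20170415');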

Data cleanup deletes part of the partitioned data, which means that the snapshot state at certain points in time is erased from history. A T+1 batch records the last state of each row for every day, but with a retention policy of month-end + latest N days, the last state we can recover from older historical data is the state at month end.

How far historical data should be cleaned is therefore a question that requires business guidance. It is also recommended that historical data be retained at the lowest layer of the warehouse hierarchy.

For example, if the retention requirement is month-end + the latest 10 days, the annual storage requirement falls from 100 TB to 100 GB x (12 month-ends + 10 latest days) x 3 layers = 6.6 TB, about 1/15 of the original policy. The 150 TB included with the 160 CU package, which previously could hold the 100 GB business system's data for 1.5 years, can now hold it for over 20 years.

Data compression

Data compression means using the product's storage compression features, or using a zipper-table structure to compress storage.

Compress the current table

1. Archiving. It raises the storage ratio from 1:3 to 1:1.5, saving half of the physical space, and also adopts a compression algorithm with a higher compression ratio.

alter table my_log partition (ds='20170101') archive;

If the archive policy is enabled for historical data, storage is roughly halved: the annual storage requirement falls from 100 TB to 50 TB.

Back up to another table

1. Merge partitions. For very small tables, the backup storage of historical data can be optimized by merging all data in an interval into one partition (this mainly optimizes small files); see the sketch after this list.

2. Zipper tables. Zipper tables are the core data storage method that made traditional warehouses workable, and they can still be used in MaxCompute. The core of the zipper algorithm is to save each state change of each record as one row, instead of the N copies kept by a snapshot table, which can save enormous storage; a sketch is given after the link below.
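A minimal sketch of item 1, assuming a hypothetical small table my_small_table partitioned by day (ds) and a backup table my_small_table_bak partitioned by month:

-- Fold a month of daily partitions into one monthly backup partition (fewer, larger files).
INSERT OVERWRITE TABLE my_small_table_bak PARTITION (month='201705')
SELECT id, col1 FROM my_small_table WHERE ds BETWEEN '20170501' AND '20170531';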

If the zipper-table policy is enabled, the stored data can be considered essentially equal in size to the source system, apart from the copies produced by hierarchical replication. The annual storage requirement falls from 100 TB to 100 GB x 1 day x 3 layers = 0.3 TB, about 1/300 of the original policy. The 150 TB of the 160 CU package, which previously could hold the 100 GB business system's data for 1.5 years, could now hold it for 500 years. (This calculation is too idealistic, because ETL processes are mostly implemented with snapshot tables, so in practice you can only back up the historical data that is not needed at the moment.)

A backup table effectively has a structure that differs from the original and changes over time. Accessing historical data therefore requires a different approach from querying snapshot tables.

Zipper table link: Zipper table design based on MaxCompute – Ali Cloud developer community
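Here is a minimal zipper-table sketch, with hypothetical names not taken from the linked article: dw_orders_zipper keeps one row per primary-key state with start_date/end_date validity columns ('99991231' marks the currently open row), ods_delta holds the day's changed rows, and ${bizdate} is the scheduling parameter. The update writes to a staging table so the job does not overwrite its own input; the staging table can then be renamed into place.

-- Daily zipper update: close open rows whose key changed today, then append today's versions.
INSERT OVERWRITE TABLE dw_orders_zipper_new
SELECT z.id, z.amount, z.start_date,
       CASE WHEN z.end_date = '99991231' AND d.id IS NOT NULL
            THEN '${bizdate}' ELSE z.end_date END AS end_date
FROM dw_orders_zipper z
LEFT OUTER JOIN (SELECT id FROM ods_delta WHERE ds = '${bizdate}') d ON z.id = d.id
UNION ALL
SELECT id, amount, '${bizdate}' AS start_date, '99991231' AS end_date
FROM ods_delta WHERE ds = '${bizdate}';

-- Accessing history differs from a snapshot table: a point-in-time snapshot is reconstructed
-- by filtering on the validity interval instead of reading a partition.
SELECT id, amount FROM dw_orders_zipper
WHERE start_date <= '20170101' AND end_date > '20170101';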

Compression comparison

Snapshot table and zipper table

Zipper tables have lower and more reasonable storage overhead than snapshot tables.

The structure comparison between the snapshot table and the zipper table is as follows:

• In a zipper table, each primary key stores only one row for each state it takes on during the storage cycle; a row that never changes is stored once.

• In a snapshot table, each primary key stores one row at every point (every partition) of the storage cycle.

As the comparison shows, when snapshot tables are used, each MaxCompute partition stores a full copy of the data, whereas a zipper table stores roughly one copy in total. The storage size comparison is therefore about N:1, where N is the number of partitions in the snapshot table and 1 is the size of the latest snapshot partition.

Zipper table compression ratio

We measured the compression ratio on actual data for two kinds of tables, one with a low record growth rate and one with a high growth rate. By comparison, a month of data stored in a zipper table is only about twice the size of a single day's data.

Table A: Record low growth rate table

Date: 20170501-20170531

Days: 31 days

Table B: Record high growth rate table [transaction log/event table]

Date: 20170501-20170531

Days: 31 days

Compression ratio of archived storage

If the space in a project is tight and data must be deleted or compressed, consider MaxCompute's table archiving function, whose effect is to compress storage by about 50%. The archive function stores data as RAID files: instead of simply keeping three replicas, it keeps six data blocks plus three parity blocks, which raises the storage ratio from 1:3 to 1:1.5 and saves half of the physical space; a compression algorithm with a higher compression ratio is also adopted. The trade-off is that if a data block is corrupted or a machine fails, recovery takes longer than with the original method, and read performance also suffers. This function is therefore best used to compress cold data, for example very large log data that is seldom used after a certain period but must be stored for a long time.

The actual calculation results are as follows:

Conclusion

Through the above introduction, we have covered the common storage schemes: snapshot tables, zipper tables, storage compression, and historical data cleanup. You should now have a fairly clear idea of how to plan and design a storage policy.

Moreover, there is a certain exchange relationship between storage and computation: compression requires computing resources, and without computing resources the only remaining way to reduce storage is to delete data. Batch systems generally run at night, so compressed-storage tasks can be scheduled to run during the day.

This article is original content of Alibaba Cloud and may not be reproduced without permission.