Author: architecture black | WeChat public number: on the top of the architect
The background,
With the growth of the company’s business, if there are more than 10 million orders every day, there will be about 1 billion orders in 3 months. The previous database in the form of single database and single table can no longer meet the business needs, and the database transformation is imminent.
Ii. How to divide order data
We can divide order data into two types: hot data and cold data.
-
Hot data: order data within 3 months, high real-time query;
-
Cold data A: The order data from 3 to 12 months ago is not frequently queried.
-
Cold data B: The order data of one year ago will hardly be queried, with only occasional query requirements;
There may be a confusion here why cold data is divided into two categories, because according to the requirements of actual scenarios, users will not look at the data one year ago, if this part of data is stored in the DB, it will be very expensive, and it is not easy to maintain. In addition, if you really encounter individual users need to view the order information one year ago, you can let the user go offline data view.
The storage plan for the three types of data is as follows:
-
Hot data: use mysql for storage, of course, need to separate database and table;
-
Cold data A: For this kind of data can be stored in ES, and relatively fast query can be achieved by using the characteristics of search engines.
-
Cold data B: Data that is not frequently queried can be stored in Hive.
MySql > select * from database
3.1. Split by business
In the initial business phase, many applications adopt a centralized architecture in order to bring applications online and iterate quickly. However, with the expansion of the business system, the system board becomes more and more complex, more and more difficult to maintain, the development efficiency becomes lower and lower, and the consumption of resources becomes larger and larger, and the cost of improving the system performance through hardware will become higher.
Generally, a general e-commerce platform contains several modules, such as users, goods and orders. A simple way is to build four tables in the same library, as shown in the figure below:
However, as the business has grown, it has become increasingly difficult to maintain all the businesses in one library, so we recommend that different businesses be placed in different libraries, as shown in the following figure:
As can be seen from the figure, we put different businesses into different libraries, and dispersed all the original pressure from the same library into different libraries, thus improving the throughput of the system.
3.2 Sub-database and sub-table
We know that no matter how good it configuration every machine has its own physical limits, so that when we apply already reach or beyond single machine a limit of time, we only looking for other machines to help or to continue to upgrade our hardware, but still common solution by adding more machines to common pressure.
We also have to consider as our business logic grows, can our machines grow linearly to meet demand? Therefore, the use of database sub-database sub-table, can immediately improve the performance of the system, about why to use the database sub-database sub-table other reasons are not described here, mainly about the specific implementation strategy.
(1) Table splitting strategy
Let’s take an order table as an example. In an order table, the order ID is definitely not repeatable, so using this field as a Shard key is appropriate, as is the case with other tables. Suppose the order table has the following fields:
1create table order(2 order_id bigint(11) ,3
Assuming we estimate that a single library needs to allocate 100 tables to meet our business requirements, we can simply model which sub-table the order is in, for example, order_id % 100,
If I select order_ID from order_id, but I want to select order_id from order_id, I cannot locate the subtable. If I select order_ID from order_id, I need to locate the subtable according to the shard key. If you really want to query orders based on user_id, you should set the shard key to user_id and change the table rules accordingly to user_id % 100;
(1) Implementation strategy of branch library
Database partition table can solve the efficiency problem of data query when a single table has a large amount of data, but it cannot improve the efficiency of the concurrent operation of the database, because the essence of the partition table is still carried out on a database, which is easily limited by the IO performance of the database.
Therefore, how to evenly distribute the IO performance of the database, it is obvious that the database operation can solve the performance problem of a single database.
The implementation of branch library strategy is very similar to that of branch table strategy. The simplest one can be routed by taking modules.
Let’s take the order table as an example,
For example, order_id % database capacity,
If order_id is not an integer, hash can be used first.
For example, hash(order_id) % Database capacity
(3) Strategy of combining database and table
Database sub-table can solve the query performance of massive data in a single table, and sub-database can solve the concurrent access pressure of a single database. Sometimes, we need to consider these two problems at the same time, therefore, we need to separate the table operation of a single table, but also need to carry out a separate operation, in order to expand the concurrent processing capacity of the system and improve the query performance of a single table, that is, we use the separate database and table.
Order_id is a subtable that can be split into subtables by adding an intermediate variable. The formula is as follows:
Intermediate variable = shard key % (number of libraries * number of tables in a single library); 2 Library number = round (intermediate variable/single
For example, if there are 10 tables in the database and 100 tables in each database, and the user order_id = 1001, according to the above routing policy, we can get:
In this case, order_id = 1001 will be routed to table 2 of the first database (index 0 stands for 1, and so on).
Iii. Overall architecture design
From the figure, we divide the requests into read and write requests. Write requests are relatively simple and can be written into db according to the rules of database and table.
For the read request, we need to calculate the data query is hot or cold, general order_id generate rules as follows, “merchant location area code + timestamp, random number”, we can according to the calculated timestamp is hot or cold data query, (specific business needs specific treatment, of course, here no longer detail)
In addition, cold data in the architecture diagram refers to data generated 3 to 12 months ago. To query data generated one year ago, you are advised to query hive data offline.
In the figure, a scheduled Job is used to periodically migrate order data. Cold data needs to be migrated to ES and Hive respectively.