One, foreword
In medium and large projects, once there is a large amount of data, partners should know that they should split the data. There are vertical and horizontal types.
Vertical split is relatively simple, that is, the original database, after a large amount of data, from the business perspective to split multiple libraries. As shown below, the order library and user library are separated separately.
The concept of horizontal splitting is to split a large amount of data in the same service horizontally.
In the figure above, the order data reaches 40 million. We also know that the recommended storage capacity of mysql single table is millions. If we do not process the data, mysql single table data is too large, which will lead to slow performance. Use scheme can refer to the data for horizontal split. Split 40 million data into 4 tables or more. Of course, can also be divided into libraries, and then divided into tables; Separate stress from the database hierarchy.
Two, separate database and table scheme
There are commonly used schemes in the scheme of sub-library and sub-table, such as hash and range schemes. The most important scheme is the routing algorithm, which stores the routing key according to the specified algorithm. Let’s introduce the characteristics of the two schemes.
1. Hash mod scheme
Before we design the system, we can first estimate the order quantity of these years, such as 40 million. We can accommodate 10 million for each table, or we can design 4 tables for storage.
So how do you actually route the storage? The hash scheme is to modulo the total number of sub-tables for the specified routing key (such as ID). In the figure above, modulo 4 for the order with ID =12, that is, to get 0, the order will be placed in table 0. If the order id=13 is modulo 1, it will be placed in table 1. The reason why I’m modulo 4 is because the total number of subtables is 4.
- Advantages:
The order data can be evenly distributed among the four tables so that there will be no hot spots when the order is processed.
Hotspot: Indicates that the operations on an order are concentrated in one table, and the operations on other tables are rare.
One of the characteristics of orders is the time attribute. When general users operate order data, they will focus on the orders generated during this period. If all orders generated during this period are in the same order table, then hot spots will form and the pressure on that table will be higher.
- Disadvantages:
Future data migration and expansion will be difficult.
For example: The business is developing very well and the order volume is too large to exceed 40 million, so we need to increase the number of sub-tables. If we add four tables
Once we increase the total number of sub-tables, the base of taking modules will become 8. The previous order with ID =12 will be queried in table 4 according to this scheme, but the previous order was in table 0, so the data cannot be found. Because the cardinality of the modulo has changed.
Faced with this situation, our partners came up with a plan to do data migration, remaking a hash scheme for the previous 40 million data and putting it into a new planning sub-table. So we’re going to do data migration. This is a painful thing. Some small companies can accept migration outages at night, but large companies do not allow downtime for data migration.
Of course, data migration can be combined with the business of your own company to make a tool, but it also brings a lot of work, and every time you expand the capacity, you have to do data migration
Is there a solution that does not require data migration? Let’s look at the following solution
2, range of solutions
The range scheme splits data by range.
Range scheme is relatively simple, that is, the orders within a certain range are stored in a table; As shown in the figure above, id=12 is placed in table 0, and id=13 million is placed in table 1. The design of this scheme is to design the scope of the table. Routes are stored by ID.
- advantages
Our friends think about whether this plan is conducive to future expansion, without data migration. Even if 4 more tables are added, the range of the previous 4 tables does not need to be changed. The id=12 is still in table 0, and the ID =13 million is still in table 1. The range of the 4 new tables must be greater than the range after 40 million.
- disadvantages
A hot topic, we think about it, because the value of id has been increasing, the orders for this period of time is will always be in a table, such as id = 10 million ~ between id = 20 million, this time the order is will focus on the table, the result is 1 table overheating, pressure is too big, no pressure while the rest of the table.
3. Summary:
Hash mode: No hotspot problems exist, but data migration is painful
Range solution: Does not need to migrate data, but has hot issues.
What plan can achieve the advantages of the combination of the two? , that is, do not need to migrate data, and can solve the problem of data hotspot?
In fact, there is a realistic requirement, can the server performance and storage, can be adjusted properly evenly?
Third, the scheme idea
Hash can solve the problem of data uniformity, and range can solve the problem of data migration, so can we combine the two? What about using the properties of both?
Let’s consider that the expansion of data means that the value of routing key (such as ID) becomes larger, which is certain. Then, when the data becomes larger, we first use the range scheme to make the data fall into a range. In this way, when the ID becomes larger, the previous data does not need to be migrated.
But also to consider the data uniformity, that can be within a certain range of data uniformity? Because each time we expand the capacity, we will design the scope of expansion in advance, so we just need to ensure that the data within the scope is even.
Iv. Program design
Let’s first define a group concept, which contains some sub-libraries and sub-tables, as shown in the figure below
Here are a few key points:
1) If id=0 ~ 40 million, it must fall into group01
2) group01 has 3 DB’s, how does one ID route to which DB?
3) Locate DB according to the modulus of the hash, what is the modulus? The modulus should be the number of tables in the DB of all the group. The total number of tables in the figure above is 10. Why go to table totals? Instead of the total DB of 3?
If id=1, id%10=2; So if it’s 2, which DB library does it fall into? This is the design is pre-set, how is it set?
5) Once the design locates which DB, we need to determine which table falls in the DB?
5. Core main process
Following the process above, we can locate an ID based on this rule, and we can see if hot spots are avoided.
We look at the id within [0100] 00000, according to the above process design, the id is uniform distribution to within 10 million DB_0, DB_1, DB_2 Table_0 of the three database tables, why can even, because we use the hash scheme, modulus of 10.
Why mod the total number of tables, 10, but not DB, 3? DB_0 = 4 tables and DB_1 = 3 tables.
When we arrange servers, some servers with high performance and high storage can be arranged to store more data, while some servers with low performance can put less data. If we take the modulo according to the total number of DB 3, it means that [0, 40 million] data is evenly distributed among 3 DB, which can not be properly allocated according to the server capacity.
According to the total number of Table 10, it can be achieved. How to achieve it
If the value is 0,1,2,3, route to DB_0, [4,5,6] to DB_1, [7,8,9] to DB_2. In this way, more data can be placed in DB_0, and the other two DB data can be less. DB_0 is responsible for 4/10 of the data volume, DB_1 for 3/10 of the data volume, and DB_2 for 3/10 of the data volume. Group01 as a whole bears the amount of [0, 40 million] data.
Note: do not be confused by DB_1 or DB_2 table range is 0 ~ 40 million, this is the range range, that is, id in which range, which table.
The above section addresses the problem of hot spots and how data volumes can be allocated according to server metrics.
Vi. How to expand capacity
In fact, the above design idea is understood, capacity expansion has come out; Create a group02 group and define the data range of the group02 group.
Since it is a newly added group01 group, there is no concept of data migration. It is completely a newly added group group, and this group group still prevents hot spots, i.e. [40 million, 55 million] data, evenly distributed into the table_0 table of three DB. [55 million ~ 70 million] data is evenly distributed in Table TABLE_1.
Vii. System design
The idea is determined, the design is relatively simple, three tables, the group, DB, table to establish a good association between the relationship on the line.
Group and DB
Table and DB
The above table association is actually relatively simple, as long as the principle of thinking straightened out, it is OK. Instead of querying three associated tables at a time, you can save them in the cache (local JVM cache), which will not affect performance.
Once the capacity needs to be expanded, do you need to add the group02 association? Will the application service need to be restarted?
To make it easier, just configure it in the morning and restart the application service. But if it is a large company, it is not allowed, because there are orders in the early morning. So what to do? How does the local cache update?
So far, the overall program introduction is over, I hope to be helpful to my friends. Thanks!!