Introduction
ShardingSphere is an excellent set of open source distributed database middleware solutions that covers sharding across databases and tables, and it is one of the options worth considering. For more of its strengths, see the official documentation.
Purpose
Needing to shard across databases and tables means the volume and growth rate of the data have exceeded what a single database instance can bear, and upgrading hardware becomes more and more expensive. By spreading the data across multiple database servers, the pressure on any single database and single table drops, and query response times improve.
The choice of sharding strategy determines whether capacity can be expanded quickly later. Take the common approach of using the shard key modulo the number of databases: the algorithm fixes how many databases and tables must be prepared up front. Because future business growth is hard to estimate, preparing too many databases and tables wastes resources, while preparing too few means expanding later and re-sharding the data, a frightening plan that nobody dares carry out without thorough preparation and rehearsal, while the growing data volume keeps everyone's nerves on edge. So, is there an approach that makes database and table expansion as smooth and cheap as possible?
Modulo sharding algorithm
Take the modulo (mod) algorithm for ShardingSphere data sharding as an example. In the early stage we planned four databases, and rows were written to DB_0 through DB_3 according to id mod 4:
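As a minimal sketch (plain Java rather than an actual ShardingSphere configuration; the class and method names are illustrative), mod-4 routing looks like this:

```java
// Minimal illustration of plain modulo sharding across four databases (DB_0..DB_3).
public final class ModShardingDemo {

    private static final int DATABASE_COUNT = 4;

    // Route an id to a database name by taking id mod 4.
    static String routeDatabase(long id) {
        return "DB_" + (id % DATABASE_COUNT);
    }

    public static void main(String[] args) {
        // 7 -> DB_3, 8 -> DB_0, 9 -> DB_1, 10 -> DB_2
        for (long id = 7; id <= 10; id++) {
            System.out.println(id + " -> " + routeDatabase(id));
        }
    }
}
```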
The business grows quickly, the number of users climbs, single-table data volume keeps increasing, and more databases are needed. At this point a very serious problem appears: once databases are added, the modulo result changes, and historical data has to be migrated to match the new modulus. A barrage of challenges arrives all at once: the migration cannot be paused and must not disrupt normal business, every operation against the database during the migration is nerve-racking (even after all kinds of rehearsals, you still worry about the unexpected), and any failure is immediately a P0 incident.
When two databases are added, the modulo result changes:
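For example, a row with id = 10 sits in DB_2 under mod 4 (10 % 4 = 2), but once two databases are added it belongs in DB_4 under mod 6 (10 % 6 = 4); most existing rows are affected the same way and have to be moved.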
How to safely and smoothly expand capacity at low cost
Now the problem is clear: when expanding the database, the most critical factor is the choice of sharding algorithm, so our entry point is how to design the data sharding algorithm. Among ShardingSphere's data sharding algorithms, the precise sharding algorithm lets us control exactly which database and which table a row ends up in.
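As a rough sketch, assuming the standard sharding SPI of ShardingSphere 4.x (package and interface names differ between versions), a database-level precise sharding algorithm might look like the following; the mod-based mapping and the data source names are illustrative:

```java
import java.util.Collection;

import org.apache.shardingsphere.api.sharding.standard.PreciseShardingAlgorithm;
import org.apache.shardingsphere.api.sharding.standard.PreciseShardingValue;

// Database-level precise sharding: pick the data source whose suffix matches id mod N.
public class ModDatabaseShardingAlgorithm implements PreciseShardingAlgorithm<Long> {

    @Override
    public String doSharding(Collection<String> availableTargetNames, PreciseShardingValue<Long> shardingValue) {
        // availableTargetNames holds the configured data source names, e.g. DB_0..DB_3.
        long suffix = shardingValue.getValue() % availableTargetNames.size();
        for (String targetName : availableTargetNames) {
            if (targetName.endsWith(String.valueOf(suffix))) {
                return targetName;
            }
        }
        throw new UnsupportedOperationException("no target for value " + shardingValue.getValue());
    }
}
```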
Range sharding
The first version of the sharding algorithm design was range sharding: map each segment of the shard key to a corresponding database, for example IDs 1 to 1,000,000 fall into DB_0, 1,000,001 to 2,000,000 fall into DB_1, and so on for the rest of the ID ranges.
This approach expands perfectly, but it creates a hot-write problem: new IDs always fall into the most recently opened range, so one database absorbs nearly all of the writes.
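A minimal sketch of that range routing, with the range size and database names as assumptions:

```java
// Illustrative range routing: each block of 1,000,000 ids maps to one database.
public final class RangeShardingDemo {

    private static final long RANGE_SIZE = 1_000_000L;
    private static final int DATABASE_COUNT = 4;

    // ids 1..1,000,000 -> DB_0, 1,000,001..2,000,000 -> DB_1, and so on.
    static String routeDatabase(long id) {
        long index = (id - 1) / RANGE_SIZE;
        if (index >= DATABASE_COUNT) {
            throw new IllegalArgumentException("id " + id + " is beyond the planned ranges");
        }
        return "DB_" + index;
    }
}
```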
Custom shard key generation strategy
Back to the shard key. To solve the hot-write problem we still have to balance writes across the databases as evenly as possible to relieve the hotspot pressure, so we start from the shard key generation strategy. Since the precise sharding algorithm is used, we can control exactly which database and which table each row lands in. The shard key is designed as follows:
We want the ID to carry some business meaning: a fixed business prefix of 4 characters, followed by the database bit and the table bit, then the time and an increment (a bare increment carries some risk of exposing data growth; combined with the database and table bits, later keys will not repeat). In the example, the row is stored in table 03 of database 01. Taking 4 databases and 4 tables (4 databases, 4 tables per database) as an example, shard keys are generated as:
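A possible generator, under the assumption that the key layout is prefix (4 characters) + database bit (2 digits) + table bit (2 digits) + timestamp + sequence; the random slot selection here is a placeholder that the weighted ring described later replaces:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical shard-key generator: business prefix + database bit + table bit + timestamp + sequence.
public final class ShardKeyGenerator {

    private static final DateTimeFormatter TIME_FORMAT = DateTimeFormatter.ofPattern("yyMMddHHmmss");
    private static final int DATABASE_COUNT = 4;
    private static final int TABLE_COUNT = 4;

    private final AtomicLong sequence = new AtomicLong();

    public String nextKey(String businessPrefix) {
        // Pick the database/table bits; random here, replaced by the weighted ring later in the article.
        int db = ThreadLocalRandom.current().nextInt(DATABASE_COUNT) + 1;   // 01..04
        int table = ThreadLocalRandom.current().nextInt(TABLE_COUNT) + 1;   // 01..04
        String time = LocalDateTime.now().format(TIME_FORMAT);
        long seq = sequence.incrementAndGet() % 10_000;                     // 4-digit increment
        // e.g. "ORDR" + "01" + "03" + "240101123000" + "0042" -> database 01, table 03
        return String.format("%s%02d%02d%s%04d", businessPrefix, db, table, time, seq);
    }
}
```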
Limit the generation range of the database/table bits to 0101 through 0404 so that all databases and tables are covered. When we need to expand and add databases, widening the generation range is enough. One thing to note: the number of shard tables created in a newly added database must stay consistent with the existing ones (for example, if every database has 4 shard tables, adding tables requires all databases to be brought up to the same count). Here we add two new databases as an example to show the changes:
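A small sketch of routing by parsing the database and table bits back out of the key, assuming the layout above; widening the generation range to, say, 0101-0604 after adding two databases requires no change here:

```java
// Hypothetical routing helpers that read the database/table bits back out of the key,
// assuming the layout sketched above (4-character prefix, then two digits each for database and table).
public final class ShardKeyRouter {

    // "ORDR0103..." -> database bit "01" -> data source name "DB_1" (names are illustrative)
    static String routeDatabase(String shardKey) {
        int db = Integer.parseInt(shardKey.substring(4, 6));
        return "DB_" + db;
    }

    // "ORDR0103..." -> table bit "03" -> physical table "t_order_3" (logical name is illustrative)
    static String routeTable(String shardKey, String logicTable) {
        int table = Integer.parseInt(shardKey.substring(6, 8));
        return logicTable + "_" + table;
    }
}
```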
This lets us expand databases and tables smoothly at minimal cost.
Uneven data distribution
Following the logic above, the database expansion problem is solved, but a new one appears: as shard keys keep being generated, data is still scattered evenly across every database and table. The reason we expand in the first place is that the data volume of the original databases and tables has reached the planned expansion threshold, so at that point we want to reduce the write weight of the original databases. My idea is to adjust the shard key generation weights dynamically: count the amount of data in each table of each database, then use that to adjust and re-initialize the shard key ring.
We planned 3,000,000 rows per table (the actual limit depends on the evaluated capacity of the database server). Once a table passes a planned threshold, we start to lower its write weight; a sketch of this weight schedule follows the list.
- First threshold, 60% (1,800,000 rows): reduce the table's write weight.
- Second threshold, 70% (2,100,000 rows): lower the write weight again, evaluate the data volume of all databases and tables, and start the database expansion.
- Third threshold, 85% (2,550,000 rows): stop writing to the table. "Stop writing" is not absolute; the plan deliberately reserves some headroom for fault tolerance. It is best to monitor writes once this threshold is reached, raise an alarm, and adjust the shard key generation strategy in time so the data never exceeds the planned maximum.
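A sketch of that schedule as code; the planned row count comes from the text above, while the concrete weight values (3/2/1/0) are assumptions:

```java
// Illustrative mapping from a table's current row count to its write weight,
// based on the thresholds above (planned 3,000,000 rows per table; the weight
// values themselves are assumptions).
public final class WriteWeightPolicy {

    private static final long PLANNED_ROWS = 3_000_000L;

    static int writeWeight(long rowCount) {
        if (rowCount >= PLANNED_ROWS * 85 / 100) {
            return 0;   // third threshold: stop writing, keep headroom for safety
        }
        if (rowCount >= PLANNED_ROWS * 70 / 100) {
            return 1;   // second threshold: lower the weight again, start planning expansion
        }
        if (rowCount >= PLANNED_ROWS * 60 / 100) {
            return 2;   // first threshold: reduce the write weight
        }
        return 3;       // below all thresholds: full weight
    }
}
```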
Initializing the ring
With the three thresholds planned, we initialize the write weights and generate an array of database/table slots, which we then place into a ring structure.
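A minimal sketch of such a ring, assuming each database/table slot is repeated in the ring according to its current write weight (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// A minimal weighted ring: each database/table slot appears as many times as its
// current write weight, and keys are generated by walking the ring.
public final class ShardSlotRing {

    private volatile List<String> ring = new ArrayList<>();   // entries like "0103" (database 01, table 03)
    private final AtomicLong cursor = new AtomicLong();

    // Rebuild the ring from the latest weights, e.g. {"0101": 3, "0103": 2, ...}.
    public void reinitialize(Map<String, Integer> slotWeights) {
        List<String> next = new ArrayList<>();
        slotWeights.forEach((slot, weight) -> {
            for (int i = 0; i < weight; i++) {
                next.add(slot);
            }
        });
        ring = next;
    }

    // Return the next database/table slot; a table at the third threshold never appears here.
    public String nextSlot() {
        List<String> current = ring;
        if (current.isEmpty()) {
            throw new IllegalStateException("ring has not been initialized");
        }
        int index = (int) (cursor.getAndIncrement() % current.size());
        return current.get(index);
    }
}
```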
A table reaches the first threshold
After the statistics task finishes, table 01 of database 01 has reached the first threshold, so one of its weights is removed when the data ring is re-initialized; another weight is removed when it reaches the second threshold.
A table stops accepting writes
When a table reaches the third threshold, no more data is written to it, and the ring is re-initialized with that table's write weight removed entirely.
Adding a Database
After adding a database, the ring has to be re-initialized so that new data is written to the newly added database. You can trigger the re-initialization manually, or, if the scheduled task interval is short, simply wait for the next task run to re-initialize it.
Weight adjustment strategy
The weight adjustment above is driven by whether the amount of data in a table has reached a threshold. Note that the adjustment runs as a scheduled task; the interval can be chosen based on how fast the data grows.
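A sketch of such a scheduled task, reusing the ShardSlotRing and WriteWeightPolicy sketches above; the 30-minute interval and the row-count query are placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduled job: count the rows of every physical table, derive write
// weights from the thresholds, and rebuild the ring.
public final class WeightAdjustJob {

    private final ShardSlotRing ring;

    public WeightAdjustJob(ShardSlotRing ring) {
        this.ring = ring;
    }

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // The 30-minute interval is a placeholder; pick it based on the observed data growth.
        scheduler.scheduleAtFixedRate(this::adjust, 0, 30, TimeUnit.MINUTES);
    }

    private void adjust() {
        Map<String, Integer> weights = new HashMap<>();
        for (Map.Entry<String, Long> entry : countRowsPerSlot().entrySet()) {
            weights.put(entry.getKey(), WriteWeightPolicy.writeWeight(entry.getValue()));
        }
        ring.reinitialize(weights);
    }

    // Placeholder: run SELECT COUNT(*) (or read table statistics) for each database/table slot.
    private Map<String, Long> countRowsPerSlot() {
        return new HashMap<>();
    }
}
```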
Limitations of the scheme
Although this scheme handles expansion and adding databases well, it has some limitations. First, it depends on how the data shard key is generated, so it is hard to apply on top of existing data (associating old rows with a new key is not a small adjustment). Second, the generation strategy for the database/table bits has its own complexity. I will not go into details here, but there are many options: with the help of third-party tools such as ZooKeeper or Redis, you can build a sequence generation service that is reasonably efficient and still guarantees uniqueness.
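For example, a minimal Redis-backed sequence sketch using the Jedis client (host, port, and key naming are assumptions); Redis INCR is atomic, which is what provides the uniqueness:

```java
import redis.clients.jedis.Jedis;

// A minimal Redis-backed sequence: INCR is atomic, so each call returns a unique,
// monotonically increasing number per key. Host, port, and key naming are assumptions.
public final class RedisSequenceService {

    public long nextSequence(String bizKey) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            return jedis.incr("seq:" + bizKey);
        }
    }
}
```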
If this is a new line of business and the data growth over the next year or two cannot be predicted, and you are considering sharding across databases and tables, this solution is worth a look to reduce the pain of adding databases and tables later. Further down the road, if there is enough operations capacity to support TiDB, the problems brought by sharding can be swept away: operating on what looks like a single table again feels great, so don't overthink it, just go for it (though TiDB has plenty of its own pitfalls).
Conclusion
At the beginning, when we considered sharding across databases and tables, we could not answer the three questions from operations and architecture (Can you estimate the growth? Do you really need that many? What about your previous project XXXX?). The investment in hardware also had to be weighed (I too would love an 8-core 32G server to paper over the code we never had time to optimize), so the principle was to use only as much as needed. The real goal was to reduce the risk of later upgrades as much as possible: with enough preparation up front it felt like we could cover the growth of the next three to five years, but business development is hard to predict, and watching the data volume soar day by day meant an aching liver and another handful of hair lost…
A newcomer here; likes and follows are much appreciated.
A backend developer, and an old hand at this.