Background

I recently published two articles on table sharding:

  • A discussion of the pitfalls we hit when sharding tables
  • A few things to watch out for after sharding tables

As you can see from the titles, at that point we had only sharded tables. Since then, business growth has pushed us to split databases as well. The whole thing went fairly smoothly, so I am writing down a retrospective while it is still fresh in my mind.

First, let's review the overall process of splitting databases and sharding tables:

The whole process is easy to understand and matches the path most companies end up taking.

Few systems are designed with database and table sharding from day one. Doing so would reduce the pitfalls later, but in the early days most companies are focused on the business itself.

Only when the business grows to the point where a single table can no longer cope does it become natural to consider sharding tables, and even splitting databases.

This article is a summary, so some points from the earlier articles may be repeated.

Sharding tables

First of all, when is it appropriate to shard a table?

In my experience, it is when a single table has reached tens of millions or even hundreds of millions of rows, and the data is still growing by more than 2% a day.

Of course, these numbers are not absolute. What matters most is whether writes and queries against the table are already hurting normal business operation, for example query latency rising noticeably or the database's overall I/O staying high.

And when we talk about sharding here, we mean horizontal sharding:

That is, you take the data of one large table and use a routing algorithm to distribute it as evenly as possible across N smaller tables.

Range

There are several sharding strategies, each suited to different scenarios.

The first is to split by range. For example, we can split on a record's creation time and store each month in its own monthly table. We can also split the primary key by range, say IDs 1 to 10000 in one table, 10001 to 20000 in the next, and so on.

This kind of split is well suited to data that needs archiving. For example, if the system only offers queries over the last three months of history by default, operations also become easy: just move the data older than three months out and back it up separately.

This scheme has both advantages and disadvantages:

  • The advantage is that it scales horizontally with little intervention; as time goes on you simply add new tables.
  • The downside is that the data may be unevenly distributed (for example, a spike in requests in a particular month).
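
To make the range strategy concrete, here is a minimal sketch of monthly and primary-key-range routing in Java. The table-name patterns (order_yyyyMM, order_N) and the helper names are assumptions for illustration only, not the routing code from our system.

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    public class RangeRouter {

        // Assumed naming convention: one table per month, e.g. order_202301.
        private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyyMM");

        // Route a record to its monthly table based on its creation date.
        public static String monthlyTable(LocalDate createdAt) {
            return "order_" + createdAt.format(MONTH);
        }

        // Route by primary-key range: ids 1-10000 -> order_0, 10001-20000 -> order_1, ...
        public static String rangeTable(long id, long rangeSize) {
            return "order_" + ((id - 1) / rangeSize);
        }

        public static void main(String[] args) {
            System.out.println(monthlyTable(LocalDate.of(2023, 1, 15))); // order_202301
            System.out.println(rangeTable(10001, 10000));                // order_1
        }
    }

Archiving then amounts to moving or dropping the tables for the months that fall outside the retention window.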

Hash

Sharding by date is simple, but its applicability is narrow; after all, most of our queries do not want to carry a time condition.

For example, a user wanting to look up all the orders they have ever placed is a very common requirement.

So we have to shard along a different dimension, and the sharding algorithm can be the mainstream hash + mod combination.

This is a classic approach; the well-known HashMap stores data the same way.

Suppose we split the order data from the original large table into 64 shards:

The hash step hashes the field we shard on, so that the resulting values are as uniform and collision-free as possible.

Of course, if the field is itself a non-repeating integer, you can skip the hash and take the modulo directly to get the shard index.
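
As a minimal sketch of the hash + mod routing described above, assuming 64 shards and an order_N naming pattern (the class and method names are illustrative, not our production code):

    public class HashRouter {

        private static final int TABLE_COUNT = 64; // number of shards chosen up front

        // Spread the key's hashCode the way HashMap does, so weak hash codes still distribute evenly.
        private static int hash(Object shardKey) {
            int h = shardKey.hashCode();
            return h ^ (h >>> 16);
        }

        // Route a shard key (e.g. a user id) to a table index in [0, 63].
        public static int tableIndex(Object shardKey) {
            return Math.abs(hash(shardKey) % TABLE_COUNT);
        }

        public static String tableName(Object shardKey) {
            return "order_" + tableIndex(shardKey);
        }

        public static void main(String[] args) {
            System.out.println(tableName(123456789L)); // the same key always lands in the same shard
            System.out.println(tableName("user-42"));
        }
    }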

Choosing the number of shards

As for the number of shards (64 here), there is no standard value; it has to be estimated from your own business growth and expected data increment.

In my experience, you should at least ensure that, over the next few years of business growth, no individual shard will again accumulate an oversized amount of data (say, tens of millions of rows).

My preference is to create as many shards as the database can reasonably accommodate, because expanding them again once the smaller tables hit a bottleneck would be painful.

I have not been through that expansion myself, so I will not cover it here.

The number is not arbitrary, though. As with HashMap, a power of two (2^n) is recommended so that less data has to move when you scale up.
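
The benefit of a power-of-two count can be shown in a few lines: the modulo becomes a cheap bit mask, and when the count doubles, each key either stays where it is or moves to exactly one new position, which is the same property HashMap exploits when it resizes.

    public class PowerOfTwoSharding {

        public static void main(String[] args) {
            int oldSize = 64;
            int newSize = 128;          // doubling a power-of-two shard count
            int hash = 0x9E3779B9;      // any example hash value

            // With a power-of-two count, "mod" is just a bit mask.
            int index = hash & (oldSize - 1);
            System.out.println(index == Math.floorMod(hash, oldSize)); // true

            // After doubling, a key either keeps its index or moves to index + oldSize,
            // so at most half of the data has to move.
            int newIndex = hash & (newSize - 1);
            System.out.println(newIndex == index || newIndex == index + oldSize); // true
        }
    }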

Range + Hash

There is, of course, another line of thinking: can Range and Hash be combined?

For example, suppose we started with hash sharding, but data growth was so rapid that every shard quickly hit a bottleneck and we had to expand from 64 tables to 256.

But migrating data without downtime during such an expansion is hard, and even with downtime, how long would it take? Nobody can say.

So could we further split each hash + mod shard into monthly tables? Thanks to the natural extensibility of Range, we would not have to worry about data migration later.

It works in theory, but I have not used it in practice; take it only as an idea.
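
Since this is only an idea, here is an equally tentative sketch of what the combined routing might look like: route by hash + mod first, then append a monthly suffix so new months create new tables instead of forcing a resharding. All names here are hypothetical.

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    public class RangeHashRouter {

        private static final int TABLE_COUNT = 64;
        private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyyMM");

        // e.g. userId 123456789 created in January 2023 -> order_21_202301
        public static String tableName(long userId, LocalDate createdAt) {
            long index = userId % TABLE_COUNT;                        // hash/mod dimension
            return "order_" + index + "_" + createdAt.format(MONTH);  // range (month) dimension
        }

        public static void main(String[] args) {
            System.out.println(tableName(123456789L, LocalDate.of(2023, 1, 15)));
        }
    }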

Annoying data migration

Once the routing rules are settled, the first step of sharding is done. The real trouble is data migration, or rather, how to migrate the data with the least possible impact on the business.

Unless you sharded from the very beginning, the data migration step is unavoidable.

Here is our current practice for your reference:

  1. Once the sharded tables go live, all writes and queries go against them, so the data in the original large table has to be migrated over; otherwise the business would be severely affected.
  2. We estimated that migrating a table of roughly 200 million rows with a home-grown migration program would take about 4 to 5 days.
  3. During that window the historical data would be invisible to users, which the business obviously could not accept.
  4. So we built in a compatibility layer: after the sharding change went live, all new data was written to the shards, while operations on historical data still went to the old table, so no immediate migration was needed.
  5. A routing decision is therefore made before every data operation (see the sketch after this list). Once enough new data has accumulated (two months in our case), almost all operations hit the shards, and the migration of the old data is then run off a read replica.
  6. Finally, all data is read from and written to the sharded tables.
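
A rough sketch of the routing decision in step 5, using a cutover timestamp to choose between the legacy table and the shards. The cutover value, the Long.hashCode-based shard index and the table names are placeholders for illustration, not our exact rules.

    import java.time.Instant;

    public class OrderRouting {

        // Hypothetical cutover: orders created after the sharding release live in the shards.
        private static final Instant CUTOVER = Instant.parse("2023-01-01T00:00:00Z");
        private static final int TABLE_COUNT = 64;

        // Decide which table an order lives in: the legacy big table or one of the shards.
        public static String route(long orderId, Instant createdAt) {
            if (createdAt.isBefore(CUTOVER)) {
                return "order";                                                // historical data, not yet migrated
            }
            return "order_" + Math.abs(Long.hashCode(orderId) % TABLE_COUNT);  // new data in the shards
        }

        public static void main(String[] args) {
            System.out.println(route(1L, Instant.parse("2022-06-01T00:00:00Z"))); // order
            System.out.println(route(2L, Instant.parse("2023-02-01T00:00:00Z"))); // order_N
        }
    }

Once the backfill from the replica has finished, the check can be dropped and everything routes to the shards.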

At this point, the table sharding work is complete.

Business compatibility

After sharding, other parts of the business also need compatibility work, for example the existing reports and paging queries. Let's look at how we handled them.

Reports

First, reports. Before sharding, a report only had to query one table; now that one table has become N tables, things are different.

Therefore, the original query has to be changed to iterate over all the shards. For performance, the shards can be queried concurrently with multiple threads and the results then aggregated.

However, relying on Java alone to run statistical analysis over that much data is not realistic. It can cope at first, but sooner or later a big data platform has to take over.
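
As a sketch of the concurrent fan-out, each shard can be queried on a thread pool and the partial results aggregated afterwards; countOrders here is a placeholder for the real per-table query (for example a JDBC SELECT COUNT(*)), not our actual DAO.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ShardedReport {

        private static final int TABLE_COUNT = 64;

        // Placeholder for the real per-shard query; replace with the actual DAO call.
        private static long countOrders(String table) {
            return 0L;
        }

        // Query every shard concurrently and sum the results.
        public static long totalOrders() throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            try {
                List<Future<Long>> futures = new ArrayList<>();
                for (int i = 0; i < TABLE_COUNT; i++) {
                    final String table = "order_" + i;
                    Callable<Long> task = () -> countOrders(table);
                    futures.add(pool.submit(task));
                }
                long total = 0;
                for (Future<Long> f : futures) {
                    total += f.get();   // wait for each shard, then aggregate
                }
                return total;
            } finally {
                pool.shutdown();
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(totalOrders());
        }
    }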

Queries

The other is ordinary queries. The original paging queries are clearly no longer viable; after all, paging through hundreds of millions of rows makes no sense.

You can only offer queries keyed on the sharding field, such as the order ID, and the query criteria must include that field; otherwise every shard would have to be scanned.

This is a limitation shared by all sharding schemes, unless you move away from a relational database like MySQL.

Splitting databases

Once the tables are sharded, the pressure on any single table is gone, but the pressure on the database itself does not drop.

Within a month of finishing the table sharding, database I/O rose again because of writes to "other tables" in the same database that had little to do with the core business.

In other words, non-essential data was dragging down the whole business, which is very uneconomical.

So we moved those tables into a new database, completely separate from the existing business.

This involves several modifications:

  1. The application no longer queries or writes this data directly; instead it calls a separate Dubbo service that operates on the migrated tables.
  2. We did not migrate the data right away, so queries also needed a compatibility shim: old data is still queried in the current database, while new data is fetched through the Dubbo interface (a sketch follows this list).
  3. Join queries involving these tables also have to be turned into calls to the Dubbo interface, with the results stitched together in memory.
  4. If the data volume is really large, the synchronous Dubbo interface can be switched to writing to a message queue to improve throughput.
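
As a sketch of the compatibility shim in item 2, old rows can be read from the current database and new rows through the remote service (exposed via Dubbo in our setup); LogService, LocalLogDao and the cutover id are hypothetical names, not our real interfaces.

    import java.util.List;

    public class LogQueryFacade {

        // Hypothetical remote interface; in practice it would be exposed as a Dubbo service.
        public interface LogService {
            List<String> queryByOrderId(long orderId);
        }

        // Hypothetical DAO over the legacy tables still living in the business database.
        public interface LocalLogDao {
            List<String> queryByOrderId(long orderId);
        }

        // Orders created after this id are written to the new, separate database.
        private static final long CUTOVER_ORDER_ID = 100_000_000L;

        private final LogService remoteLogService; // injected Dubbo reference
        private final LocalLogDao localLogDao;

        public LogQueryFacade(LogService remoteLogService, LocalLogDao localLogDao) {
            this.remoteLogService = remoteLogService;
            this.localLogDao = localLogDao;
        }

        // Old data stays in the current database; new data is served by the remote service.
        public List<String> queryLogs(long orderId) {
            if (orderId < CUTOVER_ORDER_ID) {
                return localLogDao.queryByOrderId(orderId);
            }
            return remoteLogService.queryByOrderId(orderId);
        }

        public static void main(String[] args) {
            LogQueryFacade facade = new LogQueryFacade(
                    orderId -> List.of("remote log for order " + orderId),
                    orderId -> List.of("local log for order " + orderId));
            System.out.println(facade.queryLogs(1L));           // served from the local database
            System.out.println(facade.queryLogs(200_000_000L)); // served by the Dubbo service
        }
    }

Joins that previously happened in SQL would similarly be split into a local query plus a remote call, with the pieces stitched together in memory.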

Now that these tables, which hold huge amounts of data but matter little to the core business, have been moved into their own database, the overall database I/O has dropped significantly and the business has returned to normal.

Conclusion

Finally, historical data needs archiving. Data older than N months should be migrated periodically to storage such as HBase, so that what stays in MySQL remains at an acceptable size.

Queries over the archived data are then served by the big data platform.

Splitting databases and sharding tables in production is a fairly rare hands-on experience; most material on the Internet is about changing the tires before the car has even left the factory.

What we usually face instead is changing the tires while the car is racing down the highway, where one careless move wrecks it.

If you know a better way to do this, let's discuss it in the comments.

Your likes and shares are the biggest support for me.