Once a business reaches a certain scale, for example Taobao handling more than 50 million orders a day and Meituan more than 30 million, the database comes under massive data pressure, and sharding (splitting databases and tables) becomes unavoidable. After sharding, some routine queries become problematic, the most common being paging. We usually call the sharding column the shardingKey; for example, the order table might be sharded by user ID. So how do you page when the query conditions do not include the user ID? More generally, how do you handle multi-dimensional queries that do not carry the shardingKey at all?

Unique primary key

Normally our database primary keys are auto-increment, so after sharding, primary key conflicts across tables are unavoidable. The simplest solution is to use a unique business field as the primary key; in an order table, for example, the order number must be globally unique anyway.

There are many common ways to generate unique IDs. The best known are the Snowflake algorithm, Didi's Tinyid, and Meituan's Leaf. Snowflake, for example, can generate 4096 IDs per node per millisecond, or more than 4 million IDs per millisecond in total across its 1024 possible nodes.

The first bit is unused and always 0; the next 41 bits are a millisecond timestamp, enough for about 69 years; the 10-bit machine ID is split into a 5-bit data center ID (high bits) and a 5-bit node ID (low bits); and the final 12 bits are a sequence number that each node increments within a millisecond, allowing 2^12 = 4096 IDs per node per millisecond.
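For reference, a minimal Snowflake-style generator in Java could look like the sketch below. The epoch, the exact bit widths and the absence of clock-rollback handling are simplifying assumptions for illustration, not a production implementation.

```java
// Minimal Snowflake-style ID generator: 41-bit timestamp | 10-bit worker ID | 12-bit sequence.
// The epoch is arbitrary and clock rollback is not handled here (a real implementation must).
public class SnowflakeId {
    private static final long EPOCH = 1672531200000L;          // custom epoch, e.g. 2023-01-01
    private static final long WORKER_BITS = 10L;
    private static final long SEQ_BITS = 12L;
    private static final long MAX_SEQ = (1L << SEQ_BITS) - 1;  // 4095

    private final long workerId;      // 5 bits data center + 5 bits node, 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeId(long workerId) {
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQ;   // same millisecond: bump the sequence
            if (sequence == 0) {                   // 4096 IDs used up, spin until the next millisecond
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQ_BITS))
                | (workerId << SEQ_BITS)
                | sequence;
    }
}
```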

Table sharding

The first step was making sure the order number stays unique after sharding, which the scheme above handles. Now consider the sharding itself: decide how many tables to split into based on your current business volume and expected growth.

For example, suppose we currently take 100,000 orders a day and expect to reach 1 million a day within a year. Per our business rules, we only support querying orders from the last six months; older orders are archived.

So at 1 million orders a day over half a year, without sharding the order table would grow to 1 million × 180 = 180 million rows. A single table cannot carry that volume, and even if it somehow could, the response time would be unacceptable. Experience says that a single table with a few million rows puts no real pressure on the database, so splitting into 256 tables is enough: 180 million / 256 ≈ 700,000 rows per table, or 512 tables to be safe. Following the same reasoning, if the volume grew another tenfold to 10 million orders a day, 1024 tables would be a good choice.
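As a quick sanity check, the arithmetic above can be written out as follows; the numbers are just the example figures from this section.

```java
// Back-of-the-envelope capacity check using the example figures above.
public class CapacityCheck {
    public static void main(String[] args) {
        long dailyOrders = 1_000_000L;               // projected daily order volume
        long retainedDays = 180;                     // only the last half year stays online
        long totalRows = dailyOrders * retainedDays; // 180,000,000 rows
        int tableCount = 256;                        // a power of two keeps routing cheap
        System.out.println(totalRows / tableCount);  // 703125 rows per table
    }
}
```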

With 256 tables plus archiving of data older than half a year, roughly 700,000 rows per table is sufficient for most scenarios. The order number is then hashed and taken modulo 256 to decide which table a row falls into.
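A routing helper based on that rule might look like the following sketch; the t_order_N table naming convention is an assumption for illustration.

```java
// Locate the physical table for an order by hashing its order number.
// With 256 tables (a power of two) the modulo could also be written as hashCode() & 255.
public class OrderTableRouter {
    public static String route(String orderNo, int tableCount) {
        int index = Math.floorMod(orderNo.hashCode(), tableCount); // floorMod avoids negative indexes
        return "t_order_" + index;                                 // e.g. t_order_0 .. t_order_255
    }
}
```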

Since the unique primary key is now the order number, any existing queries written against the auto-increment ID will no longer work, which means some changes to historical query code. But that is not a big deal; just switch those queries to the order number. The real problem is the one in the title.

C-end queries

After all that setup, we finally reach the topic: after sharding, how do we handle queries, and paging queries in particular?

First, queries that carry the shardingKey, such as lookups by order number, can be routed directly to the specific table, so paging or anything else is obviously no problem.

What if the query does not carry the shardingKey? For example, with the order number as the shardingKey as above, apps and mini programs usually query by user ID, so how do we route to the right table? Many companies simply use the user ID as the shardingKey for the order table, in which case the lookup is trivial. Another easy trick is to embed part of the user ID in the order number. As a simple example, keep the 41-bit timestamp in the order number and append the low 10 bits of the user ID. If the table is chosen by taking those 10 user-ID bits modulo the table count, the routing result is the same whether you query by user ID or by order number.
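For illustration only, here is a rough sketch of such a scheme; the bit layout and helper names are made up for this example, not a real production format.

```java
// Hypothetical order-number layout: [snowflake-style high bits][low 10 bits of user ID].
// Because the user ID and the order number expose the same low 10 bits,
// routing by either value lands on the same table.
public class OrderNoSharding {
    private static final int USER_BITS = 10;
    private static final long USER_MASK = (1L << USER_BITS) - 1;   // low 10 bits

    public static long buildOrderNo(long snowflakePart, long userId) {
        return (snowflakePart << USER_BITS) | (userId & USER_MASK);
    }

    public static int shardByUserId(long userId, int tableCount) {
        return (int) ((userId & USER_MASK) % tableCount);
    }

    public static int shardByOrderNo(long orderNo, int tableCount) {
        return (int) ((orderNo & USER_MASK) % tableCount);         // same low bits, same result
    }
}
```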

Of course, this approach is only an illustration; the actual rules for generating order numbers, how many bits to use and what factors to encode in them, depend on your own business and implementation.

So whether you use the order number or the user ID as the shardingKey, the C-end problem can be solved either way. The next question is: what if the query carries neither an order number nor a user ID? The most obvious example is queries from the merchant side or from the admin back office. The merchant side searches by the merchant's or seller's ID, and back-office query conditions can be far more complex; some back-office searches I have dealt with had dozens of conditions. How do we handle those? Don't worry, let's look at the complex B-end and back-office queries separately.

In reality, the bulk of real traffic comes from the consumer (C) end, so once the C-end problem is solved, more than half the battle is won. The remaining query traffic from the merchant/seller (B) end and from back-office operations and business staff is much smaller, and that problem is easier to solve.

Other-end queries

There are two common ways to handle B-end queries that do not carry the shardingKey.

The first is double writing: store the order data twice, one copy for the C end and one for the B end. The C-end copy can be sharded by order number or user ID, and the B-end copy by seller ID. Some of you will ask: doesn't double writing hurt performance? Not really, because a slight delay is acceptable on the B end, so the B-end copy can be written asynchronously. When you place an order on Taobao, does it matter if the seller sees it one or two seconds later? When you order takeout, what difference does it make if the merchant receives the order a second or two late?
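A rough sketch of the asynchronous double write is shown below; the DAO and message-queue interfaces are placeholders for whatever persistence layer and MQ client you actually use.

```java
// Double write, with the B-end copy done asynchronously via a message queue.
// OrderDao and MessageQueue are placeholders, not a specific library's API.
interface OrderDao { void insert(Order o); }
interface MessageQueue { void send(String topic, String payload); }
record Order(long orderNo, long userId, long sellerId, String payloadJson) {}

class OrderWriteService {
    private final OrderDao cEndDao;     // tables sharded by order number / user ID
    private final MessageQueue mq;      // carries the event to the B-end writer

    OrderWriteService(OrderDao cEndDao, MessageQueue mq) {
        this.cEndDao = cEndDao;
        this.mq = mq;
    }

    void createOrder(Order order) {
        cEndDao.insert(order);                          // synchronous write on the critical path
        mq.send("order-created", order.payloadJson());  // a consumer writes the B-end copy,
                                                        // sharded by sellerId, a second or two later
    }
}
```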

This is one solution. The other is an offline data warehouse or an ES index. Once the order lands in the database, synchronize it to the warehouse or ES, whether via binlog or MQ messages; the data volumes those systems support make this kind of query trivial. Again there is a slight delay, but a delay within a controllable range is acceptable.
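On the consuming side, the sync can be as simple as the sketch below. SearchIndexer stands in for whatever ES or warehouse client you use, and the "orders" index name is an assumption.

```java
// Consumer side of the sync: take order events (from binlog via a CDC tool, or from MQ)
// and index them into ES / the data warehouse for non-shardingKey queries.
interface SearchIndexer { void index(String indexName, String id, String documentJson); }

class OrderIndexConsumer {
    private final SearchIndexer indexer;

    OrderIndexConsumer(SearchIndexer indexer) {
        this.indexer = indexer;
    }

    // Called for every order-created / order-updated event.
    void onOrderEvent(long orderNo, String orderJson) {
        indexer.index("orders", String.valueOf(orderNo), orderJson);  // idempotent upsert keyed by order number
    }
}
```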

As for queries from the management back office, where operations, business and product staff need to look at data with naturally complex query conditions, the same ES or data-warehouse approach works. Without such a solution, and without a shardingKey to page on, you would have to scan every shard, aggregate the results yourself and page manually, and there are hard limits to what you can retrieve that way.

For example, with 256 shards, a query would have to loop over all of them, take 20 rows from each, and then merge the results in memory; retrieving the full data set that way is simply not feasible.
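To make the limitation concrete, a scatter-gather pager roughly does the following. ShardDao stands in for a per-shard "ORDER BY ... LIMIT" query; note how every shard must over-fetch pageNo × pageSize rows, which is why deep paging across many shards degrades quickly.

```java
import java.util.*;

// Scatter-gather paging across all shards: each shard returns the first
// pageNo * pageSize rows, the results are merged in memory, then the page is cut.
class ScatterGatherPager {
    interface ShardDao { List<OrderRow> query(int shardIndex, int limit); }
    record OrderRow(long orderNo, long createTime) {}

    static List<OrderRow> page(ShardDao dao, int shardCount, int pageNo, int pageSize) {
        int needed = pageNo * pageSize;              // every shard must over-fetch this many rows
        List<OrderRow> all = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            all.addAll(dao.query(i, needed));        // e.g. ORDER BY create_time DESC LIMIT needed
        }
        all.sort(Comparator.comparingLong(OrderRow::createTime).reversed());
        int from = (pageNo - 1) * pageSize;
        return from >= all.size() ? List.of()
                : all.subList(from, Math.min(from + pageSize, all.size()));
    }
}
```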

Conclusion

For experienced engineers, the question of how to query after sharding is well known. But I suspect most engineers working on ordinary business systems have never reached this order of magnitude, so sharding stays at the conceptual stage, and when the question comes up in an interview they are at a loss, simply because they have never done it.

Sharding decisions should be based on the existing business volume and the expected growth. For example, with a daily order volume like Pinduoduo's 50 million, half a year of data reaches the billions, and splitting into 4096 tables is reasonable. In practice, though, your business may not need anywhere near 4096 tables, so make a reasonable choice based on the business.

Queries that carry the shardingKey are easy to solve. Queries that do not can be handled with double writing, a data warehouse, or ES. And of course, if the data volume per table after sharding is small enough, building the right indexes and scanning across all the tables is not a problem either.