preface

Talking about technical solutions without looking at specific business requirements and scenarios is like talking about empty talk. There is no single technology or solution that is absolutely good or bad. It’s a relative distinction, otherwise how did these technologies or solutions come about? Must also be to solve some kind of specific scene of the problem and produced at that time can be described as “advanced” technology.

From 0 to 1

When the system was new, the estimated volume of traffic and data was not large. Later, the business is doing better and better, the amount of data is more and more large, it is found that the single database single table can not meet the demand, the need for sub-database sub-table. The direction of your database might look like this:

Very few businesses will be designed as a sub-database sub-table at the beginning, although this will reduce the subsequent pit, but some companies are mainly business at the beginning.

Horizontal VS vertical

If a single library is too large, then we need to see if there are too many tables causing too much data, or if there is too much data in a single table. If there are too many tables and too much data, use vertical shard to split into different libraries based on the business.

If the amount of data in a single table is too large, then horizontal segmentation is used, that is, the data of the table is divided into multiple tables according to certain rules, or even multiple tables in multiple libraries. The order of the sub-database sub-table should be vertical points first, then horizontal points. Because vertical division is simpler and more in line with how we deal with real-world problems.

  • The vertical table

    That is, “big table split small table”, based on the column field. Generally, there are many fields in the table, and the uncommonly used, large data and long length (such as text type fields) are split into “extended table”. Generally for the kind of hundreds of columns of the large table, but also to avoid the query, the amount of data caused by the “cross-page” problem.

  • Vertical depots

    Vertical branch library is aimed at the separation of different businesses in a system, such as User a library, commodity Producet a library, Order Order a library.

  • The level of table

    For a single table with a large amount of data (such as order table), according to certain rules (RANGE,HASH modulus, etc.), cut into multiple tables.

  • Horizontal sub-database sub-table

    The data of a single table is shard to multiple servers, each server has the corresponding library and table, but the data set in the table is different

So you start table partitioning: a large table is distributed as evenly as possible among N smaller tables by some routing algorithm.

There are several types of table splitting strategies, which are applicable to different scenarios. For example, by creating a time, a primary key ID, or the popular hash+mod combination (a hashmap-like strategy)

The hash here is a hash of the fields that we want to split into tables, so that the hashed data is as uniform and non-repetitive as possible. Of course, if the field itself is an integer and does not repeat can also omit this step, directly Mod to get sub-table subscript can be.

Of course, the table is only the first step, but also do data migration, unless it is done at the beginning of the table. There are more pits to fill in for paging queries, groupBY, multi-table joins, and transactions.

At present, there are many sub-database and sub-table middleware on the market, which may be used to solve the problems mentioned above. For example, Atlas (360), mysql-proxy, DBProxy (modified by Meituan on Atlas), Mycat(which is the evolutionary version of Cobar, Cobar is the evolutionary version of Amoeba, both of which are alibaba’s)

The details and pits of the sub-database sub-table are not discussed here and are not the focus of this article.

Record identifier

Returning to the record identifier for the table, we have two main requirements for this identifier:

  • Globally unique
  • Trends and orderly

Here’s how:

1 Use auto_increment to generate a globally unique increment ID. This should not be a problem, if you use MySQL, this is also the official recommendation. Without talking about the advantages, let’s talk about the disadvantages:

1) Availability is difficult to ensure: the common database architecture is one master, many slaves + read and write separation; 3) Poor scalability and upper limit of performance: because the write is a single point, the write performance of the main database determines the upper limit of ID generation performance and is difficult to expandCopy the code

2. Improved method: If we adopt the design of divided database and divided table, we can set different auto_INCREMENT initial value and the same increment step for each write library. To ensure that each database generates a different ID. However, the disadvantage is the loss of the “absolute increment” of ID generation, the database write pressure is still very large, every time to generate ID to access the database. At this time, we can use a single point of batch ID to generate services, with one ID to generate services, and first generate multiple ids in batch. So the application accesses the ID generating service and asks for the ID, so the ID generating service doesn’t need to visit the database every time, which is much better, but the problem is that the service is still a single point.

3 UUID guarantees that all machines in the same space and time are unique. And can be generated locally, good scalability, basically can be considered no performance upper limit. Is a solution, but UUID also has disadvantages: it cannot guarantee the trend increase, UUID is too long, often represented by string, as the primary key index query efficiency is low (B+ tree in order to maintain balance, it will cause node page splitting and fragmentation of B+ tree).

4 Twitter SnowFlake is a very good ID generation solution, the implementation is very simple,8Byte is a Long,8Byte equals 64bit, the core code is the millisecond level time 41 bits +10 bits machine ID+ 12 bits in millisecond sequence, You can also adjust the ratio of machine bits to sequence bits in milliseconds. It is possible to implement its own distributed ID generation algorithm by borrowing the ideas of Snowflake and combining the business logic and concurrency of each company. Such as:

The disadvantage is that there is “no global clock”, and the ids assigned to each server are absolutely incrementally increasing, but globally the generated ids are only incrementally increasing (some servers are early, some are late). Meituanleaf is a SnowFlake like solution.

5 Can even use redis INCR atomic operation autoincrement, but the security is not enough.

6 use ** TiDB** If you boldly adjust the database, do not use MySQL, but use TiDB, automatic expansion, business does not care about the database and table. TiDB is an open source converged database product developed by PingCAP, which is targeted at online transaction processing/online analysis processing (HTAP). It achieves one-click horizontal scaling, strong consistency of multi-copy data security, distributed transactions, real-time OLAP and other important features. At present, it has been widely used in financial services, Internet, manufacturing and other industries. However, the problem is that the AUTO_INCREMENT ID of TiDB is only guaranteed to be self-increasing and unique, but not guaranteed to be continuously allocated. TiDB currently uses batch allocation, so if data is inserted on multiple TIdbAs at the same time, the assigned auto-increment ID will be discontinuous. When multiple threads concurrently insert data to different TIDB-servers, the self-increment ID of the inserted data may be small. In addition, TiDB allows AUTO_INCREMENT to be specified for fields of integer type, and a table allows only one column whose attribute is AUTO_INCREMENT. But while there are problems, there are also solutions. If you are interested, check out the relevant information.

Reference:

  • www.infoq.cn/article/key…
  • Tech.meituan.com/2017/04/21/…
  • www.cnblogs.com/jajian/p/11…
  • www.cnblogs.com/butterfly10…