With the rapid development of Internet, can lead to huge amounts of data, the data volume is relatively small, the traditional processing way is to store the data in relationship or non-relational database, but gradually increased with the amount of data, a single database table has been difficult to accommodate all the data, so the concept of industry appeared the depots table. Using the idea of knowledge, the data is perfectly split, but it also brings many thorny problems, such as the introduction of distributed transactions, capacity expansion and so on.

History of database usage

Using a database in an application goes through three phases

  • Single database and single table: in the initial application phase, the amount of data in this phase is less than the threshold of the database, so the application performance is not affected.
  • Single database table, because a table in the database database is too large, has a certain impact on the performance of the application, such as query, etc., a table will be divided into TABLE_1, TABLE_2,table_N, a table split N small tables. Note The disk capacity is sufficient at this stage. However, it is more about the partitioning of the database that is used. The partitioning principle is very similar to the partitioning principle of the mysql hash

Recently organized a summary of information, information covering the first line of large factory Java interview questions summary + each knowledge point learning thinking guide + a 300 page PDF document Java core knowledge points summary! Partners who wish to receive the PDF can scan the image below to receive the document for free


CREATE TABLE `test_user_hash` ( `user_id` bigint(19) NOT NULL, `user_name` varchar(50) NOT NULL, `ext_int` int(2) NOT NULL, `ts` bigint(19) NOT NULL, PRIMARY KEY (`user_id`,`ext_int`)) ENGINE=InnoDB DEFAULT CHARSET=utf8; ALTER TABLE 'test_user_hash' PARTITION BY HASH(ext_int) PARTITIONS 3; Copy the codeCopy the code

In the mysql database, the storage format is divided into three files because the above is based on 3hash

  • Sub-database sub-table, the above two can not be solved, the emergence of sub-database sub-table scheme, that is, the single database data scattered in multiple databases.

When do we need separate tables

In principle, the database can be divided as far as possible. When it is unavoidable or there has been a trend to show that the database is divided into tables, the database is divided into tables.

  • The throughput of database has reached the bottleneck, and it needs to expand multiple database instances to improve it.
  • When the amount of data in a table reaches a certain level, the performance of application queries is significantly affected. You can improve the performance by dividing databases and tables. Some data shows that the query performance is affected when the amount of data in a single table of Mysql database exceeds 5000W
  • To avoid complex capacity expansion, estimate the data volume in N years based on the data growth trend count, count/single database capacity = required database instances, which is planned in advance to prevent future problems.

Common resolution schemes

There are two common split schemes: vertical split and horizontal split, database split table is a common solution to the database split.

  • The vertical resolution

Vertical split is based on service characteristics, some related tables are centrally stored in a DB, and the data volume of these tables is generally not too large. For example, there are user modules and order modules in the e-commerce system

  • Horizontal split

The same table structure exists in each DB and the data is spread across multiple dB according to certain rules

Sub-library sub-table implementation scheme

There are three main implementation schemes as follows

  • Client Sharding
  • Proxy implementation sharding
  • Distributed database

Client Sharding

Client sharding is generally implemented in two ways. One is directly implemented at the application layer, which contains sharding logic and sharding algorithm and is tightly coupled with the business code

The application layer implements all the logic, and the business people need to be involved.

The other is to implement the standard JDBC protocol, to provide the application wrapped JDBC, to use the application insensitive, implementation logic as a JAR, embedded in the application, the application can be flexibly switched

This method implements the standard JDBC interface and has no impact on the application using native JDBC. The two methods follow the uniform specification and have the advantage of decoupling from the business code compared with the first method. Increase flexibility.

Agent shard

Way to realize the way is to increase agent layer between the application and the database, the deployment of independence, the agent ACTS as the role of the data, use a proxy is equivalent to the database for application, in principle, use the proxy and the direct use of database is no difference, but the agent, after all, not real database broker layer just solve how to make full use of the database resources, The agent layer implements all the sub-database and sub-table logic, including sharding rules, etc. Business people need not pay attention to it and can spend more time on the business implementation logic.

Typically, a layer of load is added outside the agent layer.

This way can let business people more focus on business, but complexity compared to the first one much higher, increase the communication link, involve the protocol conversion, so will the performance compared to the first scheme has obvious loss, at the same time also to the requirement of personnel is higher, need technology to support for sure, otherwise once the problems difficult to handle. More familiar with Mycat, because I have done deep secondary development based on Mycat, have a certain understanding of the source code, defects really very… , hope users carefully consider, beside the point o(), o

Distributed database

TiDB provides a scalable architecture for external use, and provides certain distributed transactions. The scalable and distributed transactions are wrapped in the internal implementation, and users do not need to directly control these features. For example, TiDB provides JDBC interface, and the application layer uses TiDB in the same way as the directly connected MySQL database

Problems brought by separate database and separate table

  • After data is segmented, it is scattered in different DB. When the native Join operation of database is used, cross-library Join exists and the performance is poor.
  • With the introduction of distributed transactions, the consistency of distributed transactions is difficult to solve.
  • For example, if you want to query 10 items of data 100w later, limit 100000010.
  • The difficulty of capacity expansion without shutdown increases

Recently organized a summary of information, information covering the first line of large factory Java interview questions summary + each knowledge point learning thinking guide + a 300 page PDF document Java core knowledge points summary!

Partners who want to receive the PDF can get the document for free by scanning the image below