background

A few days ago, I heard a colleague interviewing candidates on the phone at the mobile station. He asked me, “Have you ever contacted with the inventory and inventory list? What are the problems to solve?”

Yes, what is the principle of separate database and separate table, and how should it be realized? Was MY idea right at that time?

introduce

Single library table problem

Usually at the beginning of a business, a set of related data is stored in a single table, such as the User table, article table, and so on. With the development of services, the amount of data stored in a single table may be huge, reaching tens or even hundreds of millions. At the same time, a table may have dozens or hundreds of fields, which may cause a series of problems:

  • Insufficient resources
    • The IO and connection number of a single machine are limited. Once the concurrency is too high, DB will become the bottleneck of the whole system
  • Read and write efficiency decreases
    • Write operations, lock conflicts will be more obvious
    • For read operations, some batch queries must be accompanied by frequent I/OS due to the large amount of data and too much single row data. Therefore, it is difficult to hit the cache
  • Hot and cold data mixed
    • Some of the fields don’t actually need to be read or written very often, resulting in hot and cold rows of data
  • The index inflation
    • The increase in data volume is accompanied by the increase in index volume, and the operation time of index is also increased

The paving work before the sub-database sub-table

Do not over-design, do not optimize too early.

Do not divide libraries and tables in the pursuit of introducing new technologies, which can lead to exponential system complexity. Before carrying out the sub-table, at least the following paving work should be done:

  • SQL optimization reduces slow queries and frees up connection resources
  • Index optimization reduces non-index queries, and the cost of new indexes is much lower than architectural changes
  • Increase cache to reduce DB pressure, low transformation cost
  • Master/slave write operations and read operations in transactions go to the master library, while other read operations go to the slave library. Note the master/slave latency

For most business systems, there are far more reads than writes, and the problem should be greatly alleviated by increasing caching and separating master from slave.

But even if the above work is done, it still does not solve the fundamental single table data volume is too large and too many fields, in the process of business development, the problem will certainly occur again. At this point, there are few good ways to do this other than to divide the database into tables or modify the business logic to use a NoSql database.

What are horizontal/vertical sub-libraries/sub-tables

  • Branching means that tables stored in a single repository are distributed in multiple libraries
  • Split table Stores a large table across multiple tables

The vertical table

Split the table by column according to the correlation degree or business attributes, for example, split the frequently displayed “core information” and the corresponding “detail information”.

The structures of the split tables are different. The data is correlated with each other and overlaps a little, for example, user_id.

The level of table

Store rows in different tables according to certain policies, such as split by creation time/ID hash, etc.

The split table has the same structure but does not store overlapping data.

Vertical depots

Separate tables with low business correlation into different libraries, for example, “user database” and “commodity database”.

The level of depots

The multiple tables after horizontal division are again dispersed to different libraries for storage to reduce the pressure of single library.

Problems solved by separate database and separate table

Core purpose of sub-library:

  • Expand the machine resources of the original single library to multiple libraries, including IO, connection number, etc
  • Separate hot and cold data to allocate more resources to frequently accessed data

The core purpose of the sub-table:

  • Reduce the number of single-table columns, reduce the size of data blocks, reduce the NUMBER of I/OS and improve cache utilization, reduce the pressure of single-table I/OS (vertical sub-table)
  • Hot and cold data separation, data can be divided into infrequently changed data and frequently changed data, and different access strategies can be assigned (vertical partition table)
  • Reduce the number of rows in a single table and store data across multiple tables (horizontal table)
  • Separation of hot and cold data, processing of current data and outdated data separately in some time-sensitive scenarios (horizontal sub-table)
  • Reduce the number of indexes in a single table to improve read and write efficiency (horizontal sub-table)
  • The premise of some sub-database operations, that is, the table can be distributed in different libraries after the design of sub-table

Sub – library sub – table my thinking

  1. An intermediate layer is established to make non-invasive connection between the application layer and the database, and all configuration and routing problems are handled in the middle layer
  2. Design routing rules and configure the required libraries
  3. Receive and parse SQL statements at the application layer, and extract key information such as table names and field names
  4. The name of the target database table is calculated according to the routing rules. If the query targets are distributed in multiple database tables, split the single query into multiple queries
  5. Generate a new SQL statement or replace an existing statement
  6. Query in the corresponding library table
  7. Merge the returned results based on the query statement type, such as COUNT, SUM, ORDER BY, and so on
  8. Return the query result

Of course, in practical applications, more system protection needs to be considered, including degraded fuses, read/write separation, downtime, capacity expansion, and so on.

How to implement the sub-table

Common sub-database sub-table middleware are:

  • ShardingSphereFormerly known as Sharding-JDBC, it is now a top-level project of the Apache Foundation
  • TDDLAli open source (open source?)
  • MyCatFolk open source

The branch database and branch table middleware mainly deals with the problem of horizontal branch database and branch table, while the vertical branch database and branch table itself processing logic is relatively simple and easy to implement and maintain in the application layer.

The most important capabilities of database and table are SQL parsing, routing and forwarding, as well as related supporting capabilities including configuration and capacity expansion.

Taking TDDL as an example, the processing flow is as follows:

Generally speaking, business evolution and transformation process:

  1. Single library table
    • In this case, you need to consider data disaster recovery (Dr)
    • Data volume of tens of millions of complete support
  2. SQL optimization
    • Avoid slow query that occupies connection resources
  3. The index optimization
    • Optimization index has low relative cost and high input-output ratio
  4. Added cache/read/write separation
    • Most businesses that read too much and write too little help
  5. Vertical depots
    • Increase machine resources and reduce query pressure
    • When in-depth optimization begins, transformation becomes more difficult and maintenance becomes complicated. Attention should be paid to the following aspects:
      • Data migration
      • Distributed transaction
      • Unique ID of the play
      • Cold and hot data separation, can be separate to the hot area data sub-database sub-table
      • Server Resource Planning
  6. The vertical table
    • The cost of transformation is higher, and the logic at the application layer may need to be modified, for example, double-write data
  7. The level of table
    • When the amount of data in a single table is too large, you need to use middleware to divide tables
  8. The level of depots
    • Capacity expansion: Adds machine resources
  9. If that’s not enough, try another storage option, such as a NoSql database

Problems brought by separate database and separate table

Since the primary key

After tables are divided, ids cannot be added to each table synchronously. In this case, a globally unique ID generation scheme, such as Snowflake, is required.

Data update

Some single-table updates become multi-table updates after vertical database and table partitioning.

Database transaction

Because of the cross – library problem, the original database transaction is difficult to support, need to use distributed transaction implementation.

Associated operation

Because cross-library, join and other operations cannot be implemented, associated query is difficult to achieve, most applications through redundant storage or other storage methods to achieve.

Foreign key constraint/full text index

There is no way to support full-text indexes or foreign key constraints after a split table, and as some references point out, it is best to avoid such operations as much as possible in today’s business itself, and I have never encountered an application scenario in my experience.

Performance overhead

Some of the original single-table queries may become multi-table queries after the sub-database sub-table, and additional data splicing processing overhead. But these extra costs are acceptable compared to the pressure of reading and writing massive data from a single table.

Operations management

From single database single table to sub-database sub-table, maintenance difficulty increases exponentially, especially when it involves capacity expansion, downtime and so on.

conclusion

In essence, the sub-database and sub-table middleware separates the multi-table query and result processing capability of application layer for abstraction and standardization.

Although the core idea is not complicated, the systematization of the entire service, including maintenance and disaster recovery, is the most difficult part. At the same time, when the interviewer asked the question of database and table, then can expand the content is very rich, including global unique ID, database lock, distributed transaction, master/slave synchronization, load balancing, consistent hash, limiting traffic, cache…… Being able to relate to a lot of knowledge is a great place to start.

The resources

Overview: : ShardingSphere

Distributed database middleware – Introduction of TDDL – Zhihu

No NoSQL/NewSQL, also do not partition, direct sub-database sub-table! _ storage

Mycat1.6

This article moves my blog, welcome to visit!