1. Why separate databases and tables

Certainly is the database performance bottleneck, it needs to be divided into tables, but as long as the performance bottleneck problem must be divided into tables? Not necessarily

1.1 What problems may occur in the database

  • Unable to get a connection. The number of database connections is a resource. Due to the limitation of database hardware, it is generally necessary to control the maximum number of connections. When the amount of concurrency reaches a certain level, the database connection will be insufficient.
  • Operation data becomes slow. The amount of data in a table is too large, slowing down insert, update, query, and delete operations.
  • The disk is insufficient. The disk capacity of a database is insufficient or close to the maximum capacity due to a large amount of data. As a result, data writes and queries are slow.

1.2 How should problems be analyzed and solved

How should we think about these questions?

  • Failure to obtain connections :(1) you can change a single operation to batch operation, especially to create records, to reduce the number of connections. (2) Modify the database configuration, increase the maximum number of connections and buffer.
  • Slow operation data problems :(1) optimize SQL and indexes. For slow query statements, you can optimize the query SQL to use indexes as much as possible. For update and insert statements, you can change a single operation to a batch operation. (2) Optimize table structure and storage engine. Select a storage engine based on the usage scenario of the table (read more and write less, write more and read less), or split and optimize the table fields. (3) Use cluster, do load balancing, use master and slave library to do read and write separation. (4) Cache. Do a layer of caching before querying the database. (5) Sub-database sub-table. Storing data in different databases or tables reduces the stress of database storage and access.
  • Insufficient disk problems :(1) sub-library sub-table. Reduce the stress of single database storage. (2) Expand disk space

2. What problems do sub-database sub-tables solve

It addresses the stress of single database storage and access. After the repository is split, a single database stores less data and responds to fewer requests.

In general, performance problems can be solved by dividing database and table, but dividing database and table is not the optimal solution of performance problems. For example, the connection cannot be obtained because the maximum number of connections for a single database is limited, and the number of connections can be expanded if the database is divided. For example, the operation of the data is slow, and the operation is also divided into multiple database vendors, so the operation will be accelerated. For example, if the disk is insufficient, different libraries in different physical machine vendors will have larger disk capacity than a single database.

3. What are the problems caused by the sub-table

3.1 Distributed consistency

Because data is stored on different nodes, data inconsistency may occur. A study of distributed inconsistency can be found in the article distributed consistency

3.2 Cross-node association table

What if table A of library A needs to associate query with table B of library B? The solutions are as follows:

  • Redundant fields. If only one or several fields of table B need to be associated with table A, you can consider redundancy of these fields in table A so that table B does not need to be associated with table A.
  • Basic data redundancy table. Some basic data, such as countries and currencies, are associated with different databases and stored in each database redundantly.
  • Redundant service tables. In the scenario where “table A of library A needs to be queried associated with table B of library B”, database A can store table B of library B redundantly and keep the data consistent with table B of library B through the data synchronization mechanism. In this way, database A can directly associate redundant tables.
  • The binding table. The data to be associated is placed on a node. For example, the database and table are divided by tenant. There is no associated query between different tenants. The data of the same tenant can be directly associated on the same node.

When there is a problem associated with cross-reception, you must think about whether the business logic is reasonable. If cross-node association is really needed, it is generally through RPC to call another node’s data, assemble it and then return it to the front end or a third party, without external perception.

3.3 Cross-node paging, sorting, function calculation

Limit paging and order by sorting can occur when querying across multiple libraries. For example, if there are two nodes, node 1 stores an odd number of ids =1,3,5,7,9… ; Id =2,4,6,8,10… Run the select*from user_info order by id limit 0,10 command to fetch 10 items from each node, merge the data, and sort the items again.

When functions such as Max, min, sum and count are calculated, corresponding functions need to be performed on each shard first, and then the result sets of each shard are summarized and calculated again, and finally the results are returned.

If the i18N table is large, the performance of the i18N table is very low. If the I18N table is large, the performance of the i18N table is low.

3.4 Global Primary Keys Are Not Unique

The MySQL database has an incremented field, and Oracle has a Sequence. If it is a database, then you can ensure that the ID is not repeated, but after the horizontal table, each table in accordance with its own law of self-increment, there will certainly be the problem of ID duplication, this time we can not use the local self-increment way.

Solution:

  • UUID. The disadvantages are not incremental and can be repeated in a distributed environment.
  • Store ids uniformly based on the database. All services that require ids are fetched from the same database, which ensures that ids are incremented and unique. But this adds a layer of calls, which can also strain the database when requests are high.
  • Based on the redis. Faster than the database, but also added a layer of call, there is a failure, timeout risk
  • Snowflake algorithm. However, it relies heavily on the machine clock, and if the clock is rolled back, duplicate ids may be generated.

3.5 Database expansion and Data migration

Some sub-database and sub-table strategies are inconvenient during database expansion. For example, when the odd-even table by ID is still too large and another table needs to be added one day, a table division strategy needs to be changed, such as module 3 or module 5, and the data of the previous two tables need to be read out and divided again.

Some tables are divided by month. If one table is added each month, historical data does not need to be migrated.

If value range sharding is used, you only need to add nodes for capacity expansion, and there is no need to migrate shard data. It is relatively troublesome to consider the expansion problem in the later period if the numerical module sharding is adopted.

4 reference

[1] What are the thorny problems caused by MySQL database and table? [2] Problems and solutions to be considered in sub-database sub-table [3] Why is it necessary to sub-database sub-table [4] Distributed consistency those things