Background
- The core table of the storage project (the file-information table) has passed 10 billion rows in a single table (17 billion+). With business growth the table is growing fast, at more than double last year's rate, so a splitting plan needs to be prepared in advance.
- Storing and accessing this volume of data puts great pressure on the MySQL database and places high demands on the stability and scalability of the system.
- The resources of a single server (CPU, disk, memory, etc.) are always limited; sooner or later the amount of data and the processing load the database can bear will hit a bottleneck.
Database status
Table structure and indexes
```sql
CREATE TABLE `file_info` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `app_id` smallint unsigned NOT NULL DEFAULT '0',
  `file_key` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `file_size` int unsigned NOT NULL DEFAULT '0',
  `service_provider_id` tinyint unsigned NOT NULL DEFAULT '0',
  `bucket_id` tinyint unsigned NOT NULL DEFAULT '0',
  `store_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'File storage time',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uiq_file_key_appid` (`file_key`,`app_id`),
  KEY `idx_appid_filekey` (`app_id`,`file_key`),
  KEY `idx_store_time` (`store_time`)
) ENGINE=InnoDB AUTO_INCREMENT=19143687461 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC
```
- Statistics: 7 columns, 4 indexes
- Index tree heights: the four indexes (id, uiq_file_key_appid, idx_appid_filekey, idx_store_time) have B+ tree heights of 4, 5, 5, and 4 levels respectively; a rough estimate is sketched below
- Calculation reference: www.jianshu.com/p/67760a755…
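As a sanity check on those heights, here is a back-of-the-envelope estimate. All constants are assumptions (16 KB InnoDB pages, 8-byte bigint key plus 6-byte page pointer per internal entry, roughly 1 KB per row in the clustered-index leaf pages), not measured values:

```java
// Rough B+ tree height estimate under the stated assumptions.
public class BTreeHeightEstimate {
    public static void main(String[] args) {
        long nonLeafFanout = 16 * 1024 / (8 + 6); // ~1170 entries per internal page
        long rowsPerLeaf = 16;                    // 16 KB page / ~1 KB row
        long rows = 17_000_000_000L;              // current row count

        long capacity = rowsPerLeaf;              // a height-1 "tree" is one leaf page
        int height = 1;
        while (capacity < rows) {
            capacity *= nonLeafFanout;
            height++;
        }
        // Height 3 covers ~21 million rows and height 4 ~25 billion, so 17 billion
        // rows needs 4 levels; the wider varchar(255) keys of the secondary indexes
        // lower the fan-out and can push those trees to 5 levels.
        System.out.println("estimated clustered index height: " + height);
    }
}
```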
Query patterns
- The core flows that touch this table only do exact lookups by appId + fileKey, or batch lookups by fileKey (IN queries); no joins or other complex query operations are involved
Data volume
- The file-information table currently holds 17 billion+ rows, with 1.5 TB of data and 3.4 TB of indexes, and it is growing fast
- How the statistics were gathered:
```sql
SELECT
    table_name AS 'Table name',
    table_rows AS 'Row count',
    TRUNCATE(data_length / 1024 / 1024, 2) AS 'Data size (MB)',
    TRUNCATE(index_length / 1024 / 1024, 2) AS 'Index size (MB)'
FROM
    information_schema.tables
WHERE
    table_schema = '${database name to be counted}'
ORDER BY
    data_length DESC, index_length DESC;
```
DB health
The following metrics were reviewed (monitoring screenshots omitted):
- QPS
- Average query time
- CPU utilization
- Disk reads and writes
- I/O status
Conclusion
- The single table holds too much data and is growing fast, while the database's hardware resources and processing capacity are limited
- The indexes are too large; a single index tree has reached 5 levels, so query performance is poor
Splitting principles
- Existing functionality must not be affected
- Keep it as simple as possible; do not introduce complex components or logic
Splitting approaches
Different situations call for different treatments, mainly the following:
- Archiving/cleanup: for some tables, data bloat mainly comes from garbage or historical data that is never accessed after a certain period; such tables can be cleaned up or archived periodically.
- Splitting out databases: tables that are not strongly tied to the core business and are relatively independent are moved out of the main database into their own database, reducing pressure on the main database.
- Splitting tables: when the table is strongly tied to the business and its data volume has grown dramatically, split the table itself.
Priority: archiving > splitting out databases > splitting tables. Avoid table splitting unless necessary, as it introduces very high complexity.
Technology selection
There are many mature solutions in the industry for horizontal splitting, falling into two main categories: client-side architecture (JDBC proxy) and proxy architecture (DB proxy).
- Although there are many sharding middleware options today, the most active and mainstream open-source projects are MyCat and ShardingJDBC, so we focus on these two. Reference: shardingsphere.apache.org/document/cu…
| Type | Flowchart | Mainstream projects | Pros and cons |
|---|---|---|---|
| JDBC proxy | (diagram omitted) | ShardingSphere-JDBC | Pros: 1. Low operations cost. Cons: 1. Tied to one development language 2. Sharding is not transparent to the business side 3. Occupies many database connections |
| DB proxy | (diagram omitted) | MyCat 2, ShardingSphere-Proxy | Pros: 1. Transparent to the service, no business-logic changes needed 2. Supports heterogeneous clients. Cons: 1. The extra proxy hop adds latency 2. The proxy is a network single point 3. Heavy operations burden |
Conclusion: we finally decided on the JDBC-proxy approach with the ShardingSphere-JDBC component, because:
- For this table the query logic is simple, and the service has strict query-latency requirements
- Several projects in the company already use this component, so there is hands-on sharding experience to draw on
- It reduces operations pressure: we currently have no staff dedicated to operating a DB proxy
- The ShardingSphere open-source community is active and well documented
Digression
- For a table of file records like this, with low transaction requirements and simple queries, NoSQL (which naturally supports horizontal scaling) could be considered
- A distributed database such as TiDB is another option
Detailed design
1) Table-splitting scheme
- Table splitting comes in two flavors, vertical splitting and horizontal splitting. Since the goal this time is to solve the problem of too much data in a single table and to reduce the pressure on a single database, we choose horizontal splitting.
- Common forms of horizontal splitting:
  - Partitioned tables:
    - Pros: natively supported by MySQL and transparent to users; can break through the capacity bottleneck of a single table (data is physically stored in multiple files)
    - Cons: many restrictions, and the underlying partitioning logic is complex and hard for developers to control
  - Multiple tables in one database: similar to partitioning; does not solve the physical bottleneck at the host level
  - Multiple databases and tables: effectively relieves the performance bottleneck and pressure of a single machine and single database, breaking through the limits on I/O, connection count, and hardware resources
Conclusion: both partitioned tables and multiple tables in one database only address a single table holding too much data. Because the data is not distributed across different machines, they do little to reduce the pressure on the MySQL server (everything still competes for the same physical I/O, CPU, network, etc.). To relieve server pressure, splitting across databases and tables is required.
2) Sharding design
Sharding principles
- Do not shard unless you have to
- Minimize the number of shards a single SQL statement touches: the more shards a query crosses, the worse the overall performance; only widen the spread when necessary
- Distribute the shards as evenly as possible. Sharding rules must be chosen carefully and planned in advance, taking into account data growth patterns, data access patterns, shard affinity, and shard expansion
- Range sharding, enumeration sharding, and consistent-hash sharding all make later expansion easier
- In general, choose the sharding key based on the conditions of the most frequently executed SQL
Horizontal splitting
- A characteristic of the table being split: virtually every query condition includes the file_key field, and fileKey is itself a random UUID string. Hashing on the fileKey field therefore spreads the data well; the drawback is that IN (batch) queries can turn into cross-shard queries (see the routing sketch below).
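To keep batch lookups cheap, the keys of one IN query can be grouped by target shard so that each shard receives only its own keys. A minimal illustrative sketch, assuming the hash-plus-bitmask routing chosen later in this article; SHARD_COUNT and shardFor() are hypothetical, not the production code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchQueryRouter {
    static final int SHARD_COUNT = 32; // hypothetical; must be a power of two

    // Perturbed hashCode, masked down to a shard index.
    static int shardFor(String fileKey) {
        int h = fileKey.hashCode();
        return (h ^ (h >>> 16)) & (SHARD_COUNT - 1);
    }

    // Group the keys of one IN query by target shard, so each shard gets a
    // single smaller IN query instead of every shard scanning the full list.
    static Map<Integer, List<String>> groupByShard(List<String> fileKeys) {
        Map<Integer, List<String>> byShard = new HashMap<>();
        for (String key : fileKeys) {
            byShard.computeIfAbsent(shardFor(key), s -> new ArrayList<>()).add(key);
        }
        return byShard;
    }
}
```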
Hash function selection
- Common hash algorithms include MD5, SHA-1, HMAC, HMAC-MD5, and HMAC-SHA1
- Java's built-in hash function: hashCode()
  - Common collection classes such as HashMap use this method for hash computation
  - Combined with a perturbation function that mixes the high bits into the hash, it yields more uniform indexes: (h = key.hashCode()) ^ (h >>> 16)
  - High performance: averaged over repeated runs (1,000,000 iterations), hashCode takes about 10 ms, MD5 about 384 ms, and SHA-1 about 401 ms (a sketch of the perturbation function follows)
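For reference, this is the JDK 8 HashMap-style perturbation in full: XOR the high 16 bits of hashCode() into the low 16 bits so both halves influence the final index.

```java
// Minimal sketch of HashMap-style hash perturbation (as in JDK 8's HashMap.hash).
public final class ShardHash {
    private ShardHash() {
    }

    public static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
}
```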
- Consistent hashing
  - Pros: only part of the data has to be relocated when nodes are added or removed; good scalability
  - Cons:
    - A custom sharding algorithm is harder to develop
    - Using database tools for migration also carries extra cost
- Conclusion
  - In theory, consistent hashing reduces the data-migration cost of scaling out. In practice, current elastic-scaling tools (such as the Scaling tool provided by ShardingSphere) cannot migrate the current cluster in place; a new cluster must be provided, which means all the data ends up being migrated anyway
  - Node expansion is itself a very low-frequency operation; if the database and table counts are estimated sensibly up front, no further expansion should be needed for a long time after the split
  - So start the simple way: use hashCode() to hash the shard key and take the modulo
Modulo algorithm
- Plain modulo: mod() or hash % length
- Bit operation: hash & (length - 1)
  - Higher performance: the bit operation beats the modulo operation in benchmarks
  - Drawback: the node count must be a power of 2, otherwise the distribution is uneven (see the comparison sketch below)
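A small sketch of why the two forms agree, under the assumption that length is a power of two and the hash has been made non-negative:

```java
// Compare plain modulo with the bitmask trick.
public class ModuloDemo {
    public static void main(String[] args) {
        int length = 64;                          // table count, a power of two
        int h = "a-random-file-key-uuid".hashCode();
        h = (h ^ (h >>> 16)) & Integer.MAX_VALUE; // perturb, then clear the sign bit

        int byMod = h % length;                   // plain modulo
        int byAnd = h & (length - 1);             // bit operation
        System.out.println(byMod == byAnd);       // true while length is 2^n
    }
}
```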
3) Distributed primary keys
Once data is split across databases and tables, a globally unique ID must be considered.
- Database auto-increment IDs: introduce a global database dedicated to generating auto-increment IDs, fetching an ID before each insert
  - Pros: convenient and simple
  - Cons: single point of risk, single-machine performance bottleneck
- A database cluster with suitable auto-increment step sizes (for example, with two MySQL nodes, auto_increment_offset = 1/2 and auto_increment_increment = 2 yield interleaved odd/even sequences)
  - Pros: high availability, simple IDs
  - Cons: requires a separate database cluster
- Snowflake algorithm
  - Pros: high availability, easy to scale
  - Cons: machine-ID conflicts must be prevented when the service is deployed as a cluster
- UUID
  - Pros: generated locally, collisions practically impossible
  - Cons: too long and unordered (random inserts cause page splits in InnoDB), so unsuitable as a database primary key
Snowflake algorithm
ShardingSphere ships distributed key generators out of the box (snowflake / UUID); here we focus on the snowflake algorithm.
- Brief introduction: the snowflake algorithm is a distributed primary-key generation algorithm published by Twitter. It guarantees that keys from different processes do not collide and that keys from the same process are ordered.
- Within one process, uniqueness is guaranteed first by the time bits, and by the sequence bits when the time is the same. Because the time bits increase monotonically, and assuming the servers' clocks are roughly synchronized, the generated keys can be considered broadly ordered in a distributed environment, which keeps inserts into the index field efficient.
- Detailed structure: 1 sign bit + 41-bit timestamp + 10-bit worker ID + 12-bit sequence number (the standard Twitter layout)
- The snowflake duplicate-ID problem: the worker (process) ID bits are the key (a minimal implementation sketch follows this list)
  - Preconditions for an ID conflict:
    - The service is deployed as a cluster and some machines share the same worker-ID bits
    - The service has some level of concurrency; without it, duplicates cannot be triggered
    - IDs are generated in the same millisecond with the same sequence number
  - Details: juejin.cn/post/700758…
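A minimal single-class snowflake sketch, assuming the standard Twitter layout described above (1 sign bit, 41-bit timestamp, 10-bit worker ID, 12-bit sequence). Production use also needs clock-rollback handling and externally managed, guaranteed-unique worker IDs:

```java
public class Snowflake {
    private static final long EPOCH = 1288834974657L;             // Twitter's custom epoch
    private static final long WORKER_BITS = 10L;
    private static final long SEQ_BITS = 12L;
    private static final long MAX_WORKER = ~(-1L << WORKER_BITS); // 1023
    private static final long SEQ_MASK = ~(-1L << SEQ_BITS);      // 4095

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public Snowflake(long workerId) {
        if (workerId < 0 || workerId > MAX_WORKER) {
            throw new IllegalArgumentException("workerId out of range: " + workerId);
        }
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        if (ts < lastTimestamp) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (ts == lastTimestamp) {
            // Same millisecond: bump the sequence; if it wraps, spin to the next ms.
            sequence = (sequence + 1) & SEQ_MASK;
            if (sequence == 0) {
                while ((ts = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = ts;
        return ((ts - EPOCH) << (WORKER_BITS + SEQ_BITS))
                | (workerId << SEQ_BITS)
                | sequence;
    }
}
```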
4) Data synchronization
Since the table being split sits on the core path, there must be a rollback path: if anything goes wrong we must be able to switch back to the old database. That means data must be synchronized bidirectionally until the sharding scheme has proven stable, and the old (master) database must keep the full data set.
- Use the official elastic-scaling tool: ShardingSphere-Scaling
  - The tool is currently in beta and risky for production use
  - It only supports one-way synchronization
- DTS + ShardingSphere-Proxy
  - DTS bidirectional synchronization (master/slave delay within 1 s)
  - However, Ali's official technical support told us that DTS only supports bidirectional synchronization for MySQL -> MySQL; for MySQL -> proxy it can only synchronize one way
- Double writes in the business layer
  - Existing data: migrate from the old database to the new one with database migration tools
  - Incremental data: modify the business code to write synchronously to both the old and new databases
Conclusion: given the current situation, the only way to guarantee bidirectional data synchronization is to modify the service and double-write.
5) Smooth rollout
The first rule of a database split is not to affect the service, so the rollout must be careful, with rollback ready at any time.
Before going live, because new components are introduced and the core flow involves batch queries that may cross shards, it is best to load-test the core flow first to make sure there are no performance problems.
Smooth rollout process (a routing sketch for the version switch follows the list):
- Run a full data synchronization from the old database to the new one
- Add dynamic version switching to the service, release the new service, and configure version 0, whose read and write logic is identical to the old service
  - Purpose: confirm the newly released build runs normally, keeping the risk controllable
- Once the service is confirmed healthy, configure version 1: double-write, old first then new, with all reads from the old database (run this version for 1 day)
  - Purpose: verify that writes to the new database work, while the old database still holds the complete data
  - Failures to write to the new database are degraded so they do not affect user requests, but the exceptions are reported to the developers
  - This resembles the current master/slave mode: a failing slave must not affect writes to the master
- Backfill into the new database any data not yet synchronized before double-writing began (i.e., the incremental data produced between version 0 and version 1)
  - Purpose: ensure the new database ends up with the complete data set
- Configure version 2: double-write, new first then old, with all reads from the new database (since all read traffic switches to the new database, run this version for 2 days)
  - Purpose: verify that reads from the new database work
  - In this version the old and new databases must stay consistent: a request succeeds only if the synchronous double-write succeeds
  - The version-1 degradation for failed writes to the new database is removed
  - If the new database misbehaves at this point, we can manually switch back to version 1; some requests may have failed, but version 1 already verified that writes to the new database are fine, so a problem here most likely means a read or high-availability issue with the new database
  - This version can also run longer as needed, since the old database will be taken offline afterwards
- Configure version 3: remove the double-write mechanism, reading and writing only the new database; the old database is abandoned
  - Purpose: reach the final state, a smooth migration from the old database to the new one
  - Depending on business needs, it may be more prudent to keep an asynchronous one-way sync from the new database back to the old one (for example via MQ)
  - In theory, running version 2 longer guarantees a smooth transition, but synchronous double-writing has side effects: it raises the high-availability requirements on the databases and certainly reduces performance
- After version 3 has run in production for a while (perhaps a few days) without problems, remove the code for the smooth-transition logic
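Below is a hedged sketch of the dynamic version switch described above. All names here (FileInfoDao, FileInfo, switchTo) are illustrative assumptions, not our real classes; in practice the version value would come from a dynamic configuration center:

```java
public class FileInfoWriter {
    enum SwitchVersion { V0_OLD_ONLY, V1_OLD_FIRST, V2_NEW_FIRST, V3_NEW_ONLY }

    private volatile SwitchVersion version = SwitchVersion.V0_OLD_ONLY;
    private final FileInfoDao oldDao;
    private final FileInfoDao newDao;

    FileInfoWriter(FileInfoDao oldDao, FileInfoDao newDao) {
        this.oldDao = oldDao;
        this.newDao = newDao;
    }

    void switchTo(SwitchVersion v) { this.version = v; } // driven by dynamic config

    void save(FileInfo info) {
        switch (version) {
            case V0_OLD_ONLY -> oldDao.save(info);
            case V1_OLD_FIRST -> {            // old DB is the source of truth
                oldDao.save(info);
                try {
                    newDao.save(info);        // degrade on failure: request still succeeds
                } catch (Exception e) {
                    alertDeveloper(e);
                }
            }
            case V2_NEW_FIRST -> {            // both writes must succeed (consistency)
                newDao.save(info);
                oldDao.save(info);
            }
            case V3_NEW_ONLY -> newDao.save(info);
        }
    }

    FileInfo read(String fileKey, int appId) {
        boolean readNew = version == SwitchVersion.V2_NEW_FIRST
                || version == SwitchVersion.V3_NEW_ONLY;
        return readNew ? newDao.find(fileKey, appId) : oldDao.find(fileKey, appId);
    }

    private void alertDeveloper(Exception e) { /* hook into monitoring/alerting */ }

    interface FileInfoDao {
        void save(FileInfo info);
        FileInfo find(String fileKey, int appId);
    }

    record FileInfo(String fileKey, int appId) {}
}
```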
Summary
- Choose database and table sharding to break through the performance bottleneck of a single database
- Choose fileKey as the sharding key, hash it with hashCode(), then use a bit operation to compute which database and table a row belongs to
- Use the snowflake algorithm for the new tables' primary keys, taking care to configure the worker (process) IDs dynamically and correctly to prevent duplicate IDs
- Use double writes in the business layer to keep the old and new databases consistent
- Implement dynamic multi-version switching to verify the new database's reads and writes step by step, achieving a smooth transition
Conclusion
- It has been a great honor to lead the design and implementation of the sharding scheme for this core project, and to share and summarize some of the experience here
- The design came from consulting various materials and documents, weighing the business situation, and internal review and discussion. If you see deficiencies or have suggestions, please leave a comment for discussion
- I have recently also published columns on image processing and articles on audio/video security, distilled from project practice; interested readers can find them on my home page
- This article is original work, and I hold the content and presentation to a fairly high standard. Creating it was not easy, so if you found it helpful, a like or a follow would be appreciated ~
References
- Dianping order system sharding practice
- Practice of the new-generation database TiDB at Meituan
- JavaGuide – Read/write splitting & database/table sharding
- "Sharding databases and tables?" Choose the approach and process carefully, or it will spiral out of control
- ShardingSphere official documentation