When the number of MySQL records in a single table is too large, the performance of add, delete, alter, and query deteriorates dramatically
Single table optimization
Unless the single table data will continue to grow in the future, do not consider splitting at first. Splitting will introduce various complexity of logic, deployment, operation and maintenance. Generally, tables with integer values under ten million, and strings under five million are not a big problem.
In fact, in many cases, the performance of MySQL single table still has a lot of optimization space, and even can normally support more than ten million levels of data volume.
field
- Try to use
TINYINT
,SMALLINT
,MEDIUM_INT
As an integer type instead ofINT
If it is not negative, plusUNSIGNED
VARCHAR
The length allocates only the space that is really needed- Use enumerations or integers instead of string types
- Try to use
TIMESTAMP
Rather thanDATETIME
- Do not have too many fields in a single table, recommend less than 20
- Avoid the use of
NULL
Fields that are difficult to query optimize and take up extra index space - Use an integer to store IP addresses
The index
- The more the better, the more the better, according to the query to create specific, consider in
WHERE
和ORDER BY
The columns involved in the command can be indexed according toEXPLAIN
To see if an index or full table scan is used - Should be avoided as far as possible
WHERE
Clause for the fieldNULL
Value, otherwise it will cause the engine to abandon the index and perform a full table scan - Fields with a sparse distribution of values, such as “gender,” have only two or three values
- Character fields are indexed by prefix only
- Character fields should preferably not be primary keys
- No foreign keys, the program guarantees the constraint
- As far as possible need not
UNIQUE
, guaranteed by the program - Use multi-column indexes in the same order as the query criteria, and remove unnecessary single-column indexes
The query SQL
- Slow SQL can be found by enabling slow query logging
- No column operations:
SELECT id WHERE age+1=10
, any operation on a column will result in a table scan, which includes database tutorial functions, computed expressions, and so on. Move the operation to the right of the equals sign whenever possible - SQL statements are as simple as possible: an SQL statement can be run on only one CPU; Large statement disassembly small statement, reduce lock time; One large SQL can block the entire library
- Don’t have to
SELECT *
OR
Rewrite intoIN
:OR
The efficiency is n,IN
Is log(n) efficient,IN
You are advised to keep the number of values within 200- No functions and triggers, implemented in the application
- avoid
%xxx
Type the query - To use less
JOIN
- Use the same type for comparison, such as ‘123’ to ‘123’, 123 to 123
- Try to avoid
WHERE
Clause! =
或<>
Otherwise, the engine abandons the index for a full table scan - For continuous values, use
BETWEEN
Don’t have toIN
:SELECT id FROM t WHERE num BETWEEN 1 AND 5
- Table data do not take the full table, use
LIMIT
Do pagination and keep the number of pages small
engine
MyISAM and InnoDB are two widely used engines:
MyISAM
The MyISAM engine is the default engine for MySQL 5.1 and earlier. It features:
- Row locks are not supported. All tables that need to be read are locked when read, and all tables are locked when written
- Transactions not supported
- Foreign keys are not supported
- Security recovery after a crash is not supported
- New records can be inserted into a table while a query is being read
- support
BLOB
和TEXT
Index of the first 500 characters of, support full-text index - Delay index update is supported to greatly improve write performance
- Tables that are not modified can be compressed to greatly reduce disk space usage
InnoDB
InnoDB became the default index after MySQL 5.5 and features:
- Line locking is supported and MVCC is used to support high concurrency
- Support transactions
- Support foreign keys
- Supports secure recovery after crashes
- Full-text indexing is not supported
In general, MyISAM is good for SELECT intensive tables, while InnoDB is good for INSERT and UPDATE intensive tables
System tuning parameters
You can use the following tools to benchmark:
- Sysbench: a modular, cross-platform, and multithreaded performance testing tool
- Iitable-mysql: insert performance test tool based on Java mysql /Percona/MariaDB index
- TPCC -mysql: TPC-C test tool developed by Percona
You can refer to official documents for details about tuning parameters. Here are some important parameters:
- Backlog: The backlog value indicates how many requests can be stored on the stack for a short time before MySQL temporarily stops answering new requests. That is, if MySql connection data reaches maxconnections, incoming requests will be stored on the backlog waiting for a connection to release resources. The backlog is called the backlog. If the number of waiting connections exceeds back_log, no connection resources will be granted. It can be raised from the default 50 to 500
- Wait_timeout: indicates the idle time of the database connection. The idle connection occupies memory resources. It can be reduced from the default 8 hours to half an hour
- Maxuserconnection: Specifies the maximum number of connections. The default value is 0
- Thread_concurrency: set the number of concurrent threads to twice the number of CPU cores
- Skipnameresolve: disables DNS resolution for external connections, eliminating the DNS resolution time, but requires all remote hosts to use IP addresses for access
- Keybuffersize: indicates the cache size of the index block. Increasing the cache size will improve the index processing speed and has the greatest impact on MyISAM table performance. For about 4G memory, the value can be 256M or 384M
show status like'key_read%'
Keep key_reads/key_read_requests below 0.1% - Innodbbufferpool_size: Cache data blocks and index blocks, which have the greatest impact on InnoDB table performance. Through the query
show status like'Innodb_buffer_pool_read%'
, guarantee (Innodb_buffer_pool_read_requests — Innodb_buffer_pool_reads)/Innodb_buffer_pool_read_requests as high as possible - Innodbadditionalmempoolsize: InnoDB storage engine is used to store data dictionary information and some internal data structures. When there are many database objects, adjust the size of this parameter to ensure that all data can be stored in memory to improve access efficiency. When too small, MySQL will record Warning information in the database error log, so you need to adjust the size of this parameter
- Innodblogbuffer_size: the buffer used by InnoDB to store transaction logs of the engine. Generally, it is recommended not to exceed 32MB
- Querycachesize: caches a ResultSet in MySQL, which is the ResultSet of an SQL statement
select
Statements. Any change to a table’s data results in all references to the tableselect
Statements inQuery CacheThe cache data in. So, when our data changes very frequently, useQuery CacheYou may lose more than you gain. Adjust according to hit ratio (Qcache_hits/(Qcache_hits + Qcache_inserts) * 100)). Generally, large size is not recommended; 256MB May be sufficient, and large configuration static data can be scaled appropriately. You can do this by commandshow status like'Qcache_%'
Viewing the Current SystemQuery CatchUse the size - Readbuffersize: the size of the MySql readbuffer. A sequential scan of the table will allocate a read buffer, and MySql will allocate a memory buffer for it. If sequential scan requests to the table are very frequent, you can improve performance by increasing the value of this variable as well as the memory buffer size
- Sortbuffersize: buffersize used by MySql to perform sorting. If you want to increase
ORDER BY
First, see if you can get MySQL to use indexes instead of the extra sort phase. If not, try increasing the size of the sortBufferSize variable - Readrndbuffer_size: random read buffer size of MySql. When rows are read in any order (for example, in sort order), a random read cache is allocated. When sorting a query, MySql scans the buffer first to avoid disk search and improve the query speed. If a large amount of data needs to be sorted, you can adjust the value to a higher value. MySql allocates this buffer space for each client connection, so try to set this value appropriately to avoid excessive memory overhead
- Record_buffer: Each thread doing a sequential scan allocates a buffer of this size for each table it scans. If you do a lot of sequential scans, you may want to increase this value
- Threadcachesize: Stores threads that are not currently associated with a connection but are ready to service a later connection, allowing you to quickly respond to connected thread requests without creating a new one
- Tablecache: similar to threadcache_size, but used to cache table files. Not very useful for InnoDB, mainly used for MyISAM
Upgrade the hardware
Scale up, depending on whether MySQL is CPU intensive or I/O intensive, can significantly improve MySQL performance by increasing CPU and memory, and using SSDS.
Reading and writing separation
It is also a commonly used optimization, read from the library and write from the master library, generally do not use double master or multiple master to introduce a lot of complexity, try to use other schemes in the paper to improve performance. At the same time, many split solutions also consider read and write separation.
The cache
Caching can occur at these levels:
- MySQL Internal: Describes related Settings in system tuning parameters
- Data access layer: For example, MyBatis caches SQL statements, whereas Hibernate caches down to a single record. The cached objects are persistenceObjects
- Application service layer: more precise control and more implementation strategies can be achieved for cache through programming means. The object cached here is the DataTransferObject DataTransferObject (DTO).
- Web layer: Cache Web pages
- Browser client: the client’s cache
Caches can be added at one or more levels in combination depending on the situation. The cache implementation of the service layer is mainly introduced here. At present, there are two main ways:
- Write Through: After data is written to the database, the cache is updated to ensure consistency between the database and cache. This is how most current application caching frameworks such as Spring Cache work. This implementation is very simple, synchronized well, but not very efficient.
- Write Back: When data is written to the database, only the cache is updated, and then the cached data is asynchronously synchronized to the database in batches. This kind of implementation is more complex, requires more application logic, and may cause the database and cache synchronization, but the efficiency is very high.
Table partitioning
The partition introduced in MySQL 5.1 is a simple horizontal split that requires users to add partition parameters when building a table and is transparent to the application without changing the code
To the user, a partitioned table is a separate logical table, but the underlying table is composed of multiple physical sub-tables. The code that implements partitioning is actually a black box that encapsulates the underlying table through object encapsulation of a set of underlying tables, but is a black box that completely encapsulates the underlying table for the SQL layer. The way MySQL implements partitioning also means that indexes are defined as subtables of partitions, with no global indexes.
The user’s SQL statement needs to be optimized for the partition table. The column of the partition condition should be included in the SQL condition to locate the query to a small number of PARTITIONS, otherwise all PARTITIONS will be scanned. You can check which PARTITIONS a CERTAIN SQL statement will fall on by EXPLAIN PARTITIONS. In order to perform SQL optimization, as shown in the following figure, 5 records fall on two partitions:
The benefits of partitioning are:
- More data can be stored in a single table
- The data in partitioned tables is easier to maintain, and large amounts of data can be removed in batches by cleaning the entire partition, or new partitions can be added to support newly inserted data. In addition, a separate partition can be optimized, checked, repaired, and so on
- Partial queries can be determined from query criteria to fall on only a few partitions, which can be very fast
- Partitioned table data can also be distributed across different physical devices, making efficient use of multiple hardware devices
- You can use partitioned table latencies to avoid specific bottlenecks, such as mutually exclusive access for InnoDB single indexes and inode lock contention for ext3 file systems
- Individual partitions can be backed up and restored
Limitations and disadvantages of partitioning:
- A table can have a maximum of 1024 partitions
- If a partitioned field has columns with a primary key or unique index, all primary key and unique index columns must be included
- Partitioned tables cannot use foreign key constraints
- A NULL value invalidates partition filtering
- All partitions must use the same storage engine
Type of partition:
- RANGE partition: Allocates multiple rows to a partition based on column values belonging to a given contiguous interval
- LIST partitioning: Similar to partitioning by RANGE, except that LIST partitioning is selected based on column values matching a value from a discrete set of values
- HASH partition: A partition selected based on the return value of a user-defined expression computed using the column values of the rows to be inserted into the table. This function can contain any valid expression in MySQL that produces a non-negative integer value
- KEY partitioning: Similar to HASH partitioning, except that KEY partitioning supports only one or more columns and the MySQL server provides its own HASH function. One or more columns must contain integer values
Partitioning is suitable for the following scenarios:
The most suitable scene data has a strong time series, so it can be partitioned by time, as shown below:
Adding time range conditions to query is very efficient, and it is easy to delete unnecessary historical data in batches.
If the data has an obvious hot spot and is rarely accessed, you can place the hot spot in a separate partition so that the data in that partition has a chance to be cached in memory and only a small partitioned table is accessed for query purposes, making efficient use of indexes and caching
In addition, MySQL has an early simple partition implementation – merge table, which is more limited and lacks optimization. It is not recommended to use the new partition mechanism instead
The vertical resolution
Vertical segmentation is based on the correlation of data tables in the database to split. For example, if a database has both user data and order data, vertical split can put user data into the user library and order data into the order library. Vertical split table is a method of vertical split of data tables. It is common to split a large multi-field table according to common and unusual fields. The number of data records in each table is generally the same, but the fields are different, and primary key association is used
For example, the original user table is:
The vertical split is:
The advantages of vertical splitting are:
- You can make rows smaller so that more data can be stored in one Block, resulting in fewer I/O times (fewer blocks read per query)
- To maximize the use of the Cache, you can place infrequently changed fields together and frequently changed fields together in vertical split
- Simple data maintenance
The disadvantage is that:
- The primary key is redundant and redundant columns need to be managed
- The JOIN operation that causes a table JOIN (which increases CPU overhead) can reduce database stress by joining on the business server
- There is still the problem of large amount of single table data (horizontal split)
- Transaction complexity
Horizontal split
An overview of the
Horizontal split is a strategy to store data in fragments, which are divided into tables and libraries within a database. Each piece of data is dispersed to different MySQL tables or libraries to achieve a distributed effect, which can support a very large amount of data. The preceding table partition is essentially a special kind of in-library partition
Table, just simply solve the problem of the single table data is too large, because there is no table of the data distribution to different machines, so to alleviate the pressure of the MySQL server, the role of is not much, everyone still compete on the same physical IO, CPU, network, this will be addressed by depots
If the user table was split horizontally, the result would be:
In practice, this tends to be a combination of vertical split and horizontal split, with Users_A_M and Users_N_Z split into Users and UserExtras for a total of four tables
The advantages of horizontal splitting are:
- There are no performance bottlenecks of single repository big data and high concurrency
- Application side modification is less
- The stability and load capacity of the system are improved
The disadvantage is that:
- Shard transaction consistency is difficult to resolve
- Cross-node Join has poor performance and complicated logic
- It is very difficult to expand data for many times and maintain it
Subdivision principle
- If you can, refer to single table optimization
- The number of fragments should be as small as possible, and the fragments should be evenly distributed on multiple data nodes. The more cross-fragments a query has, the worse its overall performance is. Although it is better than the result of all data in one fragment, it should only be expanded when necessary to increase the number of fragments
- The sharding rules need to be carefully selected and planned in advance. When selecting sharding rules, you need to consider the growth mode of data, access mode of data, correlation of sharding, and expansion of sharding. The latest sharding strategies are range sharding, enumeration sharding, and consistent Hash sharding, which are conducive to expansion
- Try not to have SQL in a transaction that spans multiple shards. Distributed transactions are always a problem
- Query conditions should be optimized and avoided
Select *
A large number of result sets consume a large amount of bandwidth and CPU resources. Avoid returning a large number of result sets and create indexes for frequently used query statements. - Reduce the likelihood of cross-library joins through data redundancy and table partition dependencies
Here especially emphasize the selection problem of fragmentation rules, if a table of data has the obvious time characteristic, such as orders, transaction records, etc., are often more appropriate shard with time range, because has the timeliness of data, we tend to focus on the recent data, often with time field to filter query conditions, a better solution is to, Currently active data is sharded with a short span, while historical data is stored with a long span.
Generally speaking, the choice of shard depends on the condition of the most frequently queried SQL, because the query SQL without any Where statement will traverse all shards and has the worst performance. Therefore, the more such SQL, the greater the impact on the system, so we should try to avoid such SQL.
The solution
Due to the complicated logic involved in horizontal split, there are many mature solutions. These programmes fall into two broad categories:
- Client architecture
- Agent architecture
Client architecture
By modifying the Data access layer, such as JDBC, Data Source and MyBatis, it can manage multiple Data sources through configuration, directly connect to the database, and complete the fragmentation integration of Data within the module, which is generally presented in the form of Jar packages
Here is an example of a client architecture:
You can see that sharding is implemented in conjunction with the application server by modifying the Spring JDBC layer
The advantages of the client architecture are:
- Use directly connected databases to reduce downtime risks associated with peripheral system dependencies
- Low integration cost, no additional o&M components
The disadvantage is that:
- Limited to the database access layer, general scalability, for more complex systems may be inadequate
- Placing the stress of sharding logic on the application server creates additional risk
Agent architecture
Unified management of all data sources and data fragmentation integration through independent middleware, back-end database cluster transparent to front-end applications, independent deployment and operation of agent components
Here is an example of a proxy architecture:
Proxy components exist in clusters to avoid single points and may require service components such as Zookeeper to manage them
The advantages of the proxy architecture are:
- Able to handle very complex requirements, not limited by the original implementation of the database access layer, scalability
- Transparent to the application server without any additional load
The disadvantage is that:
- Independent proxy middleware needs to be deployed and operated, which is costly
- The application needs to go through the proxy to connect to the database. An extra hop on the network leads to performance loss and additional risks
Comparison of different schemes
The framework | producers | Architectural model | Support database | depots | table | Reading and writing separation | External dependencies | Whether open source | Implementation language | Support language | Lot number of star |
---|---|---|---|---|---|---|---|---|---|---|---|
MySQL Fabric | MySQL official | Agent architecture | MySQL | There are | There are | There are | There is no | is | python | unlimited | 35 |
Cobar | alibaba | Agent architecture | MySQL | There are | There is no | There is no | There is no | is | Java | unlimited | 1287 |
Cobar Client | alibaba | Client architecture | MySQL | There are | There is no | There is no | There is no | is | Java | Java | 344 |
TDDL | taobao | Client architecture | unlimited | There are | There are | There are | Diamond | Open source only | Java | Java | 519 |
Atlas | Qihoo 360 | Agent architecture | MySQL | There are | There are | There are | There is no | is | C | unlimited | 1941 |
Heisenberg | Baidu Xiong Zhao | Agent architecture | MySQL | There are | There are | There are | There is no | is | Java | unlimited | 197 |
TribeDB | personal | Agent architecture | MySQL | There are | There are | There are | There is no | is | NodeJS | unlimited | 126 |
Sharding JDBC | dangdang | Client architecture | MySQL | There are | There are | There are | There is no | is | Java | Java | 1144 |
Shark | personal | Client architecture | MySQL | There are | There are | There is no | There is no | is | Java | Java | 84 |
KingShard | personal | Agent architecture | MySQL | There are | There are | There are | There is no | is | Golang | unlimited | 1836 |
OneProxy | Civilian software | Agent architecture | MySQL | There are | There are | There are | There is no | no | The unknown | unlimited | The unknown |
MyCat | community | Agent architecture | MySQL | There are | There are | There are | There is no | is | Java | unlimited | 1270 |
Vitess | Youtube | Agent architecture | MySQL | There are | There are | There are | There is no | is | Golang | unlimited | 3636 |
Mixer | personal | Agent architecture | MySQL | There are | There are | There is no | There is no | is | Golang | unlimited | 472 |
JetPants | Tumblr | Client architecture | MySQL | There are | There are | There is no | There is no | is | Ruby | Ruby | 957 |
HibernateShard | Hibernate | Client architecture | unlimited | There are | There are | There is no | There is no | is | Java | Java | 57 |
MybatisShard | MakerSoft | Client architecture | unlimited | There are | There are | There is no | There is no | is | Java | Java | 119 |
Gizzard | Agent architecture | unlimited | There are | There are | There is no | There is no | is | Java | unlimited | 2087 |
So many programs, how to choose? Think of it in the following way:
- Determine whether to use the proxy or client architecture. Small – to medium-sized or simple scenarios tend to choose the client architecture, while complex scenarios or large-scale systems tend to choose the agent architecture
- Whether specific functions are met, such as cross-node
ORDER BY
, then support this function is a priority - Don’t consider a product that hasn’t been updated in a year, which means development is stagnant, or even without maintenance or technical support
- Had better press big company -> community -> small company -> individual such production square order will choose
- Choose good reputation, such as GitHub star count, user quantity and quality, and user feedback
- Open source is preferred, often projects with special needs may need to modify the source code
According to the above ideas, the following options are recommended:
- Client architecture: ShardingJDBC
- Agent architecture: MyCat or Atlas
MySQL compatible and horizontally scalable database
There are also some open source databases that are compatible with the MySQL protocol, such as:
- TiDB
- Cubrid
However, there is still a gap between its industrial quality and MySQL, and large operation and maintenance investment is required. If you want to migrate the original MySQL to a new database that can be horizontally expanded, you can consider some cloud databases:
- Ali cloud PetaData
- Ali cloud OceanBase
- Tencent cloud DCDB
NoSQL
Sharding on MySQL is a dance in shackles. In fact, many large tables themselves do not require ACID and MySQL RDBMS. Consider migrating these tables to NoSQL to completely solve the problem of horizontal scaling.
- Log, monitoring, and statistics data
- Unstructured or weakly structured data
- The transaction requirements are not strong and there is not much data associated with the operation