MySQL Performance Optimization (Collector's Edition)

When it comes to MySQL performance tuning, the goal is usually to make queries faster. A query passes through many stages, and each stage takes time, so reducing the total query time means examining every stage in turn.

Related reading: Interviewer: How does MySQL execute a query?

Configuration optimization

The first stage is the client connecting to the server. What performance problems can occur here? The application may fail to obtain a connection because the server has run out of connections, in which case MySQL reports error 1040: Too many connections. A shortage of connections can be addressed from two sides.

From the server

We can increase the number of connections available to the server.

If multiple applications or multiple requests are accessing the database at the same time and the number of connections is insufficient, we can:

Modify configuration parameters to increase the number of available connections and change the size of max_connections:

SHOW VARIABLES LIKE 'max_connections'; -- view the current limit; raise max_connections when many applications connect at once

Alternatively, release inactive connections promptly. The default idle timeout for both interactive and non-interactive clients is 28800 seconds (8 hours). We can lower this value.

SHOW GLOBAL VARIABLES LIKE 'wait_timeout'; -- release idle connections promptly, but take care not to drop connections that a connection pool is still using

From the client

On the client side, we can reduce the number of connections taken from the server by introducing a connection pool, so that connections are reused.

Options include the pool built into the ORM layer (MyBatis ships with one) or a dedicated connection pool such as Druid, HikariCP (the default in Spring Boot 2.x), DBCP, or C3P0.
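
To make the idea of connection reuse concrete, here is a minimal sketch of a fixed-size pool. It is not how Druid or HikariCP are implemented; the `ConnectionPool` class is hypothetical, and `sqlite3` stands in for a MySQL driver purely so that the example runs without a server.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """A tiny fixed-size pool: connections are created once and reused."""

    def __init__(self, size, connect):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=1.0):
        # Block until a connection is free; raises queue.Empty on timeout
        # instead of opening yet another server connection.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the connection back rather than closing it.
            self._idle.put(conn)


# sqlite3 is an assumption made for runnability; a real pool would wrap a MySQL driver.
pool = ConnectionPool(size=2, connect=lambda: sqlite3.connect(":memory:"))

seen = set()
for _ in range(10):
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()
        seen.add(id(conn))

print(len(seen))  # 2: ten requests were served by the same two connections
```

Because the pool never holds more than `size` connections, the server-side connection count stays bounded no matter how many requests the application handles.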

Besides sizing the server connection limit and the client connection pool sensibly, what else can we do to reduce the number of client connections to the database server? Let’s look at optimization at the architecture level.

Architecture optimization

Caching

Under very high concurrency, having no cache causes two problems. On the one hand, the database comes under heavy pressure. On the other hand, the application itself reads and writes data more slowly. We can solve this with a third-party caching service such as Redis.

Running a separate cache service is an architectural optimization.
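
One common way to put a cache in front of the database is the cache-aside pattern, sketched below. The article does not prescribe this pattern; it is shown as one plausible approach. The `CacheAside` class and `load_user` function are hypothetical, and a plain dict stands in for Redis.

```python
import time


class CacheAside:
    """Read through the cache on a miss, invalidate on write."""

    def __init__(self, load_from_db, ttl_seconds=60.0):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value); a dict stands in for Redis

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]  # cache hit: the database is not touched
        value = self._load(key)  # cache miss: fall back to the database
        self._store[key] = (time.monotonic() + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call this after updating the row so readers do not see stale data.
        self._store.pop(key, None)


db_calls = []


def load_user(user_id):
    db_calls.append(user_id)  # record every trip to the "database"
    return {"id": user_id, "name": f"user-{user_id}"}


cache = CacheAside(load_user)
cache.get(42)
cache.get(42)
print(len(db_calls))  # 1: the second read was served from the cache
cache.invalidate(42)
cache.get(42)
print(len(db_calls))  # 2: after invalidation the database is read again
```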

What other optimizations can be made at the architectural level to reduce the read-write burden on a single database server?

Master-slave replication

If a single database server cannot meet the access demand, we can deploy a database cluster.

A cluster inevitably faces the problem of data consistency between nodes. If we read from and write to several database nodes at the same time, how do we keep the data on all nodes consistent?

This is where replication comes in. The node that is replicated from is called the master, and the node that receives the copy is called the slave.

How is master-slave replication implemented? As mentioned earlier, update statements are recorded in the binlog, which is a logical log. The slave fetches the master’s binlog, parses the SQL statements in it, and executes them locally, which keeps the master and slave data consistent.

Related reading: Interviewer: Talk about redo logs and binlogs in MySQL

Three threads are mainly involved: the binlog thread, the I/O thread, and the SQL thread.

Binlog thread: responsible for writing data changes on the master to the binary log.
I/O thread: responsible for reading the binary log from the master and writing it to the relay log on the slave.
SQL thread: responsible for reading the relay log, parsing out the data changes already executed on the master, and replaying them on the slave.

Below are the three threads involved in master-slave replication.
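
The hand-off between the three threads described above can be modeled with a toy simulation. This is only an illustration of the data flow (binary log, then relay log, then replay), not of MySQL's actual implementation; the `Master` and `Slave` classes are hypothetical and the "threads" are plain method calls.

```python
class Master:
    def __init__(self):
        self.data = {}
        self.binlog = []  # ordered list of change events

    def write(self, key, value):
        self.data[key] = value
        self.binlog.append((key, value))  # every change is logged


class Slave:
    def __init__(self):
        self.data = {}
        self.relay_log = []
        self.fetched = 0  # binlog position already copied by the I/O step
        self.applied = 0  # relay-log position already replayed by the SQL step

    def io_step(self, master):
        # I/O thread: copy new binlog events into the local relay log.
        new_events = master.binlog[self.fetched:]
        self.relay_log.extend(new_events)
        self.fetched += len(new_events)

    def sql_step(self):
        # SQL thread: replay relay-log events that have not been applied yet.
        for key, value in self.relay_log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.relay_log)


master, slave = Master(), Slave()
master.write("stock:1", 10)
master.write("stock:1", 9)
lagging = dict(slave.data)  # snapshot taken before replication has run
slave.io_step(master)
slave.sql_step()
print(lagging, slave.data)  # {} {'stock:1': 9}
```

The empty snapshot taken before the two steps run is the replication lag that read/write separation has to account for.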

Read/write separation

With master-slave replication in place, data is written only to the master node, and read requests are handled by the slave nodes. We call this scheme read/write separation.

Read/write separation improves performance because:

The master handles writes and the slaves handle reads, which greatly reduces lock contention.
Slaves can use MyISAM to improve query performance and save system overhead.
It adds redundancy and improves availability.

Read/write separation relieves the access pressure on the database server to some extent, but special attention must be paid to data consistency between master and slave.
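
A minimal sketch of the routing decision behind read/write separation follows. The `ReadWriteRouter` class and the node names are hypothetical, and real middleware parses SQL far more carefully (for example, a `SELECT ... FOR UPDATE` would also need to go to the master). Keeping reads inside a transaction on the master is one simple way to sidestep the consistency problem mentioned above.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads over the slaves."""

    def __init__(self, master, slaves):
        self._master = master
        self._slaves = itertools.cycle(slaves)  # simple round-robin

    def route(self, sql, in_transaction=False):
        is_read = sql.lstrip().upper().startswith("SELECT")
        # Reads inside a transaction stay on the master so that they
        # see the transaction's own, not-yet-replicated writes.
        if is_read and not in_transaction:
            return next(self._slaves)
        return self._master


router = ReadWriteRouter("master", ["slave-1", "slave-2"])
print(router.route("SELECT * FROM orders WHERE id = 1"))               # slave-1
print(router.route("select * from orders where id = 2"))               # slave-2
print(router.route("UPDATE orders SET status = 'paid' WHERE id = 1"))  # master
print(router.route("SELECT * FROM orders WHERE id = 1", in_transaction=True))  # master
```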

Even with master-slave replication, if a single master node or a single table stores too much data (for example, a table with hundreds of millions of rows), single-table query performance will still decline. We then need to split the data of a single database node further, which is known as splitting databases and tables (sharding).

Splitting databases and tables

Here we take a mall system as an example to explain how the database architecture evolves step by step.

Single application single database

As shown in the figure above, the mall system includes the portal (home page) module, user module, order module, inventory module, and so on. All modules share one database, which usually contains many tables. Because the user base was small, this architecture worked perfectly well in the early days.

Multiple applications and single database

As the system keeps iterating, the code base grows, the architecture becomes bloated, and access pressure gradually increases, so splitting the system becomes imperative. To keep the business running smoothly, the architecture is restructured in several stages.

In the first stage, the monolithic mall system is divided into sub-services by functional module, such as the portal service, user service, order service, and inventory service.

As shown in the figure above, multiple services share a database so that the underlying database access logic is left untouched and the impact is minimized.

Multiple applications and multiple databases

As the business grows, the database eventually becomes the bottleneck, and sharing one database among multiple services is no longer feasible. We need to move the tables belonging to each service into a separate database, which is what splitting the database means.

A single database can support only a limited amount of concurrency. Splitting it into multiple databases removes the competition between services and improves their performance.

As shown in the figure above, separating one large database into several small ones, one per service, is a necessary step once the system reaches a certain stage.

The same holds for a microservice architecture: if only the applications are split and the database is not, the fundamental problem remains, and the whole system easily hits a bottleneck.

Splitting tables

If the system is growing fast, as a mall system might, the daily order volume may reach hundreds of thousands. The order table then grows especially quickly, and beyond a certain size query efficiency drops significantly.

Therefore, when a single table grows too fast, consider splitting it. A figure often quoted in the industry is 5 million rows. Of course, 5 million is only a rule of thumb, so decide based on the actual situation.

Take horizontal splitting as an example: each table is split into multiple sub-tables that reside in the same database. For example, the following user table is split into the user 1 and user 2 tables.

Splitting a table into sub-tables within one database solves the single-table query performance problem to some extent, but it leaves another problem: the storage bottleneck of a single database.

So more often the industry spreads the sub-tables across multiple databases. For example, in the following figure the user table is split into two sub-tables, each stored in a different database.

The purpose of splitting tables is to reduce the size of a single table and so solve the performance problems caused by its data volume.
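
Once a table is split horizontally, every query needs to know which database and sub-table holds a given row. The sketch below shows one simple routing rule, modulo on the user ID. The rule itself, the function `locate_user`, and the names `user_db_N` and `user_N` are all assumptions made for illustration; the article does not say which sharding key or algorithm to use.

```python
from typing import NamedTuple


class ShardLocation(NamedTuple):
    database: str
    table: str


def locate_user(user_id: int, db_count: int = 2, tables_per_db: int = 2) -> ShardLocation:
    """Map a user ID to one of db_count * tables_per_db sub-tables."""
    slot = user_id % (db_count * tables_per_db)
    db_index = slot // tables_per_db
    table_index = slot % tables_per_db
    return ShardLocation(f"user_db_{db_index}", f"user_{table_index}")


for uid in (1000, 1001, 1002, 1003):
    print(uid, locate_user(uid))
```

Modulo routing spreads rows evenly but makes it expensive to change the number of shards later, since most rows would map to a different slot. Range-based or consistent-hash routing trades some evenness for easier growth.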

Complexity

Splitting databases and tables does solve many problems, but it also brings a lot of complexity to the system.

Cross-database join queries

Before the split, we could simply use JOIN to associate several tables in one query. After the split, the two tables may no longer be in the same database, so how can we use JOIN?

There are several solutions:

Field redundancy: put the fields to be associated into the main table to avoid the join.
Data abstraction: aggregate data through ETL or similar means to generate new tables.
Global tables: place some basic tables in every database.
Application-layer assembly: query the basic data first, then assemble and compute the result in the application.

Distributed transactions

A single database can rely on local transactions, while multiple databases require distributed transactions.

Common solutions include reliable-message (MQ) based schemes, two-phase commit, and flexible transactions.

Distributed IDs

After splitting into multiple databases, MySQL’s own auto-increment primary key can no longer be relied on, because IDs generated independently in different databases may collide. A globally unique ID scheme is needed instead.

Common distributed ID solutions are:

Use globally unique IDs (GUIDs);
Specify an ID range for each shard;
Distributed ID generators (such as Twitter’s Snowflake algorithm).
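
As an illustration of the last option, here is a Snowflake-style generator. It follows the commonly described layout of a 41-bit millisecond timestamp, a 10-bit worker ID, and a 12-bit sequence. It is a sketch rather than Twitter's code: the class name, the epoch constant, and the handling of a backwards clock are assumptions.

```python
import threading
import time


class SnowflakeGenerator:
    """64-bit IDs: 41-bit millisecond timestamp | 10-bit worker | 12-bit sequence."""

    EPOCH_MS = 1_288_834_974_657  # assumed custom epoch; any fixed past instant works
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = time.time_ns() // 1_000_000
            if now < self._last_ms:
                # Clock moved backwards: reuse the last timestamp so IDs stay unique.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 sequence values used in this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = time.time_ns() // 1_000_000
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQUENCE_BITS)
                    | self._worker_id << self.SEQUENCE_BITS
                    | self._sequence)


gen = SnowflakeGenerator(worker_id=1)
ids = [gen.next_id() for _ in range(10_000)]
print(len(set(ids)) == len(ids), ids == sorted(ids))  # True True
```

Giving each shard or application instance its own `worker_id` is what keeps IDs unique across databases without any coordination at generation time. Because the timestamp occupies the high bits, the IDs are also roughly time-ordered, which suits B+ tree primary keys.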

Multiple data sources

After splitting databases and tables, the application may have to fetch data from multiple databases or sub-tables. The usual solutions are client-side adaptation and proxy-layer adaptation. Middleware commonly used in the industry includes:

ShardingSphere (formerly Sharding-JDBC)
Mycat

Summary

If you run into a database problem, don’t rush to split databases and tables. First see whether it can be solved with conventional methods.

Splitting databases and tables brings huge complexity to the system, so it is not recommended to do it prematurely. As a system architect, you can make your system flexible and extensible, but don’t over-design.

Query performance optimization

Use EXPLAIN for analysis

EXPLAIN is used to analyze SELECT statements, and developers can optimize a query by studying its EXPLAIN output.

The more important fields are:

select_type: the query type, such as simple query, union query, or subquery.
key: the index actually used.
rows: the estimated number of rows to be scanned.

Optimize data access

1. Reduce the amount of requested data

Return only necessary columns: it is best not to use SELECT * statements.
Return only necessary rows: use the LIMIT clause to limit the data returned.
Cache repeated queries: caching avoids hitting the database at all, and the performance gain is significant when the same data is queried repeatedly.

2. Reduce the number of rows scanned on the server

The most effective way is to use an index that covers the query.

Restructuring queries

1. Split large queries

A large query, if executed at once, can lock up a lot of data at once, consume the entire transaction log, exhaust system resources, and block many small but important queries.
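
A typical case is purging old rows: instead of one huge DELETE, the work is done in small batches with a commit after each. The sketch below uses `sqlite3` and a hypothetical `messages` table only so that it runs as-is. In MySQL the batch could be written directly as `DELETE ... WHERE ... LIMIT n`, so the subquery form here is an adaptation for SQLite.

```python
import sqlite3


def delete_in_batches(conn, cutoff, batch_size=1000):
    """Delete old rows a batch at a time; returns the number of batches run."""
    batches = 0
    while True:
        cur = conn.execute(
            "DELETE FROM messages WHERE id IN "
            "(SELECT id FROM messages WHERE created < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # short transactions release locks between batches
        if cur.rowcount == 0:
            return batches
        batches += 1
        # In production, a brief sleep here gives other queries room to run.


# sqlite3 stands in for MySQL; the table and its contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, created INTEGER)")
conn.executemany("INSERT INTO messages (created) VALUES (?)",
                 [(day,) for day in range(5000)])
conn.commit()

batches = delete_in_batches(conn, cutoff=3500)
remaining = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(batches, remaining)  # 4 1500
```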

2. Decompose the large join query

The benefits of breaking a large join query into one single-table query per table and then joining the results in the application are:

Caching becomes more effective. For a join query, if any one of the tables changes, the cached result for the whole query becomes unusable. After decomposition, even if one table changes, the cached results for the other tables can still be used.
The cached results of single-table queries are more likely to be reused by other queries, which reduces redundant record lookups.
Lock contention is reduced.
Joining at the application layer makes it easier to split the database later, and therefore easier to achieve high performance and scalability.
The query itself may also become more efficient. In the following example, using IN() instead of a join lets MySQL look rows up in ID order, which may be more efficient than a random join.

SELECT * FROM tag
JOIN tag_post ON tag_post.tag_id=tag.id
JOIN post ON tag_post.post_id=post.id
WHERE tag.tag='mysql';

SELECT * FROM tag WHERE tag='mysql';
SELECT * FROM tag_post WHERE tag_id=1234;
SELECT * FROM post WHERE post.id IN (123,456,567,9098,8904);
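
The three single-table queries above still have to be stitched together by the application. Here is a minimal sketch of that assembly step, again using `sqlite3` with made-up rows so that it is runnable; the function `posts_for_tag` and the `title` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tag_post (tag_id INTEGER, post_id INTEGER);
    INSERT INTO tag VALUES (1234, 'mysql'), (1235, 'redis');
    INSERT INTO post VALUES (123, 'Indexes'), (456, 'Replication'), (789, 'Caching');
    INSERT INTO tag_post VALUES (1234, 123), (1234, 456), (1235, 789);
""")


def posts_for_tag(conn, tag_name):
    # Query 1: resolve the tag name to its ID.
    row = conn.execute("SELECT id FROM tag WHERE tag = ?", (tag_name,)).fetchone()
    if row is None:
        return []
    # Query 2: collect the IDs of the posts that carry the tag.
    post_ids = [r[0] for r in conn.execute(
        "SELECT post_id FROM tag_post WHERE tag_id = ?", (row[0],))]
    if not post_ids:
        return []
    # Query 3: fetch the posts with IN(), one placeholder per ID.
    placeholders = ",".join("?" * len(post_ids))
    return conn.execute(
        f"SELECT id, title FROM post WHERE id IN ({placeholders}) ORDER BY id",
        post_ids,
    ).fetchall()


print(posts_for_tag(conn, "mysql"))  # [(123, 'Indexes'), (456, 'Replication')]
```

Only the placeholder list is built with string formatting. The ID values themselves are always passed as bound parameters, which keeps the IN() query safe from SQL injection.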

Index optimization

1. Isolated columns

When running a query, an indexed column cannot be part of an expression or an argument to a function; otherwise the index cannot be used.

For example, the following query cannot use the index on the actor_id column:

SELECT actor_id FROM sakila.actor WHERE actor_id + 1 = 5;
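
The effect can be observed in a query planner. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in because it runs without a server. MySQL's EXPLAIN output looks different and its optimizer is not the same, so treat this as an illustration of the principle only. The table definition is a simplified assumption, not the real sakila schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, name TEXT)")


def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Column wrapped in an expression: the planner has to examine every row.
wrapped = plan("SELECT actor_id FROM actor WHERE actor_id + 1 = 5")
# Column isolated on one side of the comparison: a primary-key lookup is possible.
isolated = plan("SELECT actor_id FROM actor WHERE actor_id = 5 - 1")

print(wrapped)   # expected to describe a SCAN of the table
print(isolated)  # expected to describe a SEARCH by primary key
```

The fix is to move the arithmetic to the constant side of the comparison, so that the column stands alone.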

2. Multi-column index

When a query filters on several columns, a multi-column index performs better than multiple single-column indexes. For example, for the following statement it is best to create one multi-column index on actor_id and film_id.

SELECT film_id, actor_id FROM sakila.film_actor
WHERE actor_id = 1 AND film_id = 1;

3. Order of index columns

Put the most selective index columns first.

Index selectivity is the ratio of distinct index values to the total number of records. Its maximum value is 1, at which point every record has a unique index value. The higher the selectivity, the better the index distinguishes between records and the more efficient the query.

For example, in the results shown below customer_id is more selective than staff_id, so it is best to place the customer_id column first in the multi-column index.

SELECT COUNT(DISTINCT staff_id)/COUNT(*) AS staff_id_selectivity,
COUNT(DISTINCT customer_id)/COUNT(*) AS customer_id_selectivity,
COUNT(*)
FROM payment;

   staff_id_selectivity: 0.0001
customer_id_selectivity: 0.0373
               COUNT(*): 16049
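
The same ratio is easy to compute outside the database, for instance when sampling rows to decide on a column order. In this sketch the `selectivity` function and the sample rows are hypothetical; they are not taken from the payment table above.

```python
def selectivity(values):
    """Distinct values divided by total rows; 1.0 means every row is unique."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


# Hypothetical sample rows as (staff_id, customer_id) pairs.
payments = [(1, 101), (1, 102), (2, 103), (2, 101), (1, 104), (2, 105)]

staff_sel = selectivity(staff for staff, _ in payments)
customer_sel = selectivity(customer for _, customer in payments)
print(round(staff_sel, 3), round(customer_sel, 3))  # 0.333 0.833

# Order the index columns from most to least selective.
columns = sorted({"staff_id": staff_sel, "customer_id": customer_sel}.items(),
                 key=lambda item: item[1], reverse=True)
print([name for name, _ in columns])  # ['customer_id', 'staff_id']
```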

4. Prefix indexes

For BLOB and TEXT columns, and for very long VARCHAR columns, you must use a prefix index, which indexes only the leading characters of the value.

The selection of prefix length depends on index selectivity.
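
One way to pick the length is to measure how selectivity grows as the prefix gets longer and stop once it is close to that of the full column. The sketch below does this on a made-up list of city names. The helper names and the 95% threshold are assumptions, not a rule from the article.

```python
def prefix_selectivity(values, length):
    """Selectivity the column would have if only `length` leading characters were indexed."""
    prefixes = {v[:length] for v in values}
    return len(prefixes) / len(values)


def shortest_prefix(values, target_ratio=0.95):
    """Smallest prefix length whose selectivity reaches target_ratio of the full column's."""
    full = len(set(values)) / len(values)
    longest = max(len(v) for v in values)
    for length in range(1, longest + 1):
        if prefix_selectivity(values, length) >= target_ratio * full:
            return length
    return longest


# Hypothetical city names standing in for a VARCHAR column.
cities = ["London", "Londrina", "Lonavala", "Lisbon", "Lima", "Lagos", "Lahore", "Laredo"]
for n in (1, 2, 3, 4):
    print(n, prefix_selectivity(cities, n))  # 0.125, 0.375, 0.75, 0.875
print(shortest_prefix(cities))  # 5
```

A shorter prefix keeps the index small, while a longer one discriminates better. The threshold expresses how much selectivity you are willing to give up for the space saving.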

5. Covering indexes

A covering index contains the values of all the fields the query needs.

It has the following advantages:

An index is typically much smaller than the data rows, so reading only the index greatly reduces the amount of data accessed.
Some storage engines, such as MyISAM, cache only indexes in memory and rely on the operating system to cache the data. Accessing only the index therefore avoids a system call, which is usually time-consuming.
With the InnoDB engine, if a secondary index covers the query there is no need to look up the primary index.

How to drop an index in MySQL storage engine?

Storage engines

Storage engine selection

Choose the storage engine to suit each business table. For example, use MyISAM for tables dominated by queries and inserts, Memory for temporary data, and InnoDB for tables with frequent, highly concurrent updates.

Field definition

Rule: Use the smallest data type that can store data correctly. Select the appropriate field type for each column.

Integer types

TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT use 8, 16, 24, 32, and 64 bits of storage respectively. Generally, the smaller the column, the better. The number in INT(11) only specifies how many characters interactive tools display, and is meaningless for storage and computation.
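
The value range of each type follows directly from those bit widths. As a rough aid for choosing the smallest type that fits, the sketch below derives the ranges and picks the narrowest type for a given maximum value. The helper names are hypothetical.

```python
INTEGER_BITS = {"TINYINT": 8, "SMALLINT": 16, "MEDIUMINT": 24, "INT": 32, "BIGINT": 64}


def value_range(bits, unsigned=False):
    """Inclusive (min, max) of a two's-complement integer with the given width."""
    if unsigned:
        return 0, 2 ** bits - 1
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_type(max_value, unsigned=False):
    """Narrowest type whose range still holds max_value."""
    for name, bits in INTEGER_BITS.items():  # dicts keep insertion order
        if max_value <= value_range(bits, unsigned)[1]:
            return name
    raise ValueError("value does not fit in BIGINT")


print(value_range(8))                     # (-128, 127)
print(smallest_type(200))                 # SMALLINT
print(smallest_type(200, unsigned=True))  # TINYINT
print(smallest_type(3_000_000_000))       # BIGINT
```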

Character types

VARCHAR is variable-length and saves space, but each value needs 1 or 2 extra bytes to record its length. Use CHAR rather than VARCHAR for fixed-length values.

Don’t use foreign keys, triggers, or views

They reduce readability and affect database performance. Computation should be left to the application so that the database can concentrate on storage, and data integrity should be checked in the application.

Large file storage

Don’t use the database to store images (for example, Base64-encoded) or large files.

Put the files on a NAS instead. The database only needs to store the URI (a relative path), and the NAS server address is configured in the application.

Table splitting or field redundancy

Split rarely used fields out into another table to avoid tables with too many columns and too much data.

For example, a business system that records every message sent and received, where the messages are in XML format and stored as BLOB or TEXT for tracing and duplicate detection, can use a dedicated table for storing the messages.

Conclusion

What would you say if you were asked in an interview, “From which dimensions would you optimize a database?”

SQL and indexes
Storage engines and table structures
Database architecture
MySQL configuration
Hardware and operating system

In addition to code, SQL statements, table definitions, schemas, and configuration optimizations, business-level optimizations cannot be ignored. A few examples:

One year on Singles’ Day, why was there a promotion that rewarded topping up your balance or Yu’ebao, for example a 50 yuan bonus for a 300 yuan top-up?

Because paying with the balance or Yu’ebao only writes to a local or internal database, while paying by bank card requires calling an external interface, so the internal database operation is certainly faster.

On Double Eleven last year, why was it forbidden at midnight to query bills other than that day’s?

This is a degradation measure to protect the core business at that moment.

In recent years, why have Double Eleven prices been available a week in advance?

Pre-sales divert traffic.

At the application level there are many other ways to minimize the strain on the database, such as rate limiting or introducing a message queue (MQ) to smooth traffic peaks.

With MySQL, some companies handle millions of concurrent requests, while others struggle with a few hundred. The key is how MySQL is used. A database that feels slow is not necessarily slow in itself, and sometimes the optimization has to happen in the layers above it.
