“This article has participated in the good article call order activity, click to see: back end, big front end double track submission, 20,000 yuan prize pool for you to challenge!”

takeaway

Starting today, Little K will use the mining community as the application scenario of MySQL case, and analyze the implementation details of MySQL in detail.

As many of you know, there’s a “follow” and “follow” feature on the Diggers’ homepage, especially the “follow” feature. When you see a notification that someone is following you, you might click on the “follow” list and say, “Hey, who’s following me today?” A little excited, ha ha ha!

Today, we will take the “followers” function as an example to see how we implement this function in SQL. What is the execution principle behind the statement?

Let’s first look at how to implement the “followers” function in SQL:

SELECT user.user_name, user.avatar, user.position, user.company FROM user LEFT JOIN t_user_relation ON user.user_id = t_user_relation.follow_user_id WHERE t_user_relation.followed_user_id = 10008
Copy the code

SELECT * from user_name where user_id=10008 SELECT * from user_name where user_id=10008 SELECT * from user_name where user_id=10008 SELECT * from user_name where user_id=10008 SELECT * from user_name where user_id=10008 SELECT * from user_name where user_id=10008 Select * from ‘t_user_Relation’ where ‘follower’ = ‘id’ where ‘follower’ = ‘id’ where ‘follower’ = ‘ID’ where ‘id’ = ‘id’ where ‘id’ = ‘id’ where ‘id’ = ‘id’ where ‘id’ = ‘id’ where ‘id’ = ‘id’ where ‘id’ = ‘id’ where ‘id’ = ‘id’ This is done by looking up the user table with the follower user ID. However, I’m just using this example to start today’s topic of joint table queries.

In the article “Join Query depth optimization – an unknown new method” I explained several strategies of Join query, through this article, you should have a certain understanding of the basic process of Join query, then, combined with the above case, then, If MySQL executes this statement using the Index nested-loop Join policy, performance can be guaranteed because the query can match the Index.

However, in this case, only two tables, user and T_user_Relation, are associated. In some scenarios, such as report queries, 5, 6, 10 or more tables may have to be associated to obtain the desired data due to cross-business domain reasons. In this case, the query of join tables may become slow. So how many tables can I associate with a statement so that it doesn’t execute too slowly?

In the two articles “Why MySQL Chooses To Execute Plan A over Plan B(part 1)” and “Why MySQL Chooses To Execute Plan A over Plan B(Part 2)”, I explained in detail the optimization strategy based on MySQL cost model: MySQL compares the execution costs of multiple potential Qeps (query execution plans) and selects the least costly to execute the statement.

MySQL will compare the cost of the two solutions based on the cost model. In the end, MySQL will compare the cost of the two solutions based on the cost model. Select the lowest cost to execute the statement, that is, select the lowest cost driver relationship.

Since MySQL is through comparing the different execution cost driver relationship to select the driver table, then, a table, the more the more means that different combination of the driven relationship (the worst two combinations of n – 1, n) for the associated table number), comparison of various combinations of execution cost at the higher price, and then the corresponding query execution will slow down.

Since one of the reasons for slow statement execution is the cost analysis of the driver relationship, let’s take a look at the process of cost analysis of the driver relationship.

Before going into the process of cost analysis, let’s take a look at some of the core structures associated with cost analysis. Therefore, here, I use the following sentences from the Introduction to illustrate these structures:

Join

After parsing, the structure of a statement will be stored in the Join structure, including select column, from table, where, groupby, orderby and other information.

  • Best_ref: The execution plan with the lowest current cost during the comparison of execution plan costs. The JOIN_TAB array shows the execution of each phase of the plan. The green arrow in the figure above shows the execution of each phase of the plan. Select * from JOIN_TAB where user -> t_user_relation -> user -> t_user_relation -> user -> t_user_relation

    • JOIN_TAB: contains the tables used for execution of a phase and the execution cost of the phase. It mainly contains the following core attributes:

      • Table_ref: Table information used in a phase. As shown in the figure above, the execution plan is user -> T_user_Relation. Therefore, table_ref in the first JOIN_TAB indicates the information about the user table during the query stage. In the second JOIN_TAB, table_ref is used to query information about T_user_Relation.

      • Dependent: The table used by a phase, the table on which it depends. MySQL > select * from t_user_Relation where t_user_relation = followed_user_id; MySQL > select * from T_user_relation where t_user_relation = followed_user_id; MySQL > select * from T_user_relation where t_user_relation = followed_user_id; There’s no dependency between user and T_user_Relation.

        However, if I change the statement to this:

        SELECT user.user_name, user.avatar, user.position, user.company FROM user LEFT JOIN t_user_relation ON user.user_id = t_user_relation.follow_user_id WHERE user.user_id = 10008
        Copy the code

        T_user_relation t_user_Relation t_user_relation t_user_relation t_user_relation t_user_relation t_user_relation t_user_relation t_user_relation t_user_relation t_user_relation t_user_relation t_user_relation t_user_relation t_user_relation In the figure above, the dependent table in the second JOIN_TAB is the user table, indicating that the T_user_Relation table queries depend on the fields in the USER table.

      • Read_time: Table read time used by a phase. As shown in the preceding figure, the execution plan is user -> T_user_relation. Therefore, read_time in the first JOIN_TAB indicates the time to read the user table during the query stage. In the second JOIN_TAB, read_time indicates the time for reading T_user_Relation during the query stage.

  • Positions: positions indicate the execution plan whose cost is currently being calculated during the process of comparing the execution plan cost. For example, positions in the preceding figure is user -> T_user_Relation, indicating that the user table is queried first, and then the T_user_Relation table is queried. Inside it is an array of positions. Each POSITION represents a phase in the execution plan in the order of the execution plan.

    • POSITION: As shown in the figure above, the second POSITION represents the stage of querying T_user_Relation, which is described by the ST_position structure. Let’s look at the st_position structure:
      • Prefix_rowcount: Total number of rows scanned before and including the current phase. For example, the second POSITION in the figure above is the query T_user_Relation stage. Before this stage, there was only one stage to query the user table. Therefore, Prefix_rowcount in POSITION is number of rows scanned for T_user_Relation + Number of rows scanned for user =8.
      • Prefix_cost: Total execution cost before and including the current phase. For example, the second POSITION in the figure above is the stage of querying T_user_Relation table. Before this stage, there is only one stage of querying the user table. Therefore, prefix_cost in this POSITION is the cost of querying T_user_Relation + cost of querying user =12.2.
      • Read_cost: execution cost of the current phase. For example, the second POSITION in the figure above is the query T_user_Relation stage, so the read_cost in this POSITION is 9.6, indicating that the cost of querying T_user_Relation is 9.6.
      • Rows_fetched: Indicates the number of scanned rows in the current phase. For example, the second POSITION in the figure above is the query T_user_Relation stage, due to the conditiont_user_relation.followed_user_id = 10008The matching index is queried. Therefore, rows_fetched in the POSITION is 0, indicating that 0 rows are scanned in the T_user_Relation table according to this condition.
      • JOIN_TAB: Same as the JOIN_TAB structure above. As shown in the figure above, JOIN_TAB in ST_position points to the second JOIN_TAB in the JOIN_TAB array, indicating the related tables used in the query t_user_Relation stage and the execution cost of this stage.
  • Best_read: The lowest current execution cost of the entire statement during the comparison of execution plan costs. As shown in the figure above, assume that the current minimum execution cost of the statement in the Guide is 12.2.

Driver list selection

After explaining the core data structure, you may have a question: Since the relationship between st_position and JOIN_TAB in the JOIN_TAB array is 1:1, why does MySQL not put st_position into JOIN_TAB? Before you rush, let’s take a look at this picture:

Have you found that this is not a graph depth traversal! That’s right!

  • T1 -> T2 -> T4: query tables T1, T2, and T4 in sequence:
    • t1 -> t2
      • Analyze T1 and find that the cost of T1 is 2.6. Plug this cost into T2, which is 2.6 on the first red arrow in the figure.
      • By analyzing T2, it is found that the cost of T2 is 9.6. Add in the cost of 2.6 + T2 cost of 9.6, and the total cost of T1 -> T2 is 12.2. Bring 12.2 into T4. 12.2 on the second red arrow in the figure.
    • t2 -> t4
      • By analyzing T4, it is found that the cost of T4 is 1.6. Add in cost 12.2 + t4 cost 1.6, and the total cost of T1 -> T2 -> T4 is 13.8.
  • T1 -> T3 -> T4: indicates that t1, T3, and T4 tables are queried in sequence. Similarly, the total cost of T1 -> T3 -> T4 is 12.8.

We compare this process with the above Join structure and find that st_position can describe the relationship between the two nodes in the above depth traversal process. Therefore, MySQL designs ST_position separately to represent the cost relationship between the two tables in the traversal path of the execution plan. Prefix_cost in ST_position is the total cost before the current node (including the current node), and read_cost is the query cost of the current node.

After talking about the core structure, we can see how MySQL compares the cost of driving relationships. I mentioned deep traversal above, and smart friends have guessed it, yes! MySQL also uses deep traversal, which is algorithmically called greedy search, when comparing different driver relationship costs. Thus, the passage in The Introduction, the process of its greedy search goes like this:

MySQL > select * from user where t_user_Relation = user where T_user_Relation = user where t_user_Relation = user;

  1. User -> T_user_relation: the query cost of user table is 2.6, and the query cost of T_user_relation is 9.6. Therefore, the total cost is 2.6 + 9.6 = 12.2.
  2. T_user_relation -> user: the query cost of t_user_relation is 2.0127 and the query cost of user is 6. Therefore, the total cost is 2.0127 + 6 = 8.0127.

MySQL > select T_user_Relation to drive the user table. For A detailed query cost analysis process, read the two articles “Why MySQL Chose Plan A over Plan B(1)?” And “Why did MySQL choose to execute Plan A over Plan B?” .

So, now that we know how MySQL compares to drive relational costs, going back to the problem of too many associated tables slowing down queries, we know that if there are too many associated tables, then MySQL has to do more combinations of tables in order to process queries, and do greedy searches of various combinations to compare their costs. For MySQL, this is bound to affect query performance.

Pruning process

For this reason, MySQL introduced prune_level to reduce the number of search traversals. Let’s take a look at this reduction process. Suppose a statement is associated with four tables: T1, T2, T3, t4:

In this case, suppose MySQL selects t1 as the driver table to calculate the query cost.

  1. T1 – > T2, that is, T1 table drives T2 table first, and the query cost of T1 is 2.6, the query cost of T2 is 1.6, and the sum of the two is 4.2.
  2. T1 -> t3, that is, T1 table drives T3 table. The query cost of T1 is 2.6, and the query cost of T3 is 8.6. The sum of the two is 11.2prune_levelIf the variable is set to 1, the path is pruned. Therefore, MySQL finds that t1 -> t3 costs 11.2 more than T1 -> T2 costs 4.2. T1 -> T3 traverses the branch end. As shown in the figure above, t3 in step1 is crossed, indicating that t1 -> t3 traversal ends the branch. Bring 4.2 into T3 and T4. That is 4.2 on the second red arrow and 4.2 on the second green arrow.
  3. T2 -> t3, that is, start to drive T3 from T2, and get the query cost of T3 is 5.6, plus the bring in cost of 4.2, 4.2 + 5.6 = 9.8.
  4. T2 -> T4, that is, start to drive T4 from T2, and get the query cost of T4 is 1.6, plus the carry cost of 4.2, 4.2 + 1.6 = 5.8. Due to theprune_levelT1 -> t2 -> t3 costs 9.8 more than T1 -> T2 -> T4 costs 5.8, t1 -> T2 -> t3 traverses the end of the branch and no other tables are traversed from T2. Focus only on the traversal branch of T1 -> T2 -> T4. As shown in the figure above, T3 in Step2 is crossed, indicating that t1 -> T2 -> T3 traverses the branch at the end. Let’s put 5.8 into T3. That is 5.8 on the third green arrow in the figure.
  5. T4 -> t3, that is, starting from T4 to drive T3, the query cost of T3 is 5.6, plus the carry cost of 5.8, 5.8 + 5.6 = 11.4.
  6. Therefore, the driver relationship is obtained: T1 -> T2 -> T4 -> T3, with a total cost of 11.4.

The prune_level variable prune_level reduces the number of greedy searches. The prune_level variable reduces the number of greedy searches. The prune_level variable reduces the number of greedy searches, and the prune_level variable reduces the number of greedy searches.

In the figure, the total cost of the traversal branch T1 -> T2 -> T3 -> T4 is 10.4, which is significantly smaller than the total cost of T1 -> T2 -> T4 -> T3, which is 11.4. Therefore, the prune_level variable is 1, and the branch with lower cost may be ignored when pruning function is enabled. Therefore, In join table queries, if the number of tables is small, the prune_level variable of 1 is not appropriate.

Pruning tuning

Fortunately, MySQL provided me with parameters to adjust the value of this variable. The prune_level variable defaults to 1, so we can set it to 0 by executing the following command to disable pruning.

set optimizer_prune_level = 0;
Copy the code

Traversal depth tuning

If the number of join tables is very large, the amount of temporary memory generated by recursive calls will be very large, which is not acceptable for memory sensitive MySQL. Therefore, MySQL sets the maximum depth of traversal to 62 by default.

Followed by a problem, if the connection table number is 100, so, start with a table depth traversal other tables, will touch the maximum depth of MySQL limitation, we can imagine, the space complexity of O (62) affect the performance of query is huge, so we must hope can down the traversal of the maximum depth threshold.

So, what is the appropriate threshold to adjust to? MySQL is smart enough to give us a threshold of 0, so what does that mean? 0 indicates that MySQL itself has dynamically calculated the maximum traversal depth for us. The calculation rules are as follows:

  1. If the number of tables is less than or equal to 7, then the maximum traversal depth is the number of tables +1
  2. If the number of tables is greater than 7, then the maximum traversal depth is 7

Did you notice that MySQL uses 7 as the threshold for dynamically calculating the maximum traversal depth? However, given this dynamically computed threshold of 7, MySQL probably recommends that we do not traverse the table more than 7. PS: Of course, the most reasonable way is dynamic statistics traversal depth performance, and then, according to the statistical results to determine the maximum traversal depth, MySQL source code do such TODO, this is to give me the opportunity to play, ha ha ha! Just kidding.

Therefore, this answers the question in the title of my article: What is the maximum number of Join tables a Join query should not exceed? The answer is 7. If the number of tables we join is less than or equal to 7, then we must start at one table and go through the other tables at a depth of no more than 7.

Since the threshold for the maximum traversal depth can be adjusted to 0, how do we adjust it? The specific adjustment methods are as follows:

set optimizer_search_depth = 0;
Copy the code

conclusion

Finally, let’s wrap up today’s content: Join, the core data structure for table cost analysis, and the process that drives table selection.

At the same time, I also provided two parameter tuning:

scenario Parameter tuning
If the number of table joins is not large Set optimizer_prune_level = 0; Turn off the pruning function
If the number of table joins is less than or equal to 7 You are advised to set optimizer_search_depth to 0. Let MySQL compute the maximum traversal depth dynamically

What is the maximum number of Join tables in a Join query?

MySQL recommends that the number of Join tables in a Join query not exceed 7.

I am k, if you think this article is good, remember to like + follow oh!

Of course, if you have any questions about this article, please feel free to ask in the comments section.