Preface

Many people think that performance is optimized by writing code (application code or database code), but this is a big mistake. The part that affects performance the most is actually already determined in the design. In many cases, later optimization can only patch up problems left over from the earlier design work, and what it can fix is usually limited.

How to design a MySQL database Schema as efficiently and painlessly as possible will be discussed over three articles.

I. Efficient model design

Is the most normalized necessarily the most reasonable?

In database Schema design theory, the normalization paradigm has long been regarded as the bible. A Schema designed according to normalization theory has clear logic and clear relationships and is convenient to extend. It even keeps the amount of stored data as small as possible; especially at the higher normal forms, almost no redundant data can be found. In many people’s eyes, the higher the normal form a Schema satisfies, the better the design.

However, what many people ignore is when, and from what starting point, this theory was produced. The normalization theory of relational databases was born in the early 1970s, and its most fundamental purpose is to remove redundancy from the data as far as possible, keep the data consistent, and make data modification simple.

In fact, removing data redundancy as far as possible is not only so that a query over the same volume of data can return more records; another very important reason was that, at the time, data storage space was extremely expensive and storage devices had very limited capacity. Today, with storage hardware developing so rapidly, capacity is no longer much of an issue.

In normalization theory, data consistency and simple data modification are guaranteed by adding various constraints to the database, and constraints are themselves very resource-consuming things.

Therefore, for a performance-oriented database Schema design, we cannot rely on normalization theory as the sole guide. During design we should start from actual requirements and take performance improvement as the fundamental goal. In many cases, to improve performance as much as possible, we must denormalize.

① Moderate redundancy – let Queries reduce Joins as much as possible

Readers familiar with the MySQL optimizer may know that although MySQL claims to use a new generation of optimizer technology, because the statistics it currently collects are rather limited, its performance is not particularly satisfying, nor is it as smart as the official hype suggests. When handling ordinary Joins it can generally produce fairly efficient execution plans, but when faced with more complex queries or Joins it easily produces rather unreasonable plans; often the chosen access order of the tables is unsuitable, causing the complex Query as a whole to execute inefficiently.

Therefore, to give our Queries execution plans that are as optimal as possible, the most direct and effective way is to minimize Joins, and to reduce Joins we inevitably have to rely on redundant table fields.

Before optimization, the group_message table stores only the ID of the message author; after the redundancy optimization, the author’s nick_name is also stored in the group_message table, as the author column.
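
As a rough sketch, the two designs might look like the following DDL (the exact column lists and types are assumptions, inferred from the queries below and from the desc output of top_message later in the article):

-- Before optimization: the nickname lives only in the user table
CREATE TABLE group_message_bad (
  id INT NOT NULL AUTO_INCREMENT,
  gmt_create DATETIME NOT NULL,
  gmt_modified DATETIME NOT NULL,
  group_id INT NOT NULL,
  user_id INT NOT NULL,
  subject VARCHAR(128) NOT NULL,
  PRIMARY KEY (id),
  KEY group_message_gid_ind (group_id)
);

-- After optimization: user.nick_name is redundantly stored as author
CREATE TABLE group_message (
  id INT NOT NULL AUTO_INCREMENT,
  gmt_create DATETIME NOT NULL,
  gmt_modified DATETIME NOT NULL,
  group_id INT NOT NULL,
  user_id INT NOT NULL,
  author VARCHAR(32) NOT NULL,  -- redundant copy of user.nick_name
  subject VARCHAR(128) NOT NULL,
  PRIMARY KEY (id),
  KEY group_message_gid_ind (group_id)
);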

Query and execution plan implementing the list function before optimization (group_message_bad is the table before optimization, group_message is the table after optimization):

sky@localhost : example 09:13:41> explain
-> SELECT t.id, t.subject, user.id, user.nick_name
-> FROM (
-> SELECT id, user_id, subject
-> FROM group_message_bad
-> WHERE group_id = 1
-> ORDER BY gmt_modified DESC LIMIT 1,10
-> ) t, user
-> WHERE t.user_id = user.id\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: <derived2>
type: system
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
Extra:
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
table: user
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
Extra:
*************************** 3. row ***************************
id: 2
select_type: DERIVED
table: group_message_bad
type: ALL
possible_keys: group_message_gid_ind
key: group_message_gid_ind
key_len: 4
ref:
rows: 1
Extra: Using filesort

Query and execution plan for the list function after optimization:

sky@localhost : example 09:14:06> explain
-> SELECT t.id, t.subject, t.user_id, t.author
-> FROM group_message t
-> WHERE group_id = 1
-> ORDER BY gmt_modified DESC LIMIT 1,10\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t
type: ref
possible_keys: group_message_gid_ind
key: group_message_gid_ind
key_len: 4
ref: const
rows: 1
Extra: Using where; Using filesort

As can be seen from the execution plans before and after optimization, the two differ considerably. Before optimization, two tables (group_message_bad and user) had to be accessed to obtain the result; after optimization, the result can be fetched from a single table (group_message), because we have redundantly stored the “author” information in group_message.

From the perspective of database normalization theory, such a design is not reasonable: the user nickname data may become inconsistent between the user table and the group_message table, every update of a user’s nickname must update the data in both tables, and the application must handle extra logic to keep the two copies as consistent as possible. However, from a performance perspective this redundancy is worthwhile. Our data update logic does become more complex, but when weighing the added update cost we should also consider how many updates of user nicknames will actually occur. What we need to care about is the overall performance of the system, not the performance of a single operation within it. As with the nickname data in the example, although the cost of updates has increased, the efficiency of queries has improved, and in the example queries happen far more often than updates. Gaining a larger performance benefit by paying extra cost on a minority of operations is in fact a strategy we use frequently in system performance optimization.
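
In sketch form, the consistency logic mentioned above can be as simple as wrapping the two updates in one transaction (the user id 42 and the new nickname are of course just placeholder values):

-- Keep the redundant author column in sync when a nickname changes
START TRANSACTION;
UPDATE user SET nick_name = 'new_nick' WHERE id = 42;
UPDATE group_message SET author = 'new_nick' WHERE user_id = 42;
COMMIT;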

In most application systems there is a lot of data that, like the example above, is queried frequently but updated rarely. In many cases, a Schema design that blindly pursues normalization theory is not suitable for a high-performance system. Personally, I think the essence of database normalization theory is conceptual simplification: the tables in a normalized database are generally smaller, so that each table contains only the most closely related columns. Although this may improve the maintainability of the database in some cases, the system may then need complex Joins to retrieve certain data, which is bound to make query retrieval slow; and if we break such a Join apart and implement the Join logic in the application through multiple simple queries, the network overhead becomes huge.

② Large field vertical split – Summary table optimization

In fact, another optimization strategy was also used in the example above: the “large field vertical split” strategy. It is the exact opposite of the moderate-redundancy strategy introduced before: moderate redundancy takes fields from other tables and stores an extra copy locally, while the large field vertical split moves some of a table’s own fields out into a separate table.

Many readers may be confused: we just argued for pulling other tables’ fields into our own table, so why would we now split our own fields out? Isn’t that somewhat contradictory?

In fact, there is no contradiction. Earlier we pulled in other tables’ fields because we need them in many of our queries and want to reduce the performance cost of Joins. The large fields we now move out are precisely the ones that most queries do not need. And before moving anything out, we will certainly make a thorough evaluation and comparison before deciding on the split.

So what exactly are the appropriate fields to split out of a table?

The first is definitely the large field. Why? Simply because it is large. Large fields usually store long detail information, such as the content of an article, the body of a post, the description of a product, and so on.

The second is that it is accessed significantly less frequently than the other fields in the table. Because large fields hold more content, in most cases they account for more than 80% of the whole record, and database data files generally store data in units of whole records. That is, when we query only certain fields of some records, the database cannot read just the few fields we need; it has to read all the other fields too (except when the whole query can be satisfied from an index alone). So we are forced to read a lot of irrelevant data, including the large fields, and since large fields occupy a large proportion of the space, the IO resources wasted this way are naturally considerable.

In such a scenario, we need to separate the large field from the original table and store it in a separate table, so that the IO cost of accessing the other data drops dramatically and performance improves greatly.
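
The article later mentions a group_message_content table for exactly this purpose; a minimal sketch of such a split, assuming a 1:1 child table keyed by the parent’s primary key (the column names here are assumptions), might be:

-- Child table holding only the large field, 1:1 with group_message
CREATE TABLE group_message_content (
  group_msg_id INT NOT NULL,  -- equals group_message.id
  content TEXT NOT NULL,      -- the large field moved out of the main table
  PRIMARY KEY (group_msg_id)
);

-- When the body is actually needed, a primary-key Join fetches it cheaply
SELECT t.id, t.subject, c.content
FROM group_message t, group_message_content c
WHERE t.id = 1 AND c.group_msg_id = t.id;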

Some readers may worry that although access to the other fields becomes more efficient after the move, whenever we do need the large field we inevitably have to go through a Join, and processing may become much less efficient once a Join is involved. This is a reasonable concern, and it is exactly the second factor to consider before splitting out a large field: the access-frequency factor. As explained earlier, when deciding whether to split we need not only “large” but also “low frequency”; of course, “low frequency” here is only relative. Moreover, after the split the relationship between the two tables is a fixed one-to-one correspondence, so the Join does not have a great impact on performance.

Should other fields be moved out along with the large field? Once we have decided that the large field must be split out of the main table, then any other field that is accessed in the same situations as the large field, and much less frequently than the remaining fields in the table, can be split out together with it.

In fact, in some cases we do not even need a large field to justify a vertical split. In some scenarios, most of the fields in a table are rarely accessed while a few of them are accessed very frequently. For this type of table, a vertical split is also an ideal way to optimize performance.

③ Horizontal split of large table – split optimization based on type

The “big table horizontal split” strategy may not be used very often in terms of performance optimization, but when used properly, it can be quite a surprise.

Let’s go straight to an example. Suppose we extend the requirements of the previous example a little: we want group system administrators to be able to publish system messages that are displayed at the top of every page of each group’s discussion posts.

Given this requirement, our first reaction is surely to add a type column to the group_message table to hold the post type, identifying whether a record is an ordinary member’s discussion post or a top message from the system administrator. Each list page would then be rendered through two queries against group_message (one for the top messages, one for the ordinary discussion posts), merged in the application. The consequence is that, because the group_message table is large, the Query fetching the top messages is relatively expensive.
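
In sketch form, this first attempt might look as follows (the message_type column and its values are illustrative assumptions):

-- message_type = 1: administrator's top message; 0: ordinary post
SELECT id, subject, user_id, author
FROM group_message
WHERE message_type = 1
ORDER BY gmt_modified DESC;

SELECT id, subject, user_id, author
FROM group_message
WHERE group_id = 1 AND message_type = 0
ORDER BY gmt_modified DESC LIMIT 1,10;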

Let’s think about it another way:

  • First, the top messages and the other discussion posts do not interact with each other;
  • Second, the top messages change less often than the other discussion posts;
  • Third, the top messages are accessed very frequently;
  • Finally, the number of top messages is very small compared with the ordinary discussion posts.

Based on the above analysis, if we store the top messages separately, in a table apart from the ordinary discussion posts, there is no extra performance cost, while the cost of retrieving the top messages drops. Because their access frequency is very high, the saving from reducing the cost of each retrieval is large. And being few in number and rarely changed, they are ideal candidates for MySQL’s Query Cache, whereas if they stayed in group_message, every change to an ordinary discussion post would invalidate the table’s Query Cache entries and keep the top messages from benefiting from it.

Through the above analysis, we can easily come up with a better solution for storing these top messages: add a new table similar to group_message, which we will tentatively name top_message, as follows:

sky@localhost : example 10:49:20> desc top_message;
+--------------+--------------+------+-----+---------+-------+
| Field        | Type         | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| id           | int(11)      | NO   |     | 0       |       |
| gmt_create   | datetime     | NO   |     | NULL    |       |
| gmt_modified | datetime     | NO   |     | NULL    |       |
| user_id      | int(11)      | NO   |     | NULL    |       |
| author       | varchar(32)  | NO   |     | NULL    |       |
| subject      | varchar(128) | NO   |     | NULL    |       |
+--------------+--------------+------+-----+---------+-------+

group_id is omitted because top messages are global, and the message content can still be stored in group_message_content.
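
After the split, the list page can assemble its data with two cheap queries (a sketch; the LIMIT values follow the earlier example):

-- Top messages are global, so no group filter is needed
SELECT id, subject, user_id, author
FROM top_message
ORDER BY gmt_modified DESC;

-- Ordinary discussion posts of the group, exactly as before
SELECT id, subject, user_id, author
FROM group_message
WHERE group_id = 1
ORDER BY gmt_modified DESC LIMIT 1,10;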

The above is just an example; in practice things may not be so simple, but it illustrates the idea of optimizing overall system performance through Schema design with horizontal splits of large tables. In many large applications, because the data volume is very large and the concurrent access very high, a single table on a single host cannot support the load, and the large table is often split horizontally and stored in multiple databases on multiple hosts, improving overall scalability.

④ Statistics table – quasi-real-time optimization

Why quasi-real-time?

Many people looking at this optimization strategy may ask: why change the requirement, turning “real-time” statistics into quasi-real-time ones? The reason is simply that the execution cost of real-time statistics is too high. Every display (that is, every page refresh) would require a statistical calculation, wasting a lot of resources on repeated work. After switching to quasi-real-time statistics, each access touches only a small amount of pre-computed data and needs no frequent statistical calculation.

Of course, not all statistics are suitable for the quasi-real-time optimization strategy; even if we wanted it, the product manager would not allow it, and even if the product manager wanted it, our users would certainly disagree.

What types of statistics are suitable for optimization through quasi-real-time statistics?

  • First, the accuracy of the statistics is not strictly required;
  • Second, the statistics are not very time-sensitive;
  • Third, the statistics are accessed very frequently and computed repeatedly;
  • Finally, the amount of data involved in the statistics is large.

Looking at this list, it does seem like a lot of conditions. However, it is quite possible that such statistics exist in the system you maintain: for example, the current number of online users of the system, the current total number of posts and replies in a forum system, the total result count and page count of a multi-condition large-result-set query page, the top-N ranking of some virtual points, and so on.

These statistical calculations involve large amounts of data, require considerable computing resources, and are accessed very frequently. If they were all computed in real time, even a moderately large data volume would bring a very heavy hardware resource overhead, while a short period of inaccuracy would hardly degrade the user experience. So they can perfectly well be computed by a scheduled task at fixed intervals and stored in a specially designed statistics table; when the statistics need to be displayed, we simply fetch the pre-computed results. The performance of each statistics display improves by an order of magnitude, and the overall user experience gets better.
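
As a concrete sketch of this strategy (the table stat_result, its columns, and the counted source are illustrative assumptions; on MySQL versions without the event scheduler, an external cron job can drive the same refresh):

-- Statistics table holding pre-computed counters
CREATE TABLE stat_result (
  stat_key VARCHAR(64) NOT NULL,   -- e.g. 'total_posts'
  stat_value BIGINT NOT NULL,
  gmt_modified DATETIME NOT NULL,
  PRIMARY KEY (stat_key)
);

-- Refresh the counter every few minutes instead of on every page view
-- (requires event_scheduler=ON)
CREATE EVENT refresh_total_posts
ON SCHEDULE EVERY 5 MINUTE
DO
  REPLACE INTO stat_result (stat_key, stat_value, gmt_modified)
  SELECT 'total_posts', COUNT(*), NOW() FROM group_message;

-- Displaying the statistic is now a cheap primary-key lookup
SELECT stat_value FROM stat_result WHERE stat_key = 'total_posts';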