About the author

Jianrong Yang, senior DBA at Competitive World, former database expert at Sohu Changyou, Oracle ACE, and YEP member. He has nearly ten years of database development and operations experience, and currently focuses on open source technology, operations automation, and performance tuning. He holds Oracle 10g OCP, OCM, and MySQL OCP certifications, and has solid skills in Shell and Java. He has shared technical content through WeChat and his blog every day for more than 1,800 consecutive days.

Recently, the load on one business database was much higher than usual. The most intuitive sign was that the load reached 100%. And this was not a matter of doubling or growing several times over: it suddenly increased a hundredfold, and the resulting surge of data writes on the business back end caused severe performance congestion.

1. Introducing read/write separation: initial optimization results

This kind of problem aroused my interest and curiosity. After communicating with the business side, I learned that this business records read-receipt data. Simply put, it is like posting on a microblog and wanting to see how many people have read it and how many have left comments. So there are no transactions, the writes are data-intensive, and there are clear statistical requirements.

Currently, statistics are run every 7 minutes, and there are several types of statistical scenarios, most of which are currently queries at the full-table-scan level. The database architecture is very simple: one master with one slave, plus MHA for high availability.
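
As a rough illustration only (the actual business SQL is not shown here), a statistical query of this kind might look like the hypothetical example below; without a suitable index it is executed as a full table scan:

-- Hypothetical statistics query over a receipt table (column names follow the table
-- structure shown later in this article); with no usable index this is a full table scan.
SELECT action,
       COUNT(*) AS total_receipts,
       COUNT(DISTINCT userid) AS distinct_readers,
       SUM(readtimes) AS total_reads
FROM receipt_12149_428
WHERE create_time >= DATE_SUB(NOW(), INTERVAL 7 MINUTE)
GROUP BY action;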

The direction for improvement is to reduce the pressure on the master library, addressing read and write pressure separately. The write pressure comes from the concurrent writes of the business, and the read pressure comes from the full table scans; both CPU and I/O pressure are very high.

These two problems needed to be tackled in order of priority. The statistical SQL was causing the system resource bottleneck, to the point where the originally simple INSERT statements were showing up as slow-log SQL. By contrast, the write requirement is a hard requirement while the statistical requirement is an auxiliary one, so after communicating with the business side, the quick win was to move the statistical workload from the master to the slave.

Once the read load had been moved over, the write pressure was greatly relieved, and after application-level optimization on the business side, the overall load looked much better:

  • The monitoring of the master library is shown below; you can see a significant decrease in CPU load, from over 90% to less than 10%. I/O pressure also dropped from nearly 100% to around 25%.

  • The monitoring of the slave library is shown below, which shows a clear increase in pressure. The increase at the CPU level is not obvious; the main pressure is at the I/O level, that is, the cost of scanning full table data is very high.

This was the first step of the optimization. On this basis, I started doing index optimization, but comparison showed the effect was very limited. For the statistical workload on the slave, the added indexes could only turn full table scans into full index scans, and the overall load improvement was very limited, so the existing architecture needed further improvement and optimization.
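
For reference, the index-level attempt was roughly along these lines (a sketch with an assumed index, not the exact DDL used); even then, the plan only moves from a full table scan to a full index scan:

-- Hypothetical index added on the slave for the statistics queries.
ALTER TABLE receipt_12149_428 ADD INDEX idx_action_userid (action, userid);

-- Checking the plan of a statistics query: at best it changes from type ALL
-- (full table scan) to type index (full index scan), so the cost remains high.
EXPLAIN SELECT action, COUNT(DISTINCT userid)
FROM receipt_12149_428
GROUP BY action;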

Plan 1:

Considering resource costs and the usage scenario, we temporarily adjusted the architecture as follows: add two data nodes and then use middleware for a distributed architecture design. For the slave library, to save cost for the time being, the original server was given more resources in a single-machine, multi-instance mode, so that the write pressure could be fully supported.

However, there is a hidden danger in this approach: the middleware layer on the slave side takes on the data statistics role, and once a performance problem occurs, the pressure on the middleware is great and the original statistics tasks may be blocked. At the same time, besides disk space, the resource bottleneck on the slave side is I/O pressure, which cannot be solved by adding more space.

After further communication with the business students, I found that they create this kind of table dynamically through configuration, which is difficult to implement in the current middleware scheme, and the statistical requirements would become less and less transparent to the business.

Plan 2:

A more effective improvement is to do data routing at the application level. For example, with 10 businesses: business 1 and business 2 go to the first node, business 3 and business 5 go to the second node, and so on. Data sources are mapped according to the routing configuration, which is relatively controllable and easier to scale, so the architecture was adjusted to this approach instead.
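
As a rough sketch only (the real routing lives in the application's configuration), the mapping can be thought of as a small routing table that the application consults before reading or writing; the table and node names below are assumptions:

-- Hypothetical routing configuration: maps a business to its data node (data source).
CREATE TABLE `business_db_route` (
  `business_id` int(11) NOT NULL COMMENT 'business id',
  `db_node` varchar(32) NOT NULL COMMENT 'target data node',
  PRIMARY KEY (`business_id`)
);

-- Example mapping: businesses 1 and 2 on the first node, businesses 3 and 5 on the second.
INSERT INTO business_db_route (business_id, db_node)
VALUES (1, 'node1'), (2, 'node1'), (3, 'node2'), (5, 'node2');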

The most critical part of the improvement is boosting the performance of the statistical SQL; if that works, the subsequent architectural improvements will be much easier.

2. Introducing column storage to optimize statistical performance

Subsequently the business began to grow explosively, which made optimizing the statistical requirements the key to this round of optimization.

The original master library was under heavy read and write pressure. With read/write separation, the pressure on the read node began to surge. As the business expanded, the demand for statistical queries kept growing: for example, there might be 30 queries instead of 10. So the statistical pressure increased, system response slowed, and the replication delay on the slave library started to grow. The maximum delay reached three hours, at which point the statistics had lost their value.

I made several improvements in this regard:

  • First of all, I had a detailed discussion with the business side to get a clear understanding of the business scenario. This scenario is actually quite suitable for a solution like Redis, but relational MySQL was chosen for reasons of cost and cost-effectiveness. Conclusion: maintain the status quo for the time being.

  • As for read pressure, the setup could not support exponential growth, and even the current situation was worrying. Each statistical requirement of the business involves 5 SQL statements, and optimizing each scenario requires trade-offs. The preliminary result was that three of the five fields needed indexes, and this is not controllable: once one table grows too large and causes delay, the delay of the whole system grows and the statistical requirements collapse as a whole. So piling on indexes to satisfy hard statistical requirements is something of a dead end. Conclusion: index optimization has limited effect; another feasible solution is needed.

  • As for write pressure, a sharding strategy can solve it. This sharding differs from the traditional logic in that it shards at the application layer, with the application side doing the data routing, so sharding is easy to extend as the business grows explosively. With this layer of support, the statistical workload of the business is migrated to the slave library, and the write pressure can be taken over smoothly. At present there is enough headroom on the write side to support exponential growth. Conclusion: once the statistical pressure is reduced, business data routing can begin to be rolled out.

To improve the situation quickly, I wrote a script for automatic collection and management that periodically kills statistics sessions whose queries exceed a timeout. But the delay was still there, the queries were still slow, and it was hard to imagine how large the delay would become under exponential pressure.
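
The script itself is not included here, but its core idea can be sketched in SQL; the 600-second threshold and the SELECT-only filter are assumed values:

-- Generate KILL statements for statistics queries that have run past the timeout
-- (executed against the slave; threshold and filter are illustrative).
SELECT CONCAT('KILL ', id, ';') AS kill_sql
FROM information_schema.processlist
WHERE command = 'Query'
  AND time > 600
  AND info LIKE 'SELECT%';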

After a large number of comparative tests: with a single table of 35 million rows, 8 tables of similar size, and 5 statistical SQL statements, a complete round of statistics took about 17 to 18 minutes, roughly 2 minutes per table on average.

Since there is no transaction correlation, and given the business scenario and the technical implementation, some delay in this scenario is unavoidable. Our approach was to improve the efficiency of the statistical queries while keeping the pressure on the system under control.

An effective way to do this is a data-warehouse-style solution. MySQL itself does not provide a data warehouse, but there are third-party options: one is ColumnStore, based on InfiniDB; another is Infobright. There are also heavier solutions, such as Greenplum's MPP offering. ColumnStore's approach is similar to that MPP model and requires distributed nodes, so Infobright is more lightweight in terms of resources and architecture.

Our table structure is simple, the field types are basic, and we have a lot of hands-on experience within the team.

The improved overall architecture is as follows, without affecting the original master/slave architecture:

On this basis, a data warehouse node is added, and capacity can be expanded further as required.

The table structure is as follows:

CREATE TABLE `receipt_12149_428` (
  `id` int(11) NOT NULL COMMENT 'primary key',
  `userid` int(11) NOT NULL DEFAULT '0' COMMENT 'userid',
  `action` int(11) NOT NULL DEFAULT '0' COMMENT 'action',
  `readtimes` int(11) NOT NULL DEFAULT '0' COMMENT 'readtimes',
  `create_time` datetime NOT NULL COMMENT 'create time'
);

The exported statement looks like this:
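
(The original statement is not reproduced here; a minimal sketch, assuming a CSV export from the MySQL slave followed by LOAD DATA INFILE on the Infobright side, with illustrative file paths, might look like this.)

-- On the MySQL slave: export the table to a delimited file (path and format are assumptions).
SELECT id, userid, action, readtimes, create_time
  INTO OUTFILE '/data/export/receipt_12149_428.csv'
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
FROM receipt_12149_428;

-- On the Infobright node: load the file into the corresponding table.
LOAD DATA INFILE '/data/export/receipt_12149_428.csv'
  INTO TABLE receipt_12149_428
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\n';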

Infobright Community Edition (ICE) does not support DML and offers only limited DDL. Infobright has officially announced that it will no longer release ICE and will focus on IEE development, so support will be limited. Even so, it suits our current needs perfectly.

To get a quick sense of Infobright's strengths:

>select count(id) from testxxx where id>2000;
+------------+
| count(id)  |
+------------+
|  727686205 |
+------------+
1 row in set (6.20 sec)

>select count(id) from testxxxx where id<2000;
+------------+
| count(id)  |
+------------+
|   13826684 |
+------------+
1 row in set (8.21 sec)

>select count(distinct id) from testxxxx where id<2000;
+---------------------+
| count(distinct id)  |
+---------------------+
|                1999 |
+---------------------+
1 row in set (10.20 sec)

So tables with tens of millions of rows are no big deal for it.

I imported the 35 million rows into Infobright, and the total execution time for the 5 statistical queries was 14 seconds, a big improvement over roughly 2 minutes per table. A full batch of queries that used to take 18 minutes now takes less than 3 minutes.

3. Introducing dynamic scheduling to solve statistical delays

Introducing the Infobright solution can fully support the existing statistical requirements; the difficulty is how to support the data flow smoothly. We can set the sync cadence, say every 10 minutes or every half hour, but for now this requires extra scripts or tools.

During the actual implementation, I found there was quite a lot of groundwork to do in advance.

First:

  • The first headache is full synchronization: the first sync has to be a full one, so how do we move that much data into Infobright?

  • The second, and more critical, question is how the synchronization policy should be set, and whether it can be made more flexible.

  • The third problem is that the existing incremental synchronization scheme needs an index on the time field, which is a big challenge to add online.

Second:

According to the current business requirements, statistics may be delayed by at most one hour. If large-scale operational activities are run later on, more accurate data support will be needed, and the existing plan should be able to deliver statistics within half an hour.

If these two major problems could not be solved, it would be hard to implement the data flow at all. I had only about one day left for this, so I planned to make the preparation and testing as solid as possible so that the later integration would go much more smoothly.

Part of the script implementation is described below:

The script takes two input parameters: a start time and an end time. For the first full synchronization, the start time can be set early enough that, with the cutoff time fixed, the extract is logically a full one. In addition, during the full sync the master/slave delay needs to be kept to a minimum, or query services temporarily paused, so that the full data extraction goes more smoothly.
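
Based on that description, the core extraction step can be sketched like this (the file path and the two time literals stand in for the script's parameters):

-- Windowed extraction on the slave; the two time literals are the script's parameters.
-- For the first full sync the start time is simply set far enough in the past.
SELECT id, userid, action, readtimes, create_time
  INTO OUTFILE '/data/export/receipt_12149_428_inc.csv'
  FIELDS TERMINATED BY ','
FROM receipt_12149_428
WHERE create_time >= '2000-01-01 00:00:00'
  AND create_time < '2019-06-01 12:00:00';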

On top of that, the script needs another layer of guarantee: by comparing the current time with the time of the last execution, the window for the next run can be derived. This removes the need to pass parameters into the script and turns the whole thing into a dynamically scheduled, iterative process.
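
One simple way to track this, sketched here as an assumed control table rather than the actual implementation, is a per-table sync checkpoint:

-- Hypothetical checkpoint table: records the end of the last successful sync window.
CREATE TABLE `sync_checkpoint` (
  `table_name` varchar(64) NOT NULL,
  `last_sync_time` datetime NOT NULL,
  PRIMARY KEY (`table_name`)
);

-- The next window is [last_sync_time, NOW()): read the start, run the extract,
-- then advance the checkpoint once the load into Infobright succeeds.
SELECT last_sync_time FROM sync_checkpoint WHERE table_name = 'receipt_12149_428';
UPDATE sync_checkpoint SET last_sync_time = NOW() WHERE table_name = 'receipt_12149_428';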

Considering that roughly 10 GB of data and 30 GB of logs land on disk every day, it is advisable to do the initial import into Infobright from the client side.

In practice, more than 600 tables were involved. I made a list and sorted it by data volume so that the small tables could be imported quickly and the large tables were left until last. The total data volume was about 150 GB.
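
The list can be generated directly from the data dictionary, for example as below (the schema name and table-name pattern are assumptions):

-- Order the tables to import by size, smallest first (schema and name pattern assumed).
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'business_db'
  AND table_name LIKE 'receipt_%'
ORDER BY data_length;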

The performance improvement after importing the data into Infobright was also significant: a set of queries that used to run for half an hour now completes in 70 seconds, which greatly improves the business experience. After the first full synchronization, subsequent syncs can be controlled flexibly based on the actual situation, so for the time being incremental data synchronization is driven in "manual gear".

As for the effect of splitting the statistics out of the data architecture, the pressure on the slave library dropped sharply and efficiency increased greatly.

4. Introducing business routing to smoothly support business expansion

Things were looking much better, but there was still a problem: the current architecture could temporarily cope with intensive data writes, but not with exponentially growing request pressure, and storage capacity was hard to scale.

As I see it, doing data routing at the business level is the best approach; it is more scalable and friendlier to the business.

Therefore, the improvement plan for the next stage is as follows:

Load balancing is achieved through data routing, which is already showing clear results and remains a controllable approach as the business continues to expand. The following shows the I/O pressure on the slave library during some of the recent optimization periods:

After several rounds of problem solving, supplementary work, and follow-up solutions, we went from an initial setback to a successful implementation, and this sharing of MySQL performance and scaling architecture optimization is basically complete. If you have a better approach, feel free to share it in the comments!
