How to do server-side pressure testing

This post is from the blog: How to do server-side pressure testing

Not everything in this post is original. The ideas come from an internal sharing session, combined with a number of online references, so it can be regarded as study notes.

Many QA and RD colleagues may be like me: they have no systematic knowledge of server-side pressure testing. Their impression is that you use a tool such as JMeter to put load on a single interface, adjust the number of threads and loops to create different levels of pressure, and finally calculate TPS and success rate. There are many articles about pressure testing online, but most are entry-level tool tutorials, brief explanations of the process and metrics, or introductions to the full-link pressure testing capabilities and platforms of a few big companies. These articles are either unsystematic or too abstract, and they are not very friendly to people who have had little contact with pressure testing.

This post tries to lay out a complete pressure testing process from a QA perspective and to summarize more general pressure testing ideas, as a more useful reference for everyone.

Pressure test background

There are many kinds of testing. Many articles [1] online play with concepts and throw out three terms: stress testing, performance testing and load testing. Normally there is no need to distinguish them so finely. I don't think their boundaries can be drawn cleanly, at least not in terms of program logic; the differences come mostly from the testing strategy. So this post ignores the differences and simply calls all of them pressure testing or performance testing.

Why do you need pressure testing

Take Alibaba, which technical people are familiar with and which is probably the best of the big domestic companies at pressure testing. In the well-known Double 11 event of 2012, at midnight on November 11, errors appeared across Alibaba's systems: placing orders, shopping cart checkout and the payment system failed, and shopping cart contents were lost. The systems showed a transaction success rate below 50%, which led to a large amount of overselling and heavy losses for Alibaba. After that year's Double 11, the inventory, product, refund and corresponding database teams worked overtime day and night for two weeks to deal with the overselling, and many users had a bad shopping experience.

Why did such a serious problem occur? The capacity of each subsystem on the transaction link was unclear, the traffic that might arrive was misestimated, and there was no complete contingency plan, so everything collapsed at once.

In 2013, Alibaba first proposed full-link pressure testing: on the one hand, every system on the link learns its own pressure limit; on the other hand, every system gets a clear optimization goal, and the bottleneck of the whole link and the resource situation can be evaluated.

Single-system pressure testing and full-link pressure testing

Why is single-system pressure testing not enough?

At the moment an event starts, every system is under heavy pressure from its own services, and the systems also depend on one another. A single-system test does not account for the case where the services it depends on are under heavy pressure too. When one system fails, the fault accumulates as traffic flows along the link, and the impact becomes impossible to measure.

Therefore, the most reliable way is to pressure test under a full simulation of the real scenario, using online full-link pressure testing to find problems in advance.

Pressure test process

A complete pressure testing process usually consists of the following steps, as described in the references at the end of this article:

Setting pressure test targets
Sorting out the pressure test link
Preparing the pressure test environment
Constructing pressure test data
Running the pressure test
Locating bottlenecks and fine-tuning capacity
Pressure test summary

Pressure test targets

Purpose of pressure testing

For a new service with no estimated target, pressure testing provides benchmark data for the service or finds system bottlenecks to optimize
When there is a clear target, pressure testing determines whether the service metrics meet the standard
Routine pressure testing points out the direction of later performance optimization or provides a reference for it

Pressure test metrics

This section lists some common metrics. Not all of them need attention in every test.

QPS: Queries Per Second, the number of requests processed per second
TPS: Transactions Per Second, the number of transactions processed per second; TPS <= QPS
RT: Response Time, equivalent to latency. Both average latency and percentile (Pct) latency can be reported. The average does not reflect the real response delay of a service, so in practice percentiles such as Pct90 and Pct99 are generally used
CPU usage: keep CPU usage below 75%, so that the remaining nodes can absorb the rebalanced load if a node goes down
Memory usage: generally, check whether memory usage has spikes or leaks
Load: the CPU load, which is not CPU usage but the total number of processes being run or waiting to be run by the CPU over a given period. Generally, the load should be less than the number of CPU cores x 2. For details, see Links 1 and 2
Cache hit ratio: how much traffic hits the cache layer (redis, memcached, etc.)
Database time: the database is the lifeline of a service. Most service failures are caused by database failures
Network bandwidth: whether bandwidth is a bottleneck
Interface error rate or number of error logs

Here are the differences between QPS and TPS:

QPS generally refers to the number of queries a server can respond to per second, or more abstractly, how much network traffic it can handle per second
TPS counts complete transactions, each of which may contain a series of requests. For example, visiting a web page counts as one transaction, but it may trigger multiple requests to multiple servers for text, JS, images and so on. Because each of those requests is traffic, they all count toward QPS

In performance testing, the average is of very limited use. An average is often read as a midpoint with half of the values on either side, but even under that reading, what does it mean for a sensitive performance metric? That 50% of users are happy with the response time while the other 50% perceive latency? Or that the system is stable 50% of the time and out of control the other 50%?

Average response time is only a meaningful metric when the response time of every request is almost the same. As another example, we all know how silly the notion of per capita wealth is. In 2019 there was a very funny news story that Tencent employees earn an average monthly salary of seventy thousand, which shows how unreliable averages are 😂. Below is a histogram of the response times of a real-world system. The fastest 20% of requests have a very low RT because they either hit the cache or fail fast. Such very short times distort the mean, so it does not reflect the RT that most requests see, which is the actual performance of the system.

So we should not look at the best results; on the contrary, we should control the worst results. A user who has a good experience is not guaranteed to spread good word of mouth, but a user who has a bad experience is guaranteed to complain loudly online. This is why the average is not a good enough reference: the happy results fool our eyes, and the average loses too much information in a pressure test.
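To make the point about averages concrete, here is a minimal sketch that compares the mean with percentiles. The response times are made up for illustration, and the nearest-rank percentile is just one common definition; monitoring tools may compute percentiles slightly differently.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]

# Hypothetical response times in ms: 90 fast requests, 9 slow ones, 1 very slow one.
rt_ms = [20] * 90 + [200] * 9 + [3000]

mean_rt = statistics.mean(rt_ms)
p50, p90, p99 = (percentile(rt_ms, p) for p in (50, 90, 99))
print(f"mean={mean_rt:.0f}ms p50={p50}ms p90={p90}ms p99={p99}ms max={max(rt_ms)}ms")
```

With this data the mean is 66 ms, a value that no request actually experienced: 90% of requests finish in 20 ms and the slowest take 200 ms or more, which only the percentiles and the maximum reveal.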

To sum up, a more scientific way to evaluate is to tie the metric, the success rate and the traffic together:

xx% of responses are returned within xx milliseconds, with a success rate of xx%.
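A target written this way can be checked mechanically. The sketch below assumes each request is recorded as a hypothetical (rt_ms, ok) pair and that failed requests still count toward the latency percentile; both are assumptions for illustration, not a rule from the references.

```python
import math

def meets_target(results, pct, max_rt_ms, min_success_rate):
    """Check 'pct% of responses within max_rt_ms, with success rate >= min_success_rate'.

    results: list of (rt_ms, ok) pairs, one per request (hypothetical record format).
    """
    success_rate = sum(1 for _, ok in results if ok) / len(results)
    ordered = sorted(rt for rt, _ in results)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)  # nearest-rank percentile
    return ordered[rank - 1] <= max_rt_ms and success_rate >= min_success_rate

# Made-up run: 95 fast successes, 3 slow successes, 2 slow failures.
results = [(50, True)] * 95 + [(400, True)] * 3 + [(1000, False)] * 2

print(meets_target(results, pct=95, max_rt_ms=100, min_success_rate=0.97))
print(meets_target(results, pct=99, max_rt_ms=100, min_success_rate=0.97))
```

The same run passes a "95% within 100 ms" target and fails a "99% within 100 ms" target, which is why the percentile has to be part of the target statement.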

Following this guideline, you can derive some test ideas:

Maximum throughput of the system under a response time constraint (throughput here is not strictly defined as QPS or TPS)
The throughput the system can withstand with a 100% success rate, regardless of response time
The highest throughput the system can handle while tolerating a certain failure rate and some slow responses (for example, a 95% success rate, and the maximum QPS at which the fastest 95% of requests return within xx ms)
In all of the above scenarios, duration and resources should also be considered. For example, sustaining maximum throughput for 10 minutes is different from sustaining it for 1 hour. For each duration, machine resource usage (CPU, memory, load, handles, thread count, IO, bandwidth) should stay within a reasonable range

Target estimation

Before a pressure test is carried out, there must be a target, that is, the performance the interface or system is expected to reach. A pressure test without a purpose is a waste of manpower. Several methods for estimating the target follow.

Historical monitoring data

For an interface that is already online and has historical monitoring data, look through the history to find the peak QPS and Pct99. Example: if interface A is online and monitored, use its existing monitoring data from after a major event or over a long period.

Analogy

A new or unmonitored interface has no historical data, but interfaces with similar functions may have it. In that case you can derive the pressure test target by analogy. Example: suppose last year's Taobao Double 11 order interface had QPS = x and RT = y. This year the Tmall platform is integrated and its Double 11 scenario is similar to Taobao's, so QPS = x and RT = y are used as well (the example is not rigorous, but you get the idea).

Estimation

For a new interface or one that is not monitored online, there is no historical data and no data from similar interfaces for reference. In this case the peak has to be estimated. The common method is the 80/20 rule: 80% of the requests in a day arrive in 20% of the time.

Peak QPS = (total PV * 0.8) / (60 * 60 * 24 * 0.2)
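The formula above can be written as a small helper. This is only a sketch; the daily PV of 10 million is a made-up input, and the parameter names are mine.

```python
def peak_qps(total_pv, traffic_share=0.8, time_share=0.2, seconds_per_day=24 * 60 * 60):
    """Peak QPS under the 80/20 rule: traffic_share of requests arrive in time_share of the day."""
    return (total_pv * traffic_share) / (seconds_per_day * time_share)

daily_pv = 10_000_000  # hypothetical total PV for one day
print(round(peak_qps(daily_pv)))           # estimated peak QPS
print(round(daily_pv / (24 * 60 * 60)))    # plain daily average, for comparison
```

With the default 0.8 / 0.2 split, the estimated peak is four times the daily average, so sizing capacity from the average alone would fall well short.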

If there are no special requirements for RT, the following defaults are generally used:

Single-service, single-table interface: RT < 100 ms
Complex interface: RT < 300 ms
Interface with a large data volume or a long call chain: RT < 1 s

Example 1: an e-commerce flash sale (seckill). It is estimated that 10 million people will take part at the same time; for simplicity, assume a total QPS of 10 million. Because the front-end countdown is implemented in different forms, requests are spread out over 2 s, and web servers such as nginx reject requests with a 20% probability. So the total QPS of the single interface = 10 million / 2 * (1 - 0.2) = 4 million/s, and the final pressure test target is a QPS of 4 million/s.
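The arithmetic of this example, written out as a sketch. The function and parameter names are hypothetical; the inputs are the ones from the example.

```python
def flash_sale_target_qps(participants, spread_seconds, reject_ratio):
    """Requests per second that reach the interface after spreading and random rejection."""
    return participants / spread_seconds * (1 - reject_ratio)

target = flash_sale_target_qps(participants=10_000_000, spread_seconds=2, reject_ratio=0.2)
print(f"{target:,.0f} requests/s")
```

Each factor is a separate lever: spreading the countdown over a longer window or raising the rejection ratio lowers the target that the backend must be tested against.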

Example 2: an e-commerce all-day discount event (dragon-slaying blade, free on click, level 99 in one hit... ahem, off topic). According to the 80/20 rule, the traffic peak is estimated to be a total of 4 h: the lunch break (12:00 to 13:00) and after work in the evening (19:00 to 22:00). The estimated peak QPS of the interface = all-day PV of the event interface / (4 * 3600 s).

Other

Besides the situations above, there are certainly interfaces for which we have nothing to start from: no reference, no forecast, no historical data. In that case we can only proceed bit by bit, slowly raising the pressure while collecting data, until we find the optimal processing capacity of the interface.
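The "raise the pressure bit by bit" approach can be sketched as a loop that steps up the load until a latency limit is exceeded. The measure_p99_ms callback stands in for a real measurement at a given load, and fake_p99_ms is a toy latency model used only so the sketch runs; both are assumptions.

```python
def find_capacity(measure_p99_ms, max_rt_ms, start_qps, step_qps, max_qps):
    """Raise the offered load step by step; return the last QPS whose p99 stayed within max_rt_ms."""
    best = None
    qps = start_qps
    while qps <= max_qps:
        if measure_p99_ms(qps) > max_rt_ms:
            break  # the limit is exceeded, stop raising the pressure
        best = qps
        qps += step_qps
    return best

def fake_p99_ms(qps, capacity=1000, base_ms=20):
    # Toy model, not a real system: latency grows sharply as load nears a hypothetical capacity.
    if qps >= capacity:
        return float("inf")
    return base_ms * capacity / (capacity - qps)

print(find_capacity(fake_p99_ms, max_rt_ms=100, start_qps=100, step_qps=100, max_qps=2000))
```

In a real test each step would run for a fixed duration and collect the other metrics as well; a smaller step near the turning point gives a more precise answer.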

Pressure test preparation

Pressure test scenarios

A pressure test must have a purpose. It is not a matter of picking a few random interfaces and sending load at them, and testing every interface is neither possible nor meaningful. Pressure tests therefore have priorities, which is why sorting out the test scenarios is so important. High-value scenarios mainly include the following:

High-frequency business scenarios (pull-to-refresh on the Toutiao home page)
Key business scenarios that are used infrequently but cause serious problems when they fail (WeChat account login)
Scenarios with a high performance cost (placing an order on Taobao)
Scenarios that have had problems before

Pressure tests are divided into single-interface tests and scenario-based tests. The former is simpler, while the latter mixes multiple interfaces to form a business scenario. The approach is the same for both.

QA must align with RD when sorting out scenarios, confirming the responsible RD for each interface, the interfaces to be tested, the current performance of the system, and the test targets. When setting the target of each interface, consider whether the object under test is a single instance, a single equipment room or a cluster. In terms of details, confirm whether the test targets a single interface or a scenario, the traffic ratio and priority of each interface, and whether enough pressure is needed to trigger further O&M capabilities such as automatic scaling or degradation.

Pressure test environment

After sorting out the scenarios, verify that the pressure test link is complete and matches expectations. From one service to the next, does every service on the link need to be tested? Have downstream services such as audit and security been considered? Will dirty data generated during the test affect online data? Going into more detail, if a particular downstream service does not take part in the test, how should it be handled? To answer these questions, you may need to push the owners of the services along the whole link to adapt their services to pressure test traffic. After the changes, self-testing and verification are required before the test can formally start. The following are some key problems, some of which are quoted from the reference on the general idea of full-link pressure testing listed at the end of this article.

Dirty data problem

This problem does not exist if you work in a separate, isolated environment
Shadow table: if the test runs online, data is generally written to a shadow table (a table with a different name but the same schema as the original table) instead of the original table, which isolates pressure test data from online data
Whitelist: designate test IDs or test accounts. After the data is stored in the database, test data is identified by these IDs and handled uniformly
Adapting the various storage layers for pressure testing, including isolation in the cache layer, message queues and offline databases. The conventional method is to pass a pressure-test mark along the whole link (also known as traffic coloring, which is quite a vivid name), for example by adding an is_stress mark to the JSON data. The storage layer uses the mark to recognize pressure test traffic, and pressure test data gets a designated prefix or suffix before it is stored
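As a rough sketch of traffic coloring, the helpers below read an is_stress mark from a JSON-like payload, pass it on to the next hop, and route marked traffic to shadow tables and prefixed cache keys. The shadow_ prefix and the function names are hypothetical conventions, not taken from any particular framework.

```python
SHADOW_PREFIX = "shadow_"  # hypothetical naming convention for shadow tables and cache keys

def is_stress_request(payload):
    """True if the request payload carries the pressure-test mark."""
    return bool(payload.get("is_stress", False))

def table_for(payload, table):
    """Route marked traffic to the shadow table, everything else to the real table."""
    return SHADOW_PREFIX + table if is_stress_request(payload) else table

def cache_key_for(payload, key):
    """Keep pressure-test cache entries apart from online ones by prefixing the key."""
    return SHADOW_PREFIX + key if is_stress_request(payload) else key

def with_mark(incoming, outgoing):
    """Copy the mark onto a downstream request so the coloring survives every hop."""
    return {**outgoing, "is_stress": is_stress_request(incoming)}

request = {"user_id": 42, "is_stress": True}
print(table_for(request, "orders"), cache_key_for(request, "cart:42"))
print(with_mark(request, {"sku": "A-1"}))
```

The important property is that the decision is made in one place per storage layer, so business code does not need to know whether it is handling test traffic.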

What about services that do not participate in the pressure test

Mock server: records requests and responses and replays them, without intruding on business code
Service stub: a stub inside the service handles the pressure test traffic. Similar to a stub in unit tests, it is code that simulates the service and returns a response, so the service code has to be modified
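A service stub can be as small as a branch on the pressure-test mark. In this sketch, check_order and the canned response are hypothetical, and call_audit_service stands for whatever client normally calls the real downstream service.

```python
def check_order(order, call_audit_service):
    """Return a canned response for pressure-test traffic, otherwise call the real service."""
    if order.get("is_stress"):
        # Stub path: the downstream service that is not part of the test is never touched.
        return {"order_id": order["order_id"], "passed": True, "stubbed": True}
    return call_audit_service(order)

def fake_audit_client(order):
    # Stand-in for the real client, only so the sketch runs on its own.
    return {"order_id": order["order_id"], "passed": True, "stubbed": False}

print(check_order({"order_id": 1, "is_stress": True}, fake_audit_client))
print(check_order({"order_id": 2}, fake_audit_client))
```

A fixed response like this hides the latency of the real service, so the stub is usually given a delay close to the real one when the results need to be representative.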

You can also deploy an independent offline environment for pressure testing. Without affecting the online environment, make sure the equipment room, network, storage and upstream and downstream services are consistent with the online environment. Because the machines are isolated from the online environment, a machine failure will not affect online services. This approach only suits a few systems, because it is difficult to deploy every system on the whole link independently, so its scope is limited.

Note: how are upstream and downstream defined? According to RFC 2616:

Upstream and downstream describe the flow of a message: all messages flow from upstream to downstream.

The downstream input comes from the upstream output. Suppose there are services A and B, and A calls B (or A depends on B), then B is upstream of A and A is downstream of B, because A’s input comes from B’s output.

To put it more simply, the closer something is to the user, the further downstream it is.

The more common approach is to pressure test directly in the online environment, triggered manually or on a schedule during periods of low machine load (such as late at night).

Pressure test monitoring system

After confirming the technical support for the pressure test process and the mock data, also confirm that the monitoring of the pressure test link is complete, so that problems during the test are detected promptly and historical test data accumulates, and confirm that the monitoring system itself is reliable and in place. Common monitoring items (that is, the pressure test metrics) include:

Traffic, response time and success rate of core interfaces and core dependencies
Message queues, caches, databases
Physical machine resources

Pressure test data

In fact, there is nothing mysterious about pressure test data. The advice found online to "produce data according to the business model" is too abstract to be easy to understand. It simply means constructing the data you need according to the core business scenarios. The key is how to simulate the online data distribution in a sound way. According to the references at the end of this article, Alibaba's Double 11 promotion has the following business traffic funnel model, in which traffic must be allocated to the different scenarios in sound proportions. These proportions come from analysis rather than from guesswork. As you can imagine, not all traffic in an Alibaba promotion ends up in the payment flow; a lot of it inevitably stops at earlier steps. So if you construct all of your pressure test data as "go to the payment scenario", your results will not be accurate.
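Allocating traffic along a funnel can be sketched in a few lines. The scenario names and ratios below are invented for illustration; real ratios have to come from analysis of online data.

```python
def allocate_qps(entry_qps, funnel_ratios):
    """Split the entry traffic across scenarios; each ratio is relative to the entry traffic."""
    return {scene: round(entry_qps * ratio) for scene, ratio in funnel_ratios.items()}

# Hypothetical funnel: only a fraction of the traffic reaches each later step.
funnel = {"browse_item": 1.0, "add_to_cart": 0.3, "place_order": 0.1, "pay": 0.08}

for scene, qps in allocate_qps(10_000, funnel).items():
    print(f"{scene}: {qps} QPS")
```

With these made-up ratios, a 10,000 QPS entry load puts only 800 QPS on payment, far less than a test that sends every simulated user all the way to the payment scenario.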

To better simulate real online user scenarios and data, a common method is to dump online data for pressure testing. There are two simple approaches:

Replay pre-recorded online traffic directly to the pressure test link
Divert part of the live traffic to the pressure test link

Dumped data cannot be used directly: without a pressure-test mark it would pollute online data, and it also contains private user data. Online data can serve as the data source and be turned into pressure test data after collection, filtering, desensitization and similar steps. Note the following:

Make sure the data carries the pressure-test mark
Account data needed for login authentication must be prepared in advance
Data should be as close to real data as possible, such as prices and pictures
Check whether the data has special requirements, such as different device models
Try to keep the cache hit ratio the same as online
Special requirements of other business features……
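The conversion from dumped online records to pressure test records might look like the sketch below. The field names (user_id, phone, price), the salt and the masking rules are all assumptions; a real pipeline would follow the company's own desensitization rules.

```python
import hashlib

def to_stress_record(record, salt="stress-test"):
    """Desensitize one dumped record and add the pressure-test mark.

    Assumes the phone number is at least 7 characters long.
    """
    masked = dict(record)  # copy, so the dumped record itself is left untouched
    digest = hashlib.sha256((salt + str(record["user_id"])).encode()).hexdigest()
    masked["user_id"] = digest[:16]  # stable pseudonym instead of the real ID
    phone = record["phone"]
    masked["phone"] = phone[:3] + "*" * (len(phone) - 7) + phone[-4:]
    masked["is_stress"] = True  # mark, so storage layers can isolate the data
    return masked

online = {"user_id": 42, "phone": "13812345678", "price": 1999}
print(to_stress_record(online))
```

Business fields such as the price are kept as they are, so the test data stays close to real data, while the pseudonym is stable and the same user maps to the same test identity across records.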

The pressure testing process

The basic approach to quality assurance is always the same: start at fine granularity and integrate gradually into the whole system, as in unit test -> interface test -> integration test. Pressure testing also starts simple and moves step by step toward full resources and the full link. A reference process: single interface on a single machine -> single interface with 1/4 resources -> scenario with 1/4 resources -> full link with full resources -> dial test.

Single interface, single machine

Deploy a single service on a machine with a single CPU (or very few physical resources) and measure the service's per-core performance (unit: QPS/core) with external links and the network excluded. Capacity can then be expanded based on the per-core figure and the target value. In addition, because this is a single interface on a single machine, other interface requests have no effect, and upstream and downstream services will not become bottlenecks as long as they have sufficient resources, so the result reflects the real performance of the service.
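Expanding capacity from the per-core figure is simple arithmetic. In this sketch the utilization_cap parameter is my own addition, meant to leave headroom in the spirit of the CPU usage guideline; the numbers are made up.

```python
import math

def cores_needed(target_qps, qps_per_core, utilization_cap=1.0):
    """Cores required to reach target_qps if each core is only loaded up to utilization_cap."""
    return math.ceil(target_qps / (qps_per_core * utilization_cap))

print(cores_needed(10_000, 300))                        # cores running flat out
print(cores_needed(10_000, 300, utilization_cap=0.75))  # leaving 25% headroom per core
```

This estimate assumes throughput scales linearly with cores, which is exactly what the later stages of the process are meant to verify.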

The single-interface single-machine test can find problems before large-scale pressure testing formally begins, which makes it easy for RD to optimize performance and quickly check the effect. Some problems are first discovered at this stage, while more deeply hidden ones are only exposed later in the full-link, high-traffic test.

Single interface, 1/4 resources

After some performance optimization has been completed on the server during the single-interface single-machine test, you can move on to the single-interface 1/4-resource test. It verifies whether the single-interface performance measured on a single machine grows linearly when capacity is expanded to 1/4 of the resources, whether there is any performance loss, and where the loss comes from.
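Whether performance grows linearly can be expressed as a single ratio. The measurements below are invented; only the calculation is the point.

```python
def scaling_efficiency(single_qps, scaled_qps, scale_factor):
    """Measured throughput after scaling, relative to perfectly linear scaling (1.0 = linear)."""
    return scaled_qps / (single_qps * scale_factor)

# Hypothetical measurements: 500 QPS on one machine, 3400 QPS on eight machines.
efficiency = scaling_efficiency(single_qps=500, scaled_qps=3400, scale_factor=8)
print(f"scaling efficiency: {efficiency:.0%}")
```

An efficiency well below 100% means something shared (a database, a lock, a downstream service, the network) is absorbing the gain, and that is where to look for the loss.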

Scenario, 1/4 resources

The limitations of single-interface pressure testing are obvious. Scenario-based pressure testing also exercises other interfaces of the upstream and downstream services, so it can uncover problems that single-interface testing cannot, and it is closer to real online user scenarios.

Full link, full resources

After all resources are in place, this is the last step of the internal pressure testing process, used to estimate whether the system can withstand the online pressure.

Dial test

In addition to intranet pressure testing, dial tests should be performed to check whether the bandwidth from the client to the server meets expectations. Since the intranet tests have already verified service performance, a single scenario is enough for the dial test. (Put simply, a dial test is like pressure testing the CDN, checking whether the CDN nodes in different regions have sufficient resources.)

Pressure test strategy

The pressure testing process should also be planned in advance, with manual strategy adjustments added as needed. Alibaba's big promotions also have a warm-up stage that runs part of the traffic ahead of time so that data is cached in advance. Formal pressure testing can be subdivided into several strategies, quoted from the references at the end of this article:

Peak pulse: whether the traffic climbs gradually along a gentle slope, or rises sharply and stays at the peak
System limit: turn off fallback features such as circuit breaking, degradation and rate limiting, raise the pressure and observe the inflection point of system performance
Fallback policy verification: turn on fallback features such as circuit breaking, degradation and rate limiting, and check whether they take effect and whether the system holds up
Destructive test: verifies that contingency plans are effective, similar to rehearsing the plans in a disaster recovery drill
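The two shapes mentioned under "peak pulse" can be described as per-second target QPS lists that a load generator would follow. This is only a sketch of the shapes; the function names and numbers are made up.

```python
def ramp_profile(start_qps, peak_qps, ramp_seconds, hold_seconds):
    """Target QPS for each second: a linear climb to the peak, then hold."""
    ramp = [start_qps + (peak_qps - start_qps) * t // ramp_seconds for t in range(ramp_seconds)]
    return ramp + [peak_qps] * hold_seconds

def pulse_profile(base_qps, peak_qps, warmup_seconds, hold_seconds):
    """Target QPS for each second: a short warm-up, then an abrupt jump to the peak."""
    return [base_qps] * warmup_seconds + [peak_qps] * hold_seconds

print(ramp_profile(0, 1000, ramp_seconds=5, hold_seconds=3))
print(pulse_profile(100, 1000, warmup_seconds=2, hold_seconds=3))
```

The ramp suits finding the turning point of the system, while the pulse is closer to the moment a flash sale opens; the warm-up seconds in the pulse play the role of the pre-caching stage described above.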

In addition to the preceding metrics, check whether traffic is evenly distributed across equipment rooms. If it is not, check whether load balancing is working.
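One simple way to check evenness is to compare each equipment room with the mean. The room names, QPS values and the 15% tolerance below are assumptions for illustration.

```python
def traffic_skew(qps_by_room):
    """Relative deviation of each room's QPS from the mean across rooms."""
    mean = sum(qps_by_room.values()) / len(qps_by_room)
    return {room: qps / mean - 1 for room, qps in qps_by_room.items()}

def uneven_rooms(qps_by_room, tolerance=0.15):
    """Rooms whose traffic deviates from the mean by more than the tolerance."""
    skew = traffic_skew(qps_by_room)
    return sorted(room for room, value in skew.items() if abs(value) > tolerance)

observed = {"room_a": 1100, "room_b": 1100, "room_c": 800}  # hypothetical readings
print(uneven_rooms(observed))
```

Here only room_c is flagged, since it receives 20% less than the mean; the next step would be to look at the load balancer weights and health checks for that room.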

Wrapping up the pressure test

The end of the run does not mean the end of the pressure test.

Data cleanup

If you used shadow tables, cleanup is easier: just drop the shadow tables. If data was written directly into the online database, a lot of pressure test data may need to be cleaned up. In that case the data must be colored during the test (for example by using designated test accounts or by having the traffic carry a pressure-test mark), the mark must be passed through layer by layer, and finally the data is deleted according to the mark.
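Both cleanup paths can be shown with an in-memory SQLite database. The table names and the is_stress column are hypothetical; a production cleanup would run against the real storage and usually in batches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, is_stress INTEGER NOT NULL)")
conn.execute("CREATE TABLE shadow_orders (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO orders (id, is_stress) VALUES (?, ?)", [(1, 0), (2, 1), (3, 1)])

# Shadow-table case: all pressure test data lives in its own table, so drop it.
conn.execute("DROP TABLE shadow_orders")

# Marked-data case: delete only the rows that carry the pressure-test mark.
deleted = conn.execute("DELETE FROM orders WHERE is_stress = 1").rowcount
remaining = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
print(deleted, remaining)
```

The second path only works if the mark was written reliably on every row during the test, which is why the mark has to be passed through every layer.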

Common problems

Here are some typical problems you might find:

Too many HTTP headers, which occupy extra bandwidth
spin_lock has a large impact on RT; optimize how locking is done
Adjusting the number of Nginx workers can improve performance
Inappropriate number of long-lived connections
The code does not reuse objects well
The cache hit ratio is below expectations
There is redundancy in the business process
A cache layer is missing
Response codes or error codes can be refined further
Insufficient resources in downstream services (monitoring, storage and others)
Internal systems need a configuration change or negotiation to lift rate limiting

…

Pressure test summary

Here is a complete example of the pressure testing process:

Determine the pressure test targets and estimate the target value of each metric
Determine the interfaces to be tested based on priority and application scenario
Sort out the services on the pressure test link and verify that the link is complete
Adapt the services on the link to support pressure testing
Prepare the pressure test data and confirm the pressure test strategy
Start the pressure test, monitor the metrics, and verify the effect of performance optimizations over multiple rounds
Clean up the pressure test environment
Write the pressure test summary report

Finally, the pressure test should produce a summary report that records the whole plan, process and conclusions, states the targets, interfaces and data used, lists the problems found and proposes optimizations. Often, by the time the report is finished, the performance problems have basically been solved. The value of the report is to pull together the whole process and provide experience and guidance for later pressure tests.

References

Why Averages Suck and Percentiles are Great

CoolShell- How to do performance tests

The general idea of full link pressure measurement

An exclusive sneak peek | How does Alibaba do Double 11 full-link pressure testing?

Experience traffic rush hour: an Ali technology employee’s double 11 years

What is Upstream and Downstream in Software Development?

An Ali tech man experienced six years of “Double 11” : Technology changes Ali


[1] www.guru99.com/performance…