Chinese source: www.infoq.com/cn/articles…

The original English: medium.freecodecamp.org/how-we-fine…

[Screenshot: stats from one of our HAProxy servers, showing the TCP connection count and memory usage]

Looking closely at the above image, we can see two things:

  1. The machine established 2.38 million TCP connections;
  2. In this case, the memory usage is about 48 GB.

Doesn’t it look great? Wouldn’t it be nice if someone shared the configuration and kernel tuning needed to reach this scale on a single HAProxy server? Well, this article describes that process in detail ;)


This article is the last in a series on HAProxy stress testing. If you have time, I suggest you read the rest of this series. This will help us understand some of the background needed for kernel-level tuning mentioned in this article.

  • HAProxy Stress Testing (Part 1)
  • HAProxy Stress Testing (Part 2)

Achieving the numbers shown above relies on many small pieces working together. Before presenting the final HAProxy configuration (you can scroll to the bottom if you are impatient), this article retraces the steps I took towards that goal.

What are we going to test

The component under test is HAProxy version 1.6. We have been running this version in production on 4-core, 30 GB machines, but so far none of that traffic is SSL-based.

In this experiment, we will test two indicators:

  • The first is the increase in CPU utilization when we move the entire load from non-SSL to SSL connections. CPU utilization is bound to rise because of the longer 5-way handshake and the packet encryption required for SSL connections.
  • Second, we wanted to test the inflection point between the number of requests and the maximum number of concurrent connections in the current production configuration.

We need the first metric because the feature we are rolling out relies on SSL-based communication. The second metric helps us decide how many dedicated HAProxy machines to provision in production.

Test the components used

  • Multiple load-generating client machines

  • Several HAProxy 1.6 machines with different configurations:

    • 4-core, 30 GB RAM
    • 16-core, 30 GB RAM
    • 16-core, 64 GB RAM
  • Back-end machines to serve these concurrent connections

HTTP and MQTT

If you read the first article in this series, you should know that our entire infrastructure supports two protocols:

  • HTTP
  • MQTT

Our technology stack does not use HTTP 2.0, so there is no functional requirement to keep HTTP connections alive. As a result, the connection count (inbound + outbound) on a single production HAProxy machine is about 2 × 150K. Although the concurrent connection count is not high, the number of requests per second is quite high.

MQTT, on the other hand, is a different kind of communication protocol. It offers quality-of-service (QoS) parameters and supports persistent connections, so continuous two-way communication can take place on an MQTT channel. Since HAProxy proxies MQTT as plain TCP connections, we see around 600-700K TCP connections per server at peak times.

We expect the stress test to give concrete data on the number of connections for both HTTP and MQTT protocols.

For HTTP application servers, there are plenty of off-the-shelf tools that help with stress testing and provide summary reports, charts, and so on, but for the MQTT protocol there are few comparable tools. We did develop a tool of our own, but it was not stable enough under this kind of load.

So we decided to use the HTTP stress test client to simulate the MQTT protocol configuration. Isn’t that interesting? Let’s move on.

Initial setup

The tuning process will be described in detail later, hopefully to help with stress testing and performance testing in similar scenarios.

  • We first used a 16-core, 30 GB machine to run HAProxy. We did not go straight to our current production configuration because terminating SSL at HAProxy was expected to consume a lot of CPU.
  • On the server side, we used a simple Node.js service that replies pong to every ping request.
  • For the client, we started with Apache Bench (ab). The main reason is that ab is one of the better-known and more stable HTTP stress-testing tools, and it produces a good summary report that helps analyze the results.

ab provides a number of useful parameters, which we also used in the later stress tests, for example:

  • -c defines the number of concurrent requests made to the service;
  • -n defines the total number of requests for the current stress test;
  • -p points to a file containing the body to send with a POST request;

Looking closely at these parameters, you can see that many different test cases can be built by combining just these three. A simple ab command looks like this:

ab -S -p post_smaller.txt -T application/json -q -n 100000 -c 3000 http://test.haproxy.in:80/ping

The command output is similar to:
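
The exact figures vary from run to run; an ab summary report has roughly this shape (the numbers below are illustrative, not our actual results):

Concurrency Level:      3000
Time taken for tests:   36.259 seconds
Complete requests:      100000
Failed requests:        0
Requests per second:    2757.92 [#/sec] (mean)
Time per request:       1087.77 [ms] (mean)

Percentage of the requests served within a certain time (ms)
  50%    912
  95%   2101
  99%   3220
 100%   4512 (longest request)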

Some of the data that need special attention are:

  • 99th percentile latency
  • Time per request
  • Number of failed requests
  • Requests per second

The biggest problem with ab is that the number of requests per second cannot be controlled directly by a parameter. We can only adjust the concurrency to approach the desired request rate, which leads to a lot of trial and error.

Universal chart

We cannot draw conclusions from a handful of random stress-test runs, because such data is meaningless. To obtain meaningful results, a series of test scenarios has to be designed, so we used this kind of graph as a reference:

The figure shows that up to a particular point, the latency is essentially unchanged as the total number of requests increases. However, after this inflection point, the response delay increases almost exponentially. This is the inflection point of the machine or configuration that we want to measure.

Ganglia

Before providing some test results, let’s introduce the Ganglia tool.

Ganglia is a scalable distributed monitoring system designed for high-performance computing systems such as clustering and grid computing

The graphs below are screenshots of monitoring data from one of our servers to give you a sense of Ganglia and the graphic information it can provide.

Doesn’t it look good?

We use Ganglia to monitor the HAProxy server to provide some core metrics, including:

  1. TCP connections: This tells us the total number of TCP connections created on the current system. Note that this data is the sum of the input and output connections.
  2. Data packets sent and received: Indicates the number of TCP packets sent and received by HAProxy.
  3. Data Sent and received: Indicates the total amount of data actually sent and received.
  4. Memory: The memory change of the server during the stress test.
  5. Network: Learn about the network bandwidth during the stress test.

Here are the baseline figures from previous tests and the levels we hoped to reach in this stress test:

  • Number of TCP connections 700K
  • 50K packets were sent and 60K packets were received
  • The amount of data sent and received ranges from 10 MB to 15MB
  • Memory consumption is about 14-15GB
  • The network bandwidth is 7MB

All of the above figures are per-second values

HAProxy nbproc configuration

When we first stress-tested HAProxy with SSL enabled, CPU usage spiked but the number of requests per second stayed very low. Checking with the top command, we found that HAProxy was using only one CPU core, while the other 15 cores on the machine sat idle.

A bit of googling revealed that HAProxy has a setting that lets it use multiple cores. It is called nbproc; for more details, refer to this article:

Blog.onefellow.com/post/824783…

Enabling this setting allowed HAProxy to take full advantage of multiple cores, which is what made the various mixed scenarios in the rest of the stress tests possible.
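
For reference, a minimal sketch of what an nbproc setup in haproxy.cfg can look like (illustrative values, not our exact production configuration):

global
    # run one HAProxy worker process per core we want to use
    nbproc 4
    # optionally pin each process to its own CPU core
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3

One thing to keep in mind with nbproc greater than 1 is that statistics and stick-tables are kept per process, which is part of why the article linked above is worth reading before enabling it in production.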

Stress testing with ab

From here on, we get into the actual stress testing; the metrics and target figures have already been covered, so we will not repeat them.

Our only goal at the beginning was to find the performance inflection point by varying the ab parameters described earlier.

The table above shows the results of those stress-test rounds. It took more than 500 runs to collect these numbers, and as the results make clear, there were many moving parts in each run.

Single client problem

As the load gradually increased, we found that the stress-test client itself became the bottleneck. According to the Apache Bench documentation, ab uses only a single core when issuing requests and has no setting to spread the work across multiple cores.

To improve client throughput, we used a Linux tool called GNU Parallel. As the name suggests, it runs commands in parallel so that all CPU cores can be used, which is exactly what we needed. Here is an example of running multiple ab clients with Parallel:

cat hosts.txt | parallel 'ab -S -p post_smaller.txt -T application/json -n 100000 -c 3000 {}'

sachinm@ip-192-168-0-124:~$ cat hosts.txt
http://test.haproxy.in:80/ping
http://test.haproxy.in:80/ping
http://test.haproxy.in:80/ping

The above command runs three ab clients in parallel, all hitting the same URL at the same time, which relieved the client-side performance bottleneck.

The sleep and times parameters of the server

We’ve already mentioned some of the data collected by Ganglia, but let’s discuss how to simulate this data generation.

  1. The number of packets sent and received. This data can be simulated by sending some data in a POST request. This approach also helps with bandwidth and bytes sent and received in Ganglia data items.
  2. Number of TCP connections established. This took us a long time to simulate. Imagine that a request takes one second to answer: holding our target connection count would then require roughly 700K requests per second. Such a connection count builds up easily in production, but that request rate is almost impossible to generate in our test setup.

You might ask: how did we do it, then? We introduced a sleep parameter on the POST request, which makes the server sleep for a given number of milliseconds before returning the response; this simulates slow, long-lived requests from production. With the sleep set to about 20 minutes, roughly 583 requests per second are enough to hold around 700K open connections (700,000 / 1,200 s ≈ 583).

In addition, we introduced another parameter on the POST request to HAProxy: times. It specifies how many times the server should write response data before closing the TCP connection, which helps generate more traffic per connection for the simulation.
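
To make this concrete, here is a sketch of what a single request carrying both parameters might look like (the header values here are arbitrary examples):

# ask the back end to sleep for up to ~20 s before replying, and to write
# the response body twice before closing the TCP connection
curl -X POST http://test.haproxy.in:80/ping \
     -H "Content-Type: application/json" \
     -H "sleep: 20000" \
     -H "times: 2" \
     -d @post_smaller.txt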

Problems encountered with Apache Bench

Although Apache Bench gave us plenty of result data, we also ran into plenty of problems. I will not cover all of them here, since they are not the point of this article, and a new load-testing client is introduced below anyway.

We got decent numbers out of Apache Bench, but the TCP connection target was hard to reach for every data point we collected. For some reason ab did not seem to handle the sleep parameter properly, so the connection count never grew to the level we needed.

As mentioned earlier, we ran multiple ab clients on a single load-generating machine using Parallel, but that approach could not span multiple machines. It is a pity we had not yet discovered pdsh at that point.

We were also still ignoring timeouts in the earlier numbers. HAProxy has several default timeout settings, but we had completely overlooked the timeouts on the ab client and on the back-end service. We spent a lot of time hunting these down while trying to fix the stress tests.

We mentioned the inflection-point graph earlier, but as these problems appeared, the goal kept drifting. To draw meaningful conclusions, however, we had to bring the focus back to it.

With Apache Bench we did reach an inflection point, but the number of TCP connections would not grow. These numbers came from roughly 40 to 45 ab clients running on five or six load-generating machines, yet the connection count never got to the order of magnitude we expected. In theory, the number of open TCP connections should rise as the sleep value rises, but in practice it had no effect.

The introduction of Vegeta

Based on the problems I encountered with Apache Bench, I continued my search for other stress testing tools that are more powerful and scalable, and finally found Vegeta.

From my personal experience, Vegeta is very scalable compared to Apache Bench and offers many more features. In our stress test scenario, one Vegeta client produced a throughput equivalent to 15 Apache Bench clients.

Below are the stress test results obtained using Vegeta.

Use Vegeta for stress testing

Let’s start with the command for a single Vegeta client. Amusingly, the Vegeta subcommand that puts load on the back-end server is called attack :p

echo "POST https://test.haproxy.in:443/ping" | vegeta -cpus=
32 attack -duration=10m  -header="sleep:30000"  -body=
post_smaller.txt -rate=2000 -workers=500  | tee reports.bin | vegeta report
Copy the code
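
The vegeta report command at the end of the pipeline prints a summary roughly like the following (the exact layout differs between Vegeta versions, and these numbers are purely illustrative):

Requests      [total, rate]            1200000, 2000.00
Duration      [total, attack, wait]    10m0.01s, 9m59.99s, 12.5ms
Latencies     [mean, 50, 95, 99, max]  15.2ms, 13.1ms, 30.2ms, 65.4ms, 1.2s
Bytes In      [total, mean]            4800000, 4.00
Bytes Out     [total, mean]            145200000, 121.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:1200000
Error Set: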

Here we have a brief look at some parameters provided by Vegeta:

  1. -cpus=32 sets the number of cores the client uses. To generate the required load, we upgraded the load-generating machine to 32 cores and 64 GB of RAM. If you look closely at the resulting numbers, the request rate itself is not that high; the upgrade was mainly needed to keep a huge number of connections open while the back-end server sleeps.
  2. -duration=10m sets the test duration; if none is specified, the test runs forever.
  3. -rate=2000 sets the number of requests per second.

As the results showed, we hit 32K requests per second with HAProxy running on just a 4-core machine. That is better than the inflection point we measured earlier, which was 31.5K requests per second for non-SSL traffic.

Here are some more stress-test numbers:

For SSL connections, 16K requests per second is not bad at all. The new client meant the whole stress-testing process had to start from scratch, but overall the results were better than those obtained with ab.

As the number of CPU cores increases, the response latency decreases at the same pressure until the pressure reaches the CPU performance limit.

However, we did not see much gain in requests per second when going from 8 cores to 16. And if we do end up using an 8-core machine in production, we cannot dedicate every core to HAProxy, because other processes need CPU time too. So we ran some tests with 6 cores allocated to HAProxy to check whether the performance was still acceptable.

Not too bad.

Introducing the sleep parameter

We were quite pleased with the stress-test results so far. However, the scenario does not resemble real production conditions until the sleep parameter comes into play.

echo "POST https://test.haproxy.in:443/ping" | vegeta 
-cpus=32 attack -duration=10m  -header="sleep:1000" 
 -body=post_smaller.txt-rate=2000 -workers=500  | tee reports.bin | vegeta report
Copy the code

The above command sets a sleep time of 1000 ms, which causes the server to apply random sleep between 0 and 1000 ms. Therefore, the average latency of the above commands is ≥ 500ms.

The numbers in the last cell represent the number of TCP connections established, packets sent, and packets received.

It is clear from this that the maximum number of requests per second decreased from 20K to 8K on a 6-core machine. Obviously, increasing the sleep time has a significant impact on the results due to the large number of TCP connections. However, the total number of connections is already close to the 700K level we expect.

Milestone # 1

How do we increase the number of TCP connections? It’s as simple as increasing the sleep time with the sleep parameter. We kept increasing this parameter and ended up at 60 seconds, with an average delay of around 30 seconds.

Vegeta reports an interesting metric: the request success rate. We found that with this sleep time only 50% of the requests were succeeding. See the following data:

With a sleep time of 60,000 milliseconds, we reached up to 400K TCP connections at 8K requests per second. The R in 60000R in the chart stands for random.

The first thing we discovered was that Vegeta’s default request timeout is 30 seconds, which is why 50% of the requests were failing. So we raised the timeout to 70 seconds to avoid hitting it again in subsequent tests.
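
In Vegeta the timeout is just another flag on the attack command; with the 60-second sleep header, the adjusted invocation might look like this (a sketch based on the commands above):

echo "POST https://test.haproxy.in:443/ping" | vegeta -cpus=32 attack -duration=10m -timeout=70s -header="sleep:60000" -body=post_smaller.txt -rate=2000 -workers=500 | tee reports.bin | vegeta report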

After raising the client timeout, we easily reached 700K connections. The only problem was that the connection count was not stable: the system could peak at 600K to 700K connections but could not hold that level for long.

We want to be able to achieve this:

The figure shows that the connection number remains stable at 780K. The number of requests per second in the table above is very high, but in a real production environment the number of requests per HAProxy machine would be much lower (around 300).

But if we were to drastically reduce the number of HAProxy machines in production (currently around 30, which means roughly 30 × 300 ≈ 9K requests per second across the cluster), the first bottleneck would be the number of TCP connections, not the CPU.

So we decided to validate a target of 900 requests per second, 30 MB/s of network bandwidth, and 2.1M TCP connections, since that is three times the load carried by a single HAProxy machine in production.

Additionally, HAProxy was still being given six cores at this point. We wanted to test with three cores allocated, because that was the easiest way to approximate our production configuration (as mentioned earlier, production machines have 4 cores and 30 GB of memory). Setting nbproc = 3 was the most convenient way to do this.

Remember that the machine we are using now has 16 cores and 30 GB of RAM, with 3 cores allocated to HAProxy.

Milestone # 2

We now have the requests-per-second ceiling for the different machine configurations, which leaves the task mentioned above: reaching three times the production load

  • Requests per second 900
  • Number of TCP connections 2.1 M
  • Network bandwidth 30MB/s

Once again we got stuck, this time at 220K TCP connections. No matter how large we made the sleep time, the connection count would not rise.

Let’s do the math: with 220K TCP connections and 900 requests per second, 110,000 / 900 ≈ 122 seconds, or roughly two minutes. We use 110K here because the 220K figure counts both inbound and outbound connections, i.e. it is a two-way total.

This led us to suspect a two-minute limit somewhere in the system, and the HAProxy logs confirmed it: the total time for most connections was around 120,000 ms.

Mar 23 13:24:24 localhost haproxy[53750]: 172.168.0.232:48380 [23/Mar/2017:13:22:22.686] api~ api-backend/http31 39/0/2062/-1/122101 -1 0 - - SD-- 1714/1714/1678/35/0 0/0 {0,"",""} "POST /ping HTTP/1.1"

Here 122101 is the total time in milliseconds. Refer to the HAProxy documentation for the meaning of every field in the log line.

Upon further investigation, we found that Node.js has a default timeout of two minutes.

For details, see the following materials:

  • How do I change the default node.js request timeout
  • Node.js Http server document
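
In code, lifting that limit is a one-line change on the HTTP server object. A minimal sketch (the port and the one-hour value are arbitrary; our actual back-end code appears later in this article):

var http = require('http');

var server = http.createServer(function (req, res) {
  res.writeHead(200);
  res.end('pong');
});

// Node's default socket timeout is 120000 ms (2 minutes); raise it so that
// requests which deliberately sleep for a long time are not cut off.
server.timeout = 3600000; // 1 hour
server.listen(8282);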

Even after fixing that timeout, things did not go as smoothly as expected. When the connection count reached about 1.3M, HAProxy’s connection count suddenly dropped to zero and then started climbing again. Running dmesg to check the kernel logs showed that the system had run out of memory. After moving to a 16-core, 64 GB machine and keeping nbproc = 3, we finally reached 2.4M connections.

The back-end code

Here is the source code of the back-end service sitting behind HAProxy. It uses a statsd client library to record the number of requests per second the server handles.

var http = require('http');
var createStatsd = require('uber-statsd-client');
qs = require('querystring');

var sdc = createStatsd({
  host: '172.168.0.134',
  port: 8125
});

var argv = process.argv;
var port = argv[2];

function randomIntInc(low, high) {
  return Math.floor(Math.random() * (high - low + 1) + low);
}

// Writes "pong", then either ends the response or schedules another
// write after a random sleep, "times" more times.
function sendResponse(res, times, old_sleep) {
  res.write('pong');
  if (times == 0) {
    res.end();
  } else {
    sleep = randomIntInc(0, old_sleep + 1);
    setTimeout(sendResponse, sleep, res, times - 1, old_sleep);
  }
}

var server = http.createServer(function (req, res) {
  headers = req.headers;
  old_sleep = parseInt(headers["sleep"]);
  times = headers["times"] || 0;
  sleep = randomIntInc(0, old_sleep + 1);
  console.log(sleep);
  sdc.increment("ssl.server.http");
  res.writeHead(200);
  setTimeout(sendResponse, sleep, res, times, old_sleep);
});

// Raise the default 2-minute socket timeout to 1 hour.
server.timeout = 3600000;
server.listen(port);

We also have a small script to launch multiple back-end services. During the tests we used eight server machines, each running 10 back-end processes, to make sure the back end itself never became the bottleneck of the stress tests.

counter=0
while [ $counter -le 9 ]
do
   port=$((8282+$counter))
   nodejs /opt/local/share/test-tools/HikeCLI/nodeclient/httpserver.js $port &
   echo "Server created on port "  $port
   ((counter++))
done
echo "Created all servers"

Client code

On the client side, there is a limit of roughly 63K TCP connections per source IP address. If this is unfamiliar, see the previous article in this series.
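
That per-IP ceiling comes from the ephemeral port range on the client machine. Checking it and widening it looks roughly like this (a sketch; the exact values are illustrative):

# see how many source ports are available for outgoing connections
sysctl net.ipv4.ip_local_port_range

# widen the range to roughly 64K usable ports (persist it in /etc/sysctl.conf)
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"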

So in order to reach 2.4M connections (a two-way total, meaning 1.2M connections initiated by clients), we needed about 20 client machines. Running the Vegeta command on all 20 machines at once was painful, and even with a tool such as csshx we still needed a way to merge the results from all the Vegeta instances.

The script is as follows:

result_file=$1

declare -a machines=("172.168.0.138" "172.168.0.141" "172.168.0.142" "172.168.0.18"
"172.168.0.5" "172.168.0.122" "172.168.0.123" "172.168.0.124" "172.168.0.232"
"172.168.0.244" "172.168.0.170" "172.168.0.179" "172.168.0.59" "172.168.0.68"
"172.168.0.137")

bins=""
commas=""

for i in "${machines[@]}"; do bins=$bins","$i".bin"; commas=$commas","$i; done;

bins=${bins:1}
commas=${commas:1}

pdsh -b -w "$commas" 'echo "POST http://test.haproxy.in:80/ping" | /home/sachinm/.linuxbrew/bin/vegeta -cpus=32 attack -connections=1000000 -header="sleep:20" -header="times:2" -body=post_smaller.txt -timeout=2h -rate=3000 -workers=500 > ' $result_file

for i in "${machines[@]}"; do scp sachinm@$i:/home/sachinm/$result_file $i.bin ; done;

vegeta report -inputs="$bins"

Fortunately, pdsh lets us run a command in parallel across multiple remote servers, and Vegeta can merge result files from several machines, which is exactly what we needed.

HAProxy configuration

This is probably the section readers are most interested in: the HAProxy configuration used in our stress tests. The most important settings are nbproc and maxconn; maxconn is what allows HAProxy to hold the desired number of TCP connections.

The maxconn setting affects the ulimit of the HAProxy process. For example:

The maximum open-file limit is set to 4M because the HAProxy maximum connection count is set to 2M; each proxied connection uses roughly two file descriptors, one towards the client and one towards the back end. Neat and clean!
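
A stripped-down sketch of the corresponding global section (illustrative values, not our full production configuration):

global
    # target of 2M concurrent proxied connections
    maxconn 2000000
    # HAProxy normally derives the file-descriptor limit from maxconn
    # (roughly two descriptors per proxied connection, hence ~4M),
    # but it can also be set explicitly
    ulimit-n 4000000
    # three worker processes, as discussed above
    nbproc 3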

Here are some HAProxy optimizations to achieve our desired metrics:

www.linangran.com/?p=547 

The back-end servers here run from http30 all the way to http83 :p

That’s all for this article. If you have read this far, I really admire your patience 🙂

Special thanks to Dheeraj Kumar Sidana, without whose help I could not have achieved so many meaningful results. 🙂

Let me know if this article has helped you. In the meantime, if you find this article useful, please recommend it and help spread it.

How We Fine-Tuned HAProxy to Achieve 2,000,000 Concurrent SSL Connections

Related links

  • Load testing series: Medium.freecodecamp.org/load-testin…
  • Series Part 1: https://medium.com/@sachinmalhotra/load-testing-haproxy-part-1-F7D64500B75D
  • Series Part 2: medium.com/@sachinmalh…
  • csshx: https://github.com/brockgr/csshx
  • pdsh: github.com/grondo/pdsh
  • HAProxy configuration: https://www.linangran.com/?p=547
  • Vegeta: github.com/tsenart/veg…
  • GNU Parallel: http://www.shakthimaan.com/posts/2014/11/27/gnu-parallel/news.html
  • nbproc setup: Blog.onefellow.com/post/824783…