The Netflix CSA team works to help systems reduce errors, improve availability, and become more resilient to failure.
At a scale of more than a million requests per second, even a low error rate can affect the member experience, so it is worth driving the error rate down further.
We therefore started learning from Zuul and other teams to improve our load balancing mechanism and further reduce the errors caused by server overload.
In Zuul we have historically used the Ribbon load balancer (https://github.com/Netflix/ribbon/) with a round-robin scheduling algorithm, plus some filtering mechanisms for blacklisting servers with high connection-failure rates.
Over the years, we’ve made a number of improvements and customizations designed to send less traffic to recently started servers so they don’t overload.
These results were significant, but for some particularly troublesome source clusters, we still saw load-related error rates that were much higher than expected.
If all the servers in the cluster are overloaded, there is little improvement in choosing one server over another.
However, we often see cases where only some servers are overloaded, such as:
- After a server starts up (during red-black deployments and auto-scaling events).
- A server is temporarily slowed down or blocked by staggered dynamic property/script/data updates or by large garbage collection (GC) events.
- Bad server hardware. It is common to see some servers that never run as fast as the others, whether because of noisy neighbors or different hardware.
Guiding principles
There are a few principles that are useful to follow when starting a project; here are the ones that guided this one.
Work within the constraints of existing load balancer frameworks
Our previous load balancing customizations were tightly coupled to the Zuul code base, so we could not share them with the rest of Netflix.
So this time we decided to accept the constraints and the extra effort required and keep reusability in mind from the start. That makes it easier for other systems to adopt the work and reduces the chance of reinventing the wheel.
Learn from others
Try to borrow other people’s ideas and techniques. Take, for example, the choice-of-2 and probation algorithms previously examined in Netflix’s other IPC stacks.
Avoid distributed states
Prioritize local decisions to avoid the resilience risks, complexity, and latency that come with coordinating state across a cluster.
Avoid client configuration and manual adjustment
Our experience with Zuul over the years has shown that putting part of a service's configuration into client services owned by other teams causes problems.
One problem is that these client-side configurations tend to fall out of sync with the changing server-side configuration, or require change management coordinated between services that belong to different teams.
For example, upgrading the EC2 instance type used by service X means the cluster needs fewer nodes, so the max-connections-per-host client configuration in service Y should now be increased to reflect the larger per-node capacity.
Should the client-side change be made first? Or the server-side change? Or both at the same time? More likely, the setting is forgotten about entirely, leading to further problems.
If possible, use adaptive mechanisms that change based on current traffic, performance, and environment rather than configuring static thresholds.
If static thresholds are really needed, have the service communicate them to clients at runtime, rather than having the service team coordinate threshold configuration for each client; this avoids the problems of pushing changes across teams.
Load balancing method
One general idea is that while the best data source for server-side latency is the client view, the best data source for server utilization comes from the server itself. Combining these two data sources provides us with the most efficient load balancing.
We used a combination of complementary mechanisms, most of which had previously been developed and used by others:
- The choice-of-2 algorithm for choosing between servers.
- Primary balancing based on the load balancer's own view of a server's utilization.
- Secondary balancing based on the utilization reported by the server itself.
- Probation and server-age based mechanisms to avoid overloading newly started servers.
- Decaying the collected per-server statistics to zero over time.
Combining join-the-shortest-queue with server-reported utilization
We chose to combine the commonly used join-the-shortest-queue (JSQ) algorithm with a choice-of-2 algorithm driven by server-reported utilization, in an attempt to get the advantages of both.
The problem of JSQ
Joining the shortest queue works well for a single load balancer, but can be a serious problem if used alone in a cluster of load balancers.
The problem is that load balancers tend to herd: they simultaneously select the same under-utilized server and overload it, then move on to the next least-utilized server and overload it, and so on.
This can be solved by combining JSQ with the choice-of-2 algorithm, which largely eliminates the herding problem.
JSQ is typically implemented by counting only the in-flight requests from the local load balancer to each server, but that local view can be misleading when there are tens or hundreds of load balancer nodes.
Figure 1: The view of a single load balancer can be quite different from reality
For example, in this figure, load balancer A has one inflight request to server X, one request to server Z, but no request to server Y.
So when it receives a new request and has to choose which server is least utilized, it will choose server Y from the available data.
But that is the wrong choice: server Y is actually the most utilized, because the other two load balancers both have requests in flight to it; load balancer A just has no way of knowing that. This shows how different a single load balancer's view can be from reality.
Another problem we encountered with relying solely on the client view was that for large clusters (especially with low traffic), load balancers often had only a few active connections with some of the hundreds of servers.
So when it chooses the least loaded server, it is often choosing between zero and zero: it has no utilization data for the servers it is choosing between, so it has to guess randomly.
One solution to this problem would be to share each load balancer's in-flight request counts with all the other load balancers, but then we would have to solve a distributed state problem.
We usually use distributed mutable state as a last resort because the benefits need to outweigh the actual costs involved:
- It adds operational overhead and complexity to tasks such as deployments and canarying.
- It carries resilience risk tied to the blast radius of data corruption (i.e. bad data on 1% of the load balancers is an annoyance; bad data on 100% of them is an outage).
- It costs either implementing a P2P distributed-state system among the load balancers, or operating a separate database with the performance and resilience needed to handle this heavy read and write traffic.
We chose another, simpler solution: rely instead on each server reporting its utilization to every load balancer.
Utilization reported by the server
Using each server's view of its own utilization has the advantage that it aggregates across all the load balancers using that server, which avoids the JSQ problem of an incomplete view.
There are two ways we can achieve this:
- Actively poll each server's health-check endpoint for its current utilization.
- Passively track responses from the servers, annotated with their current utilization data.
We chose the second method for a simple reason: it keeps the data fresh while avoiding the additional burden on servers of having N load balancers poll M servers every few seconds.
One effect of this passive strategy is that the more frequently a load balancer sends requests to a server, the more up-to-date its view of that server's utilization.
Therefore, the higher the requests per second (RPS), the better the load balancing; conversely, the lower the RPS, the worse it is.
This has not been a problem for us, but for services that receive low RPS through one particular load balancer (and high RPS through another), actively polling health checks may be more effective.
The tipping point is when a load balancer's per-server RPS falls below the polling frequency used for health checks.
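To make the passive approach concrete, below is a minimal Java sketch, our own illustration rather than Netflix's actual code, of how a load balancer might record the utilization value that servers attach to their responses (the header format itself is shown in the next section); the class name, parsing, and fallback behavior are all assumptions.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: keep the latest server-reported utilization per server,
// updated passively from response headers as requests flow through.
public class ServerUtilizationView {

    // Latest utilization score reported by each server (0.0 - 1.0, possibly higher under overload).
    private final Map<String, Double> reportedUtilization = new ConcurrentHashMap<>();

    // Call for every response received from a server.
    public void onResponse(String serverId, String utilizationHeader) {
        if (utilizationHeader == null) {
            return; // server does not report utilization; rely on client-side stats only
        }
        // Header value format: "<current-utilization>[, target=<target-utilization>]"
        String current = utilizationHeader.split(",")[0].trim();
        try {
            reportedUtilization.put(serverId, Double.parseDouble(current));
        } catch (NumberFormatException ignored) {
            // Malformed header; keep the previous value rather than guessing.
        }
    }

    public double utilizationOf(String serverId) {
        return reportedUtilization.getOrDefault(serverId, 0.0);
    }
}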
Server-side implementation
We implemented this mechanism on the server side by simply tracking the number of requests being processed, converting it to a percentage of the maximum configured value for the server, and writing it out as an HTTP response header:
X-Netflix.server.utilization: <current-utilization>[, target=<target-utilization>]
An optional target utilization can be specified by the server, indicating what percentage utilization they intend to operate at under normal conditions. The load balancer then uses this for coarse-grained filtering, described later.
We experimented with metrics other than the number of in-flight requests, such as operating-system-reported CPU utilization and load average, but found that they introduced volatility, so we settled on the relatively simple implementation of counting only in-flight requests.
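As a rough illustration of that server-side bookkeeping (an assumed sketch with made-up class and parameter names, not the actual Netflix implementation), the counting and header value could look something like this:
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: track in-flight requests, convert the count to a fraction
// of a configured maximum, and expose the value to be written into the
// X-Netflix.server.utilization response header on every response.
public class ServerUtilizationTracker {

    private final AtomicInteger inflight = new AtomicInteger();
    private final int maxInflightRequests;   // assumed configured per-server limit
    private final double targetUtilization;  // optional operating target

    public ServerUtilizationTracker(int maxInflightRequests, double targetUtilization) {
        this.maxInflightRequests = maxInflightRequests;
        this.targetUtilization = targetUtilization;
    }

    public void requestStarted() {
        inflight.incrementAndGet();
    }

    public void requestFinished() {
        inflight.decrementAndGet();
    }

    // Value to place in the response header, e.g. "0.55, target=0.6".
    public String headerValue() {
        double utilization = (double) inflight.get() / maxInflightRequests;
        return utilization + ", target=" + targetUtilization;
    }
}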
Choice-of-2 algorithm instead of round robin scheduling algorithm
Because we wanted to be able to compare statistics to select servers, we abandoned the existing simple round robin scheduling algorithm.
Another approach we tried was JSQ in combination with a ServerListSubsetFilter, to reduce the herding problem of distributed JSQ. That gave reasonable results, but the distribution of requests across the target servers was still too wide.
So instead, we implemented choice-of-2 using an earlier lesson learned by another team at Netflix. The advantage is that it is easy to implement, keeps the CPU cost on the load balancer low, and the requests are well distributed.
Selection is based on multiple factors
To choose between two servers, we compare three different factors:
- Client health: the rolling percentage of connection-related errors for this server.
- Server utilization: the latest utilization score reported by this server.
- Client utilization: the current number of in-flight requests from this load balancer to this server.
These three factors are used to assign points to each server and then compare the total points to choose the winner. Using multiple factors like this does increase the complexity of the implementation, but it protects against the extreme case problem of relying on just one factor.
For example, if a server starts to fail and rejects all requests, it will report a much lower utilization (because requests are rejected faster than they are accepted); if that were the only factor used, all the load balancers would start sending more requests to the broken server. The client health factor mitigates this situation.
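Below is a minimal sketch of how choice-of-2 selection over these three factors might be wired together; the field names mirror the factors above, but the equal-weight scoring and class names are illustrative assumptions rather than the actual implementation, which assigns and compares per-factor scores.
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: pick two servers at random and keep the one with the
// lower combined score across the three factors described above.
public class ChoiceOfTwoSelector {

    public static class ServerStats {
        String serverId;
        double errorRate;          // client health: rolling connection-error rate (0.0 - 1.0)
        double serverUtilization;  // latest utilization score reported by the server
        double clientUtilization;  // in-flight requests from this load balancer, normalized
    }

    public ServerStats choose(List<ServerStats> servers) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        ServerStats a = servers.get(rnd.nextInt(servers.size()));
        ServerStats b = servers.get(rnd.nextInt(servers.size()));
        return score(a) <= score(b) ? a : b;
    }

    // Lower is better. In practice each factor would be converted to a comparable
    // score and weighted; here they are simply summed for illustration.
    private double score(ServerStats s) {
        return s.errorRate + s.serverUtilization + s.clientUtilization;
    }
}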
Filtering
When we randomly select the two servers to compare, we filter out any servers that are above the conservatively configured threshold for utilization and health.
We do this filtering on every request, to avoid the staleness that comes with filtering only periodically. To avoid driving up CPU load on the load balancer, we only make a best effort: we try N times to find a randomly chosen viable server, and fall back to an unfiltered choice if necessary.
This filtering is helpful when a large proportion of the server pool has persistent problems. In that case, randomly choosing 2 servers would often pick 2 bad servers to compare, even though many good servers are available.
The downside is that this relies on statically configured thresholds, which we try to avoid. However, our test results convinced us that it was worth adding this approach, even with some generic thresholds.
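A minimal sketch of this best-effort filtering, with assumed thresholds, attempt count, and names, might look like this:
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: try a bounded number of random picks that pass the
// utilization and health thresholds, then fall back to an unfiltered choice.
public class FilteringPicker {

    public static class Candidate {
        double serverUtilization;  // latest server-reported utilization
        double errorRate;          // rolling connection-error rate seen by this load balancer
    }

    private static final int MAX_ATTEMPTS = 5;          // assumed retry budget
    private static final double MAX_UTILIZATION = 0.95; // assumed conservative thresholds
    private static final double MAX_ERROR_RATE = 0.5;

    public Candidate pick(List<Candidate> servers) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < MAX_ATTEMPTS; i++) {
            Candidate c = servers.get(rnd.nextInt(servers.size()));
            if (c.serverUtilization < MAX_UTILIZATION && c.errorRate < MAX_ERROR_RATE) {
                return c;
            }
        }
        // Best effort only: give up on filtering rather than burning more load balancer CPU.
        return servers.get(rnd.nextInt(servers.size()));
    }
}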
Probation
For any server from which the load balancer has not yet received a response, we only allow one in-flight request at a time, and we filter out these in-probation servers until a response is received from them.
This helps keep newly started servers from being overloaded with requests before they have had a chance to show how utilized they are.
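A minimal sketch of this probation rule, assuming the load balancer keeps a small piece of per-server state like the following:
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: until at least one response has been received from a
// server, allow at most one in-flight request to it.
public class ProbationTracker {

    private final AtomicInteger inflight = new AtomicInteger();
    private final AtomicBoolean hasResponded = new AtomicBoolean(false);

    // May this load balancer send another request to the server right now?
    public boolean allowRequest() {
        return hasResponded.get() || inflight.get() == 0;
    }

    public void onRequestSent() {
        inflight.incrementAndGet();
    }

    public void onResponseReceived() {
        inflight.decrementAndGet();
        hasResponded.set(true);
    }
}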
Warm-up based on server age
We use the server's age to ramp up traffic to newly started servers over their first 90 seconds. This is another precautionary mechanism that adds some protection against overloading servers in the sometimes delicate post-startup state.
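As an illustration only (the article specifies the 90-second window; the linear shape, the floor value, and how the weight is applied are assumptions), an age-based traffic ramp could be computed like this:
// Hypothetical sketch: compute a traffic weight for a server based on its age,
// ramping from a small floor up to full weight over the first 90 seconds.
public class ServerWarmup {

    private static final long WARMUP_MILLIS = 90_000; // 90-second ramp from the article
    private static final double FLOOR_WEIGHT = 0.1;   // assumed minimum so new servers still get some traffic

    // Returns a weight in (0, 1] that selection logic could factor into a
    // server's chance of being picked.
    public static double trafficWeight(long serverStartMillis, long nowMillis) {
        long age = nowMillis - serverStartMillis;
        if (age >= WARMUP_MILLIS) {
            return 1.0;
        }
        return Math.max(FLOOR_WEIGHT, (double) age / WARMUP_MILLIS);
    }
}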
Statistics decay
To ensure that a server is never permanently blacklisted, all of the statistics collected for load balancing decay over time (currently a linear decay over 30 seconds).
For example, if the server’s error rate rises to 80% and we stop sending traffic to it, the value we use decays to zero in 30 seconds (i.e., after 15 seconds the error rate will be 40%).
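A minimal sketch of such a linearly decaying value (the API is assumed; the 30-second window and the 80% → 40% example follow the text above):
// Hypothetical sketch: a recorded statistic that decays linearly to zero over
// 30 seconds, so a server is never blacklisted forever.
public class DecayingValue {

    private static final long DECAY_MILLIS = 30_000; // linear decay window

    private double value;
    private long lastUpdateMillis;

    public void set(double newValue, long nowMillis) {
        this.value = newValue;
        this.lastUpdateMillis = nowMillis;
    }

    // e.g. a value of 0.8 set at t=0 reads as 0.4 at t=15s and 0.0 from t=30s onward.
    public double get(long nowMillis) {
        long elapsed = nowMillis - lastUpdateMillis;
        if (elapsed >= DECAY_MILLIS) {
            return 0.0;
        }
        return value * (1.0 - (double) elapsed / DECAY_MILLIS);
    }
}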
Operational impact
Wider distribution of requests
One negative side effect of moving away from round-robin scheduling is that where we previously had a very tight distribution of requests across the servers in a cluster, the variation (delta) between servers is now much greater.
Using the Choice-of-2 algorithm can help alleviate this situation considerably (compared to JSQ across all or some servers in a cluster), but it is impossible to avoid it completely.
So you need to consider this on an operational level, especially for canary analysis, where we typically compare the absolute values of metrics like number of requests, error rates, and CPU.
Slower servers receive less traffic
This is obviously the desired effect, but it has some operational ripple effects for teams that were used to the evenly distributed traffic of round-robin scheduling.
Since traffic distribution across source servers now depends on utilization, some servers will receive more or less traffic if they are running different builds that are more or less efficient, so:
- During a red-black deployment, if the performance of the new server group has degraded, it will receive less than 50% of the traffic.
- The same effect shows up in canary analysis: the baseline cluster may receive a different amount of traffic than the canary cluster, so when looking at metrics it is best to consider RPS and CPU together (for example, a canary might show lower RPS with the same CPU).
- Less effective outlier detection: we usually rely on automation to detect and terminate anomalous servers in a cluster (usually virtual machines that are slow right after startup because of some hardware problem), and that detection becomes harder when the anomalous servers receive less traffic due to load balancing.
Rolling dynamic data updates
A benefit of switching from round-robin scheduling to this new load balancer is that it combines well with staged rollouts of dynamic data and properties.
Our best practice is to deploy data updates one area (data center) at a time, limiting the scope of unexpected problems.
Even without any problems caused by the data update itself, the act of the server doing the update can lead to temporary spikes in load (usually associated with garbage collection).
If that spike appears on all the servers in a cluster at the same time, it can cause a large spike in load shedding, with errors propagating upstream. In this situation the load balancer is of little help, because every server is experiencing high load.
One solution, when used in conjunction with an adaptive load balancer like this one, is to roll data updates across the servers of a cluster.
If only a small number of servers are updating at the same time, the load balancer can temporarily reduce the traffic sent to them, as long as there are enough other servers in the cluster to absorb the diverted traffic.
Synthetic load test results
We made extensive use of synthetic load testing scenarios while developing, testing, and tuning different aspects of the load balancer.
This is useful for validating behavior on real clusters and networks in a reproducible way, beyond what unit tests can provide, without using actual user traffic.
Figure 2: Comparison of results
The main results of this testing are summarized below:
- Compared to the round-robin approach, the new load balancer with all features enabled reduced load shedding and connection errors by several orders of magnitude.
- Average and long-tail latencies improved substantially (roughly a 3x reduction compared to round robin).
- Adding the server-reported utilization alone was a huge benefit, reducing errors by an order of magnitude and greatly reducing latency.
Impact on real production traffic
We found that the new load balancer is very effective at apportioning to each source server as much traffic as it can handle.
The advantage of this is that traffic is routed around both intermittently and persistently degraded servers without any human intervention, which avoids waking engineers up in the middle of the night and distracting them during the day.
It is difficult to show this effect during normal operation, but it can be seen during production incidents, and for some services even during normal steady-state operation.
During an incident
In a recent incident, a bug in a service caused more and more server threads to slowly block: from the moment a server started, a few threads would block every hour, until the server eventually maxed out and began shedding load.
In the figure below showing the RPS for each server, you can see the wide distribution between servers before 3 a.m. This is because the load balancer sends less traffic to servers with more blocked threads.
Then after 3:25 am, the automatic scaling mechanism starts up more servers, each receiving about twice as much RPS as the existing servers, because they don’t yet have any blocked threads and can therefore successfully handle more traffic.
Figure 3: Requests per second per server
Now if you look at the graph of error rates per server over the same time frame, you can see that errors are evenly distributed across all servers throughout the event, even though we know that some servers have much lower capacity than others.
This shows that the load balancer was doing its job: because there was so little available capacity in the cluster as a whole, all the servers were running at a load slightly above what they could handle.
Then, when the auto-scaling mechanism launched new servers, the load balancer sent them as much traffic as they could handle, until they were erroring as little as the rest of the cluster.
Figure 4: Errors per second per server
In summary, load balancing was very effective at distributing traffic to servers, but in this case, not enough new servers were started to reduce the total number of errors all the way down to zero.
The steady state
For some services whose servers intermittently shed load for a few seconds because of garbage collection events, we also saw a significant reduction in steady-state noise. You can see that errors dropped markedly once the new load balancer was enabled:
Figure 5: Load-related error rates in the weeks before and after enabling the new load balancer
Gaps in alerting
Unexpectedly, this exposed some gaps in our automated alerting. Some existing alerts based on a service's error rate, which previously fired when a problem affected only a small part of the cluster, now fire much later, or not at all, because the error rate stays low.
This means that sometimes a serious problem is affecting the cluster without the team being notified. The solution is to fill in the gaps by adding additional alerts for deviations from utilization metrics, not just error metrics.
Conclusion
This article is not intended as an advertisement for Zuul (although it is an excellent system); rather, it is meant to contribute another useful approach to the proxy/service mesh/load balancing community.
Zuul has been a good system for testing, implementing, and refining these load balancing approaches, and running it at Netflix's scale and with Netflix's requirements has let us demonstrate and improve them.
There are many different approaches that can be used to improve load balancing, and this approach worked well for us, dramatically reducing the load-related error rate and greatly improving the actual load balancing on the server.
But as with any software system, you should make decisions based on your own organization's constraints and goals, and try to avoid chasing perfection.
Compiled by Mike Smith, Shen Jianmiao
Editors: Tao Jialong, Sun Shujuan