On May 21, Netflix announced on its official blog that it had open-sourced Zuul 2, its microservices gateway component. Netflix is a model for the microservices industry, with microservices running successfully at massive production scale and a portfolio of open-source microservices components (see its GitHub organization for details) that is widely respected by peers. Zuul is the gateway component Netflix first open-sourced on June 12, 2013; it has attracted more than 4,000 followers on GitHub, and companies including Riot Games, Ctrip, and PPDai already use it in production.

Zuul is the name of a monster in English (it also appears among the Zerg in StarCraft), and Netflix named its gateway Zuul in the sense of a gatekeeper beast. Around 2013, InfoQ interviewed Adrian Cockcroft, then Director of Architecture at Netflix, and asked him: “Of all of Netflix’s open-source projects, which one do you consider the most indispensable?” Adrian replied: “One of the most easily overlooked parts of the NetflixOSS open-source projects is also one of Netflix’s strongest foundational services: the Zuul gateway service. Zuul is used primarily for intelligent routing, but it also supports authentication, region- and content-aware routing, and aggregating multiple underlying services into a unified external API. One of its highlights is that it is dynamically programmable, with configuration changes taking effect within seconds.” From Adrian’s answer, we can get a sense of how important the Zuul gateway is to a microservices infrastructure.

In September 2016, Netflix announced that it was re-architecting Zuul, which had originally been open-sourced with a synchronous, blocking design. The new version, Zuul 2, uses an asynchronous, non-blocking architecture. The main architectural difference between the two is that Zuul 2 runs on an asynchronous, non-blocking framework, Netty: Zuul 1 relies on multithreading to scale throughput, while the Netty framework underlying Zuul 2 relies on event loops and callbacks.
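For readers unfamiliar with the event-loop model, the following is a minimal, generic Netty HTTP server sketch (not Zuul’s actual server code): a small, fixed pool of event-loop threads multiplexes many connections, and work happens in callbacks running on those loops rather than in a dedicated thread per request.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.*;

public class MinimalNettyServer {
    public static void main(String[] args) throws InterruptedException {
        // A small, fixed number of event-loop threads serves all connections.
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup workers = new NioEventLoopGroup();
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                    .group(boss, workers)
                    .channel(NioServerSocketChannel.class)
                    .childHandler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            ch.pipeline().addLast(new HttpServerCodec());
                            // Handlers run as callbacks on the event loop;
                            // no per-request thread is ever blocked.
                            ch.pipeline().addLast(new SimpleChannelInboundHandler<HttpObject>() {
                                @Override
                                protected void channelRead0(ChannelHandlerContext ctx, HttpObject msg) {
                                    if (msg instanceof HttpRequest) {
                                        FullHttpResponse resp = new DefaultFullHttpResponse(
                                                HttpVersion.HTTP_1_1, HttpResponseStatus.OK);
                                        resp.headers().set(HttpHeaderNames.CONTENT_LENGTH, 0);
                                        ctx.writeAndFlush(resp);
                                    }
                                }
                            });
                        }
                    });
            bootstrap.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```

In a thread-per-request model such as Zuul 1’s, a slow backend ties up a worker thread for the entire request; in the event-loop model, the thread is released as soon as each callback returns.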

The following is a description of Zuul 2 from Netflix’s blog for your reference.

Netflix’s Cloud Gateway team runs and maintains more than 80 Zuul 2 clusters, sending traffic to about 100 (and growing) backend service clusters and handling more than 1 million requests per second. Almost all of this traffic comes from the customer devices and browsers that power the familiar discovery and playback experience.

This article details some of the interesting features of Zuul 2, which Netflix is releasing today, and discusses some of the other projects we’re building with Zuul 2.

Here is the general architecture of Zuul 2:

The Netty event handlers at the front and back of the filters handle the network protocol, web server, connection management, and proxying. With those inner workings abstracted away, the filters do all of the heavy lifting. Inbound filters run before the request is proxied and can be used to authenticate, route, or decorate the request. Endpoint filters can be used to return a static response or to proxy the request to a backend service. Outbound filters run after the response is returned and can be used for things like gzipping, collecting metrics, or adding and removing custom headers.
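As a concrete illustration, here is a minimal sketch of an inbound filter that decorates the request with a header. It is modeled on the sample filters shipped with the Zuul 2 repository; the base class and method names follow that sample, and exact packages and signatures may vary between versions.

```java
import com.netflix.zuul.filters.http.HttpInboundSyncFilter;
import com.netflix.zuul.message.http.HttpRequestMessage;

// Inbound filters run before the request is proxied; this one simply
// decorates the request with a custom header (the header name is invented
// for this example).
public class AddRequestIdFilter extends HttpInboundSyncFilter {

    @Override
    public int filterOrder() {
        // Lower values run earlier within the inbound phase.
        return 10;
    }

    @Override
    public boolean shouldFilter(HttpRequestMessage request) {
        // Apply to every request in this sketch.
        return true;
    }

    @Override
    public HttpRequestMessage apply(HttpRequestMessage request) {
        // Decorate the request before it is handed to the proxy endpoint.
        request.getHeaders().set("x-example-request-id",
                java.util.UUID.randomUUID().toString());
        return request;
    }
}
```

Endpoint and outbound filters follow the same pattern, differing mainly in the base class they extend and where in the request lifecycle they run.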

Zuul’s functionality depends almost entirely on the logic in each filter. This means it can be deployed in multiple contexts and solve different problems depending on the configuration and filters it runs.

We use Zuul as the entry point for all external traffic into Netflix’s cloud services, and we have also started using it to route internal traffic. Zuul’s core architecture is the same whether it serves as an external or an internal traffic gateway, but far fewer filters are needed in the internal case.

The Zuul code running today is the most stable and resilient version of Zuul. After multiple stages of codebase evolution and refactoring, we’re happy to share it with you.

Today we are releasing a number of core features. Here are the most exciting ones:

Server protocols

  • HTTP/2 — Full server support for inbound HTTP/2 connections

  • Two-way TLS (mutual TLS) — Support for running Zuul in more secure scenarios

Resiliency features

  • Adaptive retries — The core retry logic Netflix uses to enhance resiliency and availability

  • Origin concurrency protection – Configurable concurrency limits to protect origins from being overloaded and to isolate the origins behind Zuul from one another

Operational features

  • Request Passport – Tracks all lifecycle events for each request, which is useful for debugging asynchronous requests

  • Status categories – An enumeration of possible success and failure states for requests, more granular than HTTP status codes

  • Request attempts – Tracks each proxy attempt and its status, especially useful for debugging retries and routing

We’re also looking into some upcoming features, including:

  • WebSockets/SSE — Support for side-channel push notifications

  • Throttling and rate limiting – Protection against malicious client connections and requests, and a defense against large-scale attacks

  • Brownout filters – Disable certain CPU-intensive features when Zuul is overloaded

  • Configurable routing – File-based routing configuration, without the need to create routing filters in Zuul

At Netflix, there are several major features we have been working on but have not yet open-sourced. Each deserves its own blog post, but for now we will cover them briefly.

The feature most widely used by our partners is self-service routing. We give users applications and APIs to create routing rules based on any criteria of the request, such as the URL, path, query parameters, or headers, and we then publish those rules to all Zuul instances.
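As a purely hypothetical illustration of what such a rule might capture (the class, field, and cluster names below are invented for this sketch and are not Netflix’s actual API), a path-prefix rule could look like this:

```java
import java.util.Optional;

// Illustrative model of a self-service routing rule: match a path prefix
// and override the origin the request is proxied to.
public class RoutingRule {
    private final String pathPrefix;   // e.g. "/api/v2/"
    private final String targetOrigin; // e.g. a test or staging cluster

    public RoutingRule(String pathPrefix, String targetOrigin) {
        this.pathPrefix = pathPrefix;
        this.targetOrigin = targetOrigin;
    }

    /** Returns the override origin if the rule matches the request path. */
    public Optional<String> match(String requestPath) {
        return requestPath.startsWith(pathPrefix)
                ? Optional.of(targetOrigin)
                : Optional.empty();
    }

    public static void main(String[] args) {
        RoutingRule rule = new RoutingRule("/api/v2/", "api-test-cluster");
        // A matching path is sent to the test cluster; everything else
        // falls through to the default origin.
        System.out.println(rule.match("/api/v2/users"));    // Optional[api-test-cluster]
        System.out.println(rule.match("/static/logo.png")); // Optional.empty
    }
}
```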

The primary use case is routing traffic to a specific test or staging cluster, but there are many use cases for real production traffic as well. For example:

  • Services that need to split their traffic create routing rules that map certain paths or prefixes to different origins

  • Developers bring new services online by creating routes that map new host names to their new origins

  • Developers run load tests by routing a percentage of existing traffic to a small cluster and making sure the application degrades gracefully under load

  • Teams refactoring an application migrate to the new origin gradually by creating rules that map traffic one path at a time

  • Teams test changes (canary testing) by sending a small amount of traffic to an instrumented cluster running the new version

  • If a team’s changes require multiple consecutive requests against the new version, it runs sticky canary tests that route the same users to the new version for a short period of time

  • The security team creates rules based on paths or request headers that reject “malicious” requests across Zuul clusters

As you can see, we use self-service routing extensively and are increasing the customizability and scope of routing to support more use cases.

Another major feature we have been working on is making load balancing smarter. We are able to route around the failures, slowdowns, GC issues, and various other problems that occur frequently when running large numbers of nodes. The goal of this work is to improve the resiliency, availability, and quality of service of all Netflix services.

Here are a few of the cases we deal with:

Cold instances

When new origin instances start up, we send them a reduced amount of traffic for a period of time until they warm up. We observed this problem in applications with large codebases and heavy metaspace usage: such applications need significant time to JIT-compile their code before they are ready to handle large amounts of traffic.

We also bias traffic toward older instances, and if we do happen to hit a cold instance that is still slow, we can always retry the request on a warmed-up one. This has given us an order-of-magnitude improvement in availability.
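A hypothetical sketch of the warm-up idea (not Netflix’s actual algorithm): give newly started instances a reduced load-balancing weight that ramps up over an assumed warm-up window. The window length and minimum weight below are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;

// Newly started instances get a reduced share of traffic that ramps up
// linearly over a warm-up window.
public class WarmupWeight {

    private static final Duration WARMUP = Duration.ofMinutes(5); // assumed window
    private static final double MIN_WEIGHT = 0.1; // floor so cold instances still warm up

    /** Returns a load-balancing weight in [MIN_WEIGHT, 1.0] based on instance age. */
    public static double weight(Instant startedAt, Instant now) {
        double ageMillis = Duration.between(startedAt, now).toMillis();
        double fraction = Math.min(1.0, ageMillis / WARMUP.toMillis());
        return Math.max(MIN_WEIGHT, fraction);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(weight(now.minusSeconds(30), now));  // cold: ~0.1
        System.out.println(weight(now.minusSeconds(600), now)); // warm: 1.0
    }
}
```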

High error rates

Errors happen all the time for a variety of reasons, whether bugs in code, bad instances, or invalid configuration properties. Fortunately, as a proxy we can detect errors reliably, whether they are 5xx errors or connection problems with the service.

We track the error rate for each origin; a spike in the error rate means the entire service is in trouble. When that happens, we limit the number of retries from devices and disable internal retries so that the service has a chance to recover. In addition, we track consecutive failures per instance and blacklist failing instances for a period of time.
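A minimal sketch of the consecutive-failure bookkeeping described above; the threshold and names are assumptions for illustration, not Netflix’s implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Per-instance consecutive-failure tracking: after a configurable number of
// failures in a row, the instance is considered blacklisted until a success
// resets the counter (or a timeout elapses, omitted here for brevity).
public class InstanceHealth {

    private static final int BLACKLIST_THRESHOLD = 3; // assumed threshold

    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    public void recordSuccess() {
        consecutiveFailures.set(0);
    }

    public void recordFailure() {
        consecutiveFailures.incrementAndGet();
    }

    /** True once the instance has failed BLACKLIST_THRESHOLD times in a row. */
    public boolean isBlacklisted() {
        return consecutiveFailures.get() >= BLACKLIST_THRESHOLD;
    }
}
```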

Overloaded instances

With the approaches above, we send less traffic to servers in a cluster that are throttling or rejecting connections, and we soften the impact by retrying those failed requests on other servers.

We are now rolling out an additional approach that aims to avoid overloading servers in the first place. This is achieved by having origins report their current utilization to Zuul, which Zuul then factors into its load-balancing decisions, reducing error rates, retries, and latency.

Origins add a header to every response stating their current utilization as a percentage, along with the target utilization they would like across the cluster. How that percentage is calculated is entirely up to each application; engineers can use whatever metric suits them best. This allows for a general solution, rather than us trying to impose a one-size-fits-all approach.

With this feature in place, we assign each instance a score (a combination of its reported utilization and other factors) and make a choice-of-two load-balancing selection.
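A minimal sketch of a choice-of-two selection driven by reported utilization, assuming the utilization value has already been parsed from the response header described above; the combination with other factors that Netflix mentions is omitted here.

```java
import java.util.List;
import java.util.Random;

// "Power of two choices": sample two instances at random and route to the
// one with the lower utilization score.
public class TwoChoiceBalancer {

    public static class Instance {
        final String id;
        volatile double utilization; // 0.0 - 100.0, updated from response headers

        Instance(String id, double utilization) {
            this.id = id;
            this.utilization = utilization;
        }
    }

    private final Random random = new Random();

    public Instance choose(List<Instance> instances) {
        Instance a = instances.get(random.nextInt(instances.size()));
        Instance b = instances.get(random.nextInt(instances.size()));
        // Prefer the less utilized of the two randomly sampled instances.
        return a.utilization <= b.utilization ? a : b;
    }
}
```

Compared with always picking the globally least-loaded instance, sampling two candidates avoids herding all traffic onto a single instance whose reported utilization is momentarily low.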

As we moved from a handful of origins to an environment where anyone can quickly spin up a container cluster and deploy it behind Zuul, we found we needed to detect and pinpoint origin failures automatically.

Thanks to Mantis, our real-time event-streaming system, we built an anomaly detector that aggregates error rates per service and notifies us in real time when a service is in trouble. Based on all of the anomalies within a given time window, it creates a timeline of all the problematic origins. We then create an alert email with that context, containing the timeline of events and the affected services. This allows operators to quickly correlate the events, get oriented, debug a specific application or feature, and ultimately find the root cause.

In fact, simply sending notifications to the origin teams is valuable in itself. Beyond Zuul, we have added more internal applications to build an even broader timeline of events. This has been a huge help during production incidents, allowing operators to identify and resolve problems quickly before they become serious outages.

Original post: https://medium.com/netflix-techblog/open-sourcing-zuul-2-82ea476cb2b3