The main points of this article:
- Dynamic environments and distributed systems, such as microservices, are more likely to fail.
- Failed services should be isolated to achieve graceful service degradation and improve the user experience.
- 70% of failures are caused by code changes, so sometimes rolling back code isn’t such a bad thing.
- If failures occur, make them happen quickly and independently. Teams have no control over their services’ dependencies.
- Caching, bulkheads, circuit breakers, and rate limiters are some of the architectural patterns that help build reliable microservices.
Microservices architectures make failure isolation possible through well-defined service boundaries, but every distributed system shares the same problem: failures can occur at the network, hardware, or application level. Because there are dependencies between services, the failure of any one component affects the components that depend on it. To minimize the impact of local failures, we need to build fault-tolerant services that handle certain types of failures gracefully.
Based on RisingStack’s Node.js consulting and development experience, this article introduces common techniques and architectural patterns for building highly available microservices systems.
If you are not familiar with the patterns described in this article, that does not mean what you are doing is wrong; after all, building a reliable system comes at extra cost.
Risks of a microservices architecture
A microservices architecture distributes business logic across multiple microservices, which communicate with each other over the network. Network communication introduces additional latency and complexity and requires multiple physical and logical components to work together. This additional complexity of distributed systems increases the likelihood of network failures.
One of the biggest advantages of a microservices architecture over a monolithic one is that different teams can design, develop, and deploy their services independently. They have full control over their microservice’s lifecycle. Of course, this also means that they have no control over their service’s dependencies, because those are in the hands of other teams. When adopting a microservices architecture, keep in mind that issues with deployment, configuration, and so on may cause a service provider to become temporarily unavailable.
Graceful service degradation
Failure isolation can be achieved with a microservices architecture, which means that graceful service degradation is possible when a component fails. For example, when part of a photo-sharing app fails, users may not be able to upload new images, but they can still view, edit, and share existing ones.
Figure: Theoretical microservice failure isolation
In most cases, this graceful degradation is difficult to achieve because applications in a distributed system depend on one another, and failover solutions (described later) are needed to cope with temporary failures.
Figure: Interdependent services all fail without failover logic.
Change management
Google’s site reliability engineering team found that roughly 70 percent of outages are caused by changes to a live system. Changing a service, deploying new code, or modifying configuration can introduce new defects or break the service.
In a microservices architecture, services depend on each other, so we want to minimize the probability of failure and limit its negative impact. That calls for good change-management policies and automatic rollback mechanisms.
For example, when new code is deployed or configuration changes are made, they are rolled out to a small number of service instances first, monitored, and rolled back as soon as key metrics become abnormal.
Figure: Change management — Rollback deployment
Another solution is to run two production environments. Only one of them receives the new deployment at a time, and the load balancer is pointed at the new environment only once it is confirmed to be healthy. This is known as blue-green (or red-black) deployment.
Rolling back code is not a bad thing. You can’t leave broken code in production while you wonder what went wrong, so roll it back as soon as necessary.
Health monitoring and load balancing
Service instances are constantly being started, restarted, and stopped for various reasons (failures, deployments, or auto-scaling). This makes them temporarily or permanently unavailable. To avoid problems, the load balancer should skip unhealthy instances, because they can no longer serve users or other subsystems.
The health of an application can be observed externally, for example by repeatedly calling a /health endpoint, or the application can report its own status. A service discovery mechanism continuously collects health information about service instances, and the load balancer should be configured to route traffic only to healthy instances.
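As a minimal sketch, here is what such a health endpoint might look like in a Node.js service using the built-in http module; the /health path, the port, and the response body are illustrative conventions rather than requirements of any particular load balancer.

```typescript
// Minimal health endpoint sketch using Node's built-in http module.
import { createServer } from 'node:http';

// In a real service this would check downstream dependencies
// (database connection, message broker, and so on).
function isHealthy(): boolean {
  return true;
}

const server = createServer((req, res) => {
  if (req.url === '/health') {
    const healthy = isHealthy();
    res.writeHead(healthy ? 200 : 503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: healthy ? 'ok' : 'unhealthy' }));
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(3000);
```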
Self-healing
Self-healing capabilities allow applications to recover from failures on their own. An application is self-healing if it can recover from a failure state through a series of steps. In most cases, this is implemented by an external system that monitors the health of service instances and restarts them if they remain unhealthy for an extended period. Self-healing is useful most of the time, but it can cause problems if the application is restarted over and over, which usually happens when it is overloaded or its database connections time out.
It can be tricky to implement advanced self-healing logic. For example, in the case of a database connection timeout, you need to add extra logic to the application to let the external system know that the service instance does not need to be restarted.
Failover caching
Services fail for a variety of reasons, such as network problems. Most of these errors are temporary, however, and the system’s self-healing capabilities and advanced load balancing can keep application instances serviceable while they last. This is where a failover cache comes in, providing the application with the data it needs.
A failover cache typically uses two different expiration times: a shorter one, which is the normal cache expiration time, and a longer one that applies during failures.
Figure: Failover cache
Note, however, that failover cache data may be stale, so make sure this is acceptable for your application.
Caching and failover caching can be configured through standard HTTP response headers. For example, max-age specifies how long a resource is considered fresh, and stale-if-error specifies how long a cached copy may be served when a failure occurs.
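To make the two expiration times concrete, here is a hypothetical in-memory failover cache wrapper; the TTL values, the cache key, and the function names are assumptions made for illustration, not a specific library’s API.

```typescript
// Hypothetical failover cache: serve fresh data when possible, fall back
// to stale data (within a longer TTL) only when the upstream call fails.
interface Entry<T> { value: T; storedAt: number; }

const FRESH_TTL_MS = 10 * 60 * 1000;         // normal expiration: 10 minutes
const FAILOVER_TTL_MS = 24 * 60 * 60 * 1000; // expiration during failures: 1 day

const cache = new Map<string, Entry<unknown>>();

async function getWithFailover<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  const entry = cache.get(key) as Entry<T> | undefined;
  const age = entry ? Date.now() - entry.storedAt : Infinity;

  // Fresh enough: serve from cache without calling the upstream service.
  if (entry && age < FRESH_TTL_MS) return entry.value;

  try {
    const value = await fetcher();
    cache.set(key, { value, storedAt: Date.now() });
    return value;
  } catch (err) {
    // Upstream failed: serve stale data if it is still within the failover TTL.
    if (entry && age < FAILOVER_TTL_MS) return entry.value;
    throw err;
  }
}
```

The same intent can be signalled to HTTP caches with a header such as Cache-Control: max-age=600, stale-if-error=86400.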
Modern CDNs and load balancers offer a wide variety of caching and failover mechanisms, but you can also build your own caching solution within your company.
Retries
In some cases, we cannot cache the data, or we want to update cached content and the update fails. In that case, we can retry the operation, because we expect that the resource will recover later or that the load balancer will send the request to a healthy instance.
Be careful when adding retry logic to your application, because a lot of retry operations can make things worse, or even prevent your application from recovering from a failure.
In distributed systems, a single request may trigger multiple requests or retries across microservices, producing a cascading effect. To minimize the impact of retries, limit the number of attempts and use an exponential backoff algorithm that increases the delay between retries until the maximum number of attempts is reached.
Retries are initiated by clients (browsers, other microservices, and so on), and the client does not know whether the previous request was processed successfully, so watch out for idempotency issues when retrying. For example, a retried purchase should not bill the customer twice. Using a unique idempotency key for each transaction helps solve this problem.
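As a rough sketch of these ideas, the helper below retries a POST request with exponential backoff and attaches an idempotency key. The retry limit, the delays, and the Idempotency-Key header name are illustrative assumptions (the header follows a common convention, not a fixed standard), and the code assumes Node 18+ where fetch is available globally.

```typescript
// Sketch of a retry helper with exponential backoff and an idempotency key.
import { randomUUID } from 'node:crypto';

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function postWithRetry(url: string, body: unknown, maxRetries = 3): Promise<Response> {
  // One key per logical transaction, so the server can deduplicate retries.
  const idempotencyKey = randomUUID();

  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch(url, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey,
        },
        body: JSON.stringify(body),
      });
      // Retry only on server-side errors; 4xx responses are returned as-is.
      if (res.status < 500) return res;
      if (attempt >= maxRetries) return res;
    } catch (err) {
      // Network-level failure: rethrow once the retry budget is exhausted.
      if (attempt >= maxRetries) throw err;
    }
    // Exponential backoff: 100 ms, 200 ms, 400 ms, ...
    await sleep(100 * 2 ** attempt);
  }
}
```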
Rate limiters and load shedders
A rate limiter specifies how many requests an application can receive or process within a time window. With rate limiting, you can filter out the users or microservices responsible for traffic peaks and ensure that your application is not overloaded.
You can also limit low-priority traffic and devote more resources to more critical transactions first.
Figure: Rate limiter limits peak traffic
Another kind of rate limiter, called a concurrent request limiter, can be useful in some situations: for example, when you don’t want an expensive endpoint to be called more than a certain number of times at once, but you still want to serve traffic.
A load shedder ensures that there are always enough resources available to handle critical transactions. It reserves resources for high-priority requests and does not let low-priority transactions use them all. A load shedder makes its decisions based on the state of the whole system rather than on the size of a single user’s request bucket. Load shedders help a system recover because they keep its core functions working during an incident. Stripe’s article describes rate limiters and load shedders in more detail.
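As a minimal illustration of the rate-limiting idea, here is a token-bucket sketch; the capacity and refill rate are arbitrary example values, and a production limiter would usually live in an API gateway or a shared store rather than in process memory.

```typescript
// Minimal token-bucket rate limiter: at most `capacity` requests in a burst,
// refilled at `refillPerSecond` tokens per second.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  tryRemoveToken(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Usage sketch: allow roughly 100 requests per second with bursts of up to 20.
const limiter = new TokenBucket(20, 100);

function handleRequest(): void {
  if (!limiter.tryRemoveToken()) {
    // Reject (or queue) the request; 429 Too Many Requests is the usual response.
    return;
  }
  // ... process the request ...
}
```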
Fail quickly and independently
In a microservices architecture, if a service fails, we want it to fail fast and independently. We can use the bulkhead pattern to isolate problems at the service level; more on bulkheads later.
If a service component fails, let it fail as quickly as possible, because we don’t want to waste time waiting for it to time out. Nothing is more frustrating than a hanging request and an unresponsive interface; it wastes resources and hurts the user experience. In an ecosystem where services call each other, we want to prevent an avalanche of hanging requests.
The first thing you might think of is defining fine-grained timeouts for each service call, but the problem is that you can’t know exactly what timeout is appropriate, because sometimes a network glitch affects only one or two operations. In that case, you probably don’t want to reject all requests just because a few of them are slow.
Arguably, using timeouts to achieve fast failure in a microservices architecture is an anti-pattern and should be avoided. Instead, we can use the circuit breaker pattern, which decides whether a service has failed based on statistics about its successful and failed operations.
Bulkheads
In shipbuilding, bulkheads divide a ship into watertight sections, so that if the hull is breached, the flooded section can be sealed off on its own.
The concept of bulkheads is also used in software development to isolate resources.
By applying the bulkhead pattern, we can protect limited resources from being exhausted. For example, if we have two kinds of operations that use the same database, we can use two connection pools instead of one shared pool. That way, if one kind of operation times out or overuses its pool, operations on the other pool are not affected.
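Below is a sketch of the same idea expressed as a generic concurrency limit per dependency rather than literal database connection pools; the class name and the limits are assumptions made for illustration.

```typescript
// Bulkhead sketch: cap the number of concurrent calls per dependency
// so one misbehaving dependency cannot exhaust shared resources.
class Bulkhead {
  private active = 0;
  private queue: Array<() => void> = [];

  constructor(private maxConcurrent: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    // Wait until a slot frees up; re-check after each wake-up.
    while (this.active >= this.maxConcurrent) {
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      // Wake one waiting caller, if any.
      this.queue.shift()?.();
    }
  }
}

// Usage sketch: separate bulkheads for two kinds of database work,
// so slow reporting queries cannot starve user-facing queries.
const userQueries = new Bulkhead(20);
const reportQueries = new Bulkhead(5);
```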
Circuit breakers
To limit the duration of an operation, we can define a timeout for it. Timeouts prevent operations from hanging for a long time and keep the system responsive. However, using fixed timeouts in a microservices architecture is an anti-pattern, because the environment is highly dynamic and it is almost impossible to define the right timeout for every case.
Instead of fixed timeouts, we can use circuit breakers. Circuit breakers get their name from the real-world electrical component because they behave very similarly. They protect resources and help the system recover, and they are especially useful in distributed systems, where repeated failures can cause an avalanche effect that threatens the whole system.
A circuit breaker opens when a certain type of error occurs more than a given number of times within a given period. An open circuit breaker rejects subsequent requests, just like its electrical counterpart blocks current. A circuit breaker usually closes again after a certain amount of time, giving the underlying service room to recover.
Keep in mind that not all errors should trip a circuit breaker. For example, you may want to ignore client-side problems, such as requests that result in 4xx response codes, while counting 5xx server-side errors. Some circuit breakers also have a half-open state, in which the service sends a single request to check whether the system is available and rejects the rest. If that trial request succeeds, the circuit breaker closes and requests continue to be processed; otherwise, it stays open.
Figure: Circuit breaker
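Here is a minimal circuit-breaker sketch with closed, open, and half-open states; the failure threshold and reset timeout are illustrative values, and production-grade implementations usually track failure rates over rolling windows rather than a simple counter.

```typescript
// Minimal circuit breaker sketch with CLOSED / OPEN / HALF_OPEN states.
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: State = 'CLOSED';
  private failures = 0;
  private openedAt = 0;

  constructor(private failureThreshold = 5, private resetTimeoutMs = 30_000) {}

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit breaker is open; request rejected');
      }
      // Allow a single trial request to probe whether the dependency recovered.
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await operation();
      // Success closes the breaker and clears the failure count.
      this.state = 'CLOSED';
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```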
Failure testing
We should constantly test our systems for common problems to ensure that our services can handle failures.
For example, we can use an external service to identify a group of service instances and randomly terminate one of them. This tests how the system responds to a single instance failure; you can also shut down an entire group of services to simulate a cloud outage.
Netflix’s Chaos Monkey is a very popular resilience testing tool.
Conclusion
Implementing and operating reliable services is not easy. It takes a lot of effort and costs your company money.
Reliability has many levels and aspects, so you need to find the solution that best fits your team. You should make reliability a factor in your business decisions and allocate adequate budget and time to it.
Designing a Microservices Architecture for Failure
Thanks to Guo Lei for proofreading this article.