Foreword

At Netflix, we’ve found that proactive failure testing is a great way to prepare for problems in our production environment and to deliver a reliable product to our members.

Our efforts in this space (some of them still manual) have helped us make it through the holiday season without incident (which is all the better if you happen to be on call on New Year’s Eve! 😊).

But who likes manual work? What’s more, we only test for the failures we expect, and each exercise typically covers only a single service or component. We can do better!

Imagine a monkey crawling through your code and infrastructure, injecting small glitches and checking to see if they cause user impact.

While exploring how to build such a monkey, we discovered a failure testing approach developed by Peter Alvaro called Molly (lineage-driven fault injection). (http://www.cs.berkeley.edu/~p)…

Since we already had a fault injection service called FIT (Failure Injection Testing), we believed we could build a prototype implementation in short order. (https://medium.com/@Netflix_T)…

We thought it would make a lot of sense to apply the concepts from the Molly paper in our production environment, so we got in touch with Peter to see whether he was interested in collaborating on a prototype. He was, and the results of our collaboration are described below.

1. Exploring the algorithm

“Lineage-driven fault injection (LDFI) reasons backwards from correct system output to determine whether an injected fault could change that output.”

Molly begins by looking at everything that went into a successful request and then asks, “What could change this outcome?” Take the following simplified request as an example:

Failure set: (A | R | P | B)

To start, we assume everything is necessary: user impact could be caused by the failure of A, R, P, or B, where A stands for the API (and so on). We then choose randomly from the list of potential failure points, re-execute the request, and inject a failure at the chosen point.

Three potential outcomes:

  • The request fails: we have found a user-impacting failure (future experiments must be adjusted to account for this failure)
  • The request succeeds: the failed service/failure point is not critical
  • The request succeeds because a failover or fallback took over: we have discovered a new failure point

In this example, Ratings fails but the request still succeeds, resulting in the following graph:

Failure sets: (A | P | B) & (A | R | P | B)

We now know more about what goes into a successful request and its failure points. Since Playlist is a potential failure point, we fail it next, producing the following graph:

Failure sets: (A | PF | B) & (A | P | B) & (A | R | P | B)

This illustrates the third potential outcome above: the request still succeeded because the PlaylistFallback was executed. So we have a new failure point to explore. We update the scope of the experiments and repeat: rinse, inject, repeat, until there are no more failure points left to explore.

Molly is not prescriptive about how to explore this search space. Our approach is to compute the combinations of failure points that satisfy the failure sets, and then select randomly from the smallest combinations, for example [{A}, {PF}, {B}, {P, PF}, {R, A}, {R, B}…].

Exploration begins with all single points of failure: A, PF, B; it then continues with combinations of two failures, and so on.
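As a concrete illustration of this loop, here is a minimal sketch in Python. It is our own rough rendering of the idea, not Netflix’s implementation: `execute_with_faults` stands in for replaying a request through FIT with the chosen faults injected, and the toy graph (API, Playlist/PlaylistFallback, Ratings, Bookmarks) mirrors the example above.

```python
from itertools import combinations

def explore(initial_points, execute_with_faults, max_size=2):
    """Explore combinations of failure points, smallest sets first."""
    frontier = set(initial_points)        # failure points known so far
    tested, member_impacting = set(), []
    size = 1
    while size <= max_size:
        before = set(frontier)
        for candidate in map(frozenset, combinations(sorted(frontier), size)):
            # Skip combinations already tested or supersets of known failures.
            if candidate in tested or any(candidate >= f for f in member_impacting):
                continue
            tested.add(candidate)
            succeeded, new_points = execute_with_faults(candidate)
            if not succeeded:
                member_impacting.append(candidate)   # outcome 1: user impact
            elif new_points:
                frontier |= new_points               # outcome 3: fallback discovered
            # outcome 2 (success, nothing new): the point is not critical
        if frontier == before:
            size += 1   # nothing new discovered; move on to larger combinations
        # otherwise re-enumerate the same size, now including the new points
    return member_impacting

# Toy stand-in for FIT: Playlist has a fallback; losing both Playlist and
# its fallback (or the API itself) impacts the user.
def fake_execute(faults):
    if "API" in faults or {"Playlist", "PlaylistFallback"} <= faults:
        return False, set()
    return True, ({"PlaylistFallback"} if "Playlist" in faults else set())

print(explore({"API", "Playlist", "Ratings", "Bookmarks"}, fake_execute))
# e.g. [frozenset({'API'}), frozenset({'Playlist', 'PlaylistFallback'})]
```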

2. Automated failure testing implementation

2.1 Mapping the dependency tree

What is the dependency tree for Netflix requests?

Using our distributed tracing system, we can construct a request’s dependency tree across the whole microservice architecture. With FIT, we enrich it with additional information in the form of “injection points”: the key inflection points where the system may fail.

Injection points include Hystrix command executions, cache lookups, database queries, HTTP calls, and more.

The data provided by FIT helps us build a more complete dependency tree, which is the input for the algorithm’s analysis.

In the example above we saw a simple service request tree. With FIT data, it expands into a much richer request dependency tree of injection points.
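As a rough sketch of what such an annotated tree might look like in code (our illustration only; the node names and kinds are made up, and the real tree is assembled from distributed-tracing spans plus FIT metadata):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InjectionPoint:
    name: str                  # e.g. "PlaylistHystrixCommand" (hypothetical)
    kind: str                  # "service" | "hystrix" | "cache" | "db" | "http"
    children: List["InjectionPoint"] = field(default_factory=list)

    def all_points(self):
        """Flatten the tree into the list of candidate failure points."""
        yield self
        for child in self.children:
            yield from child.all_points()

# Toy version of a request dependency tree annotated with injection points.
tree = InjectionPoint("API", "service", [
    InjectionPoint("PlaylistHystrixCommand", "hystrix", [
        InjectionPoint("playlist-cache-lookup", "cache"),
        InjectionPoint("playlist-db-query", "db"),
    ]),
    InjectionPoint("RatingsHttpCall", "http"),
    InjectionPoint("BookmarksHystrixCommand", "hystrix"),
])

print([p.name for p in tree.all_points()])
```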

2.2 Criteria for success

What counts as “success”? What matters most is the user experience, so we need a yardstick that reflects it. To that end, we leverage the metrics reported by devices: by analyzing them, we can determine whether a request had any user impact.

Another, simpler approach is to rely on the HTTP status code to decide whether a request succeeded. But status codes can be misleading, because some frameworks return a “200” on partial success and bury the user-impacting error inside the response body.

Currently, only a portion of Netflix requests have corresponding device-reported metrics. By adding device-reported metrics for more request types, we have the opportunity to extend automated failure testing to cover a wider range of device traffic.
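A minimal sketch of what the success check might look like, assuming device-reported metrics are available as per-request records (the field names here are illustrative, not Netflix’s actual schema):

```python
def request_succeeded(record: dict) -> bool:
    """Decide success from the device's point of view, not just the HTTP status."""
    # The status code alone can be misleading: some frameworks return 200 on
    # partial success and bury the user-impacting error in the response body.
    if record.get("status_code", 200) >= 500:
        return False
    # Trust what the device itself reported about the user experience.
    return not record.get("device_reported_error", False)

# Example records (hypothetical): a "200" that the device still flagged as broken.
records = [
    {"status_code": 200, "device_reported_error": False},
    {"status_code": 200, "device_reported_error": True},   # partial success
    {"status_code": 503, "device_reported_error": True},
]
print([request_succeeded(r) for r in records])   # [True, False, False]
```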

2.3 Idempotent operations

Being able to replay requests would make Molly straightforward to apply, but we cannot do that today: when a request arrives, we do not know whether it is idempotent and can be safely replayed.

To compensate, we group requests into equivalence classes, where requests in the same class “behave” the same way, that is, they make the same dependency calls and fail in the same way.

To define a request class, we look at the information carried by the request:

  • Path (netflix.com/foo/bar)
  • Parameters (?baz=boo)
  • Information about the device that made the request

We first checked whether there was a direct mapping between these request features and the dependencies they exercise. There was not. Next, we explored using machine learning to find and build such mappings. That looked promising, but would take a lot of work to get right.

Instead, we narrowed the scope to requests generated by the Falcor framework. These requests specify, through a query parameter, a set of JSON paths to load, such as “video,” “profile,” and “image.” We found that these Falcor path elements match the internal services needed to load them.
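As an illustration of the idea (the query parameter name and the path-element-to-service table below are assumptions, not Netflix’s actual schema), a request class might be derived roughly like this:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical mapping from Falcor path elements to the internal services
# that load them.
PATH_ELEMENT_TO_SERVICE = {
    "video": "VideoMetadataService",
    "profile": "ProfileService",
    "image": "ImageService",
}

def request_class(url: str, device_type: str) -> tuple:
    """Group requests loading the same Falcor path elements from the same device type."""
    query = parse_qs(urlparse(url).query)
    elements = sorted({e for raw in query.get("paths", []) for e in raw.split(",")
                       if e in PATH_ELEMENT_TO_SERVICE})
    # Requests in the same class are expected to call the same dependencies
    # and fail in the same ways.
    return (device_type, tuple(elements))

print(request_class("https://netflix.com/falcor?paths=video,image", "ios"))
# ('ios', ('image', 'video'))
```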

Future work needs to find a more general way to create these request category mappings so that we can extend our testing beyond Falcor requests.

Request classes change as Netflix engineers write and deploy new code. To counteract this drift, we sample device-reported metrics daily and re-analyze the potential request classes: old classes that no longer receive traffic are expired, and new code paths create new classes.

3. User impact

Keep in mind that the goal of this exploration is to find failures and fix them before they affect a large number of users. It would defeat the purpose if running our tests itself caused significant user impact.

To mitigate this risk, we designed our exploration so that only a small number of experiments run in any given period.

Each experiment is scoped to a single request class, runs for a short time (20-30 seconds), and affects only a small number of users.

We want to have at least 10 good sample requests per experiment.

To filter out false positives, we look at the overall success rate of an experiment: a failure is only marked as found if more than 75% of the requests that had the fault injected actually failed. Because our request class mapping is not perfect, we also filter out failed requests that, for whatever reason, never had the fault injected.
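A sketch of that filter, assuming each request record notes whether the fault was actually injected and whether the request succeeded (field names are illustrative):

```python
FAILURE_THRESHOLD = 0.75   # more than 75% of injected requests must fail

def failure_found(requests) -> bool:
    """Only count a failure as found if most fault-injected requests failed."""
    # Our request class mapping is imperfect, so ignore requests that never
    # actually had the fault injected.
    injected = [r for r in requests if r["fault_injected"]]
    if not injected:
        return False
    failed = sum(1 for r in injected if not r["succeeded"])
    return failed / len(injected) > FAILURE_THRESHOLD
```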

Say we can run 500 experiments a day. If each run can affect 10 users, the worst case is 5,000 affected user requests per day.

But not every experiment results in failure. In fact, most experiments are successful.

If 10% of the experiments find a failure (a high estimate), we are actually impacting about 500 user requests per day, and retries can mitigate this further.

When billions of requests are being processed every day, the impact of these experiments is minimal.
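The arithmetic from the last few paragraphs, spelled out (the 500 experiments, 10 requests, and 10% figures come from the text above):

```python
experiments_per_day = 500        # experiments run per day
requests_per_experiment = 10     # sampled user requests per experiment
failure_rate = 0.10              # fraction of experiments that surface a failure (high estimate)

worst_case_impact = experiments_per_day * requests_per_experiment   # 5,000 requests/day
expected_impact = int(worst_case_impact * failure_rate)             # ~500 requests/day

print(worst_case_impact, expected_impact)
```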

4. Results

We were fortunate that the “App Boot” request, one of the most important Netflix requests, met our criteria for exploration. The request loads the metadata needed to run the Netflix app, as well as the user’s initial list of videos.

This is a critical moment for Netflix, and we want to win members over with a reliable experience from the very start.

This is also a very complex request, involving dozens of internal services and hundreds of potential points of failure.

Brute-force exploration of this space would take on the order of 2^100 iterations (a 1 followed by about 30 zeros), whereas our method was able to explore it in roughly 200 experiments, finding five potential failures, one of which was a combination of failure points.

What do we do once a failure is found?

Well, that part is still manual; we are not yet at the point where we can fix things automatically.

In this case, we end up with a list of known failure points and a “recipe” that lets someone reproduce the failure using FIT. From there, we can verify the failure and work out a fix.

We were happy to build this prototype implementation, validate it, and use it to find real bugs.

We also hope to extend it to automatically search more of the Netflix request space, find more potential user-impacting failure points, and resolve them before they actually happen!

Source: Chaos Engineering Practice

Original article: “Automated Failure Testing” (a.k.a. “Training Smarter Monkeys”), by Kolton Andrus and Ben Schmaus, Netflix Technology Blog
