This is a series of articles about experimentation and decision making at Netflix. It covers why experimentation matters to Netflix, the company's most important experimentation tool, the A/B test, and the role of data science and culture in the experimentation and decision-making process. This is the second article in the seven-part series. Original article: What is an A/B Test? [1]
- Netflix’s decision making
- What is A/B testing? 👈
- False positives and statistical significance in A/B test results
- False negatives and statistical power in A/B test results
- Build confidence in your decision making
- Experimentation is a major focus of data science
- Culture of learning
This is the second article in a series on how Netflix makes decisions and innovates on its product through A/B testing. See Part 1: Netflix's decision making. Subsequent articles will go into the statistics behind A/B tests, experimentation at Netflix, how Netflix invests in infrastructure to support and scale experimentation, and the importance of Netflix's experimentation culture.
An A/B test is a simple controlled experiment. For example, suppose we are considering a new product experience and want to know whether displaying all of the box art upside down in the TV UI would be better for our members.
Figure 1: How do we determine whether product experience B (with the upside-down box art) is better for our members?
To run the experiment, we take a subset of members (usually a simple random sample [2]) and use random assignment [3] to split that sample evenly into two groups. "Group A," often called the "control group," continues to receive the existing Netflix UI experience, while "Group B," often called the "treatment group," receives a different experience based on a specific hypothesis about how to improve the member experience (more on these hypotheses below). Here, Group B receives the upside-down box art.
We then wait a while and compare a variety of metrics between groups A and B, some of which are specific to the given hypothesis. In a UI experiment, for example, we might look at engagement with different variants of a new feature. In an experiment aimed at delivering more relevant search results, we measure whether members find more content to watch through search. In other kinds of experiments we might focus on more technical metrics, such as application load time or the video quality we can deliver under different network conditions.
Figure 2: A simple A/B test. We randomly assign Netflix members to two groups: "Group A" receives the current product experience, while "Group B" receives a change we believe improves the Netflix experience. Here, Group B gets the "upside-down" product experience. We then compare metrics between the two groups. Importantly, randomization ensures that, on average, everything else stays the same between the two groups.
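To make the mechanics concrete, here is a minimal Python sketch of this kind of random assignment. The member IDs, sample size, and group sizes are all made up for illustration; this is not how Netflix's actual allocation service works.

```python
import random

random.seed(42)  # for reproducibility of this toy example

# Hypothetical member IDs; in practice these would come from the real member base.
member_ids = list(range(1, 200_001))

# Take a simple random sample of members for the experiment...
sample = random.sample(member_ids, k=10_000)

# ...then split the sample evenly and at random into two groups.
random.shuffle(sample)
group_a = set(sample[:5_000])   # control: current product experience
group_b = set(sample[5_000:])   # treatment: upside-down box art

def experience_for(member_id: int) -> str:
    """Return which experience a sampled member should be served."""
    if member_id in group_a:
        return "A: current UI (control)"
    if member_id in group_b:
        return "B: upside-down box art (treatment)"
    return "not in the experiment"

print(experience_for(sample[0]))
```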
In many experiments (including the upside-down box art example), we need to think carefully about what our metrics are really telling us. Suppose we look at click-through rate, the fraction of members in each experience who click on a title. On its own this could be a misleading measure of success for the new UI, because members might click on an upside-down title simply to read it more easily. In that case we would also want to measure how many members then abandon the session rather than continue browsing.
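As a toy illustration of why click-through rate alone can mislead, the following sketch computes both the click-through rate and the abandon-after-click rate for each group. All counts are invented for the example.

```python
# Made-up per-group event counts from the hypothetical upside-down box-art test.
events = {
    "A": {"impressions": 50_000, "clicks": 4_000, "abandoned_after_click": 400},
    "B": {"impressions": 50_000, "clicks": 5_500, "abandoned_after_click": 2_600},
}

for group, e in events.items():
    ctr = e["clicks"] / e["impressions"]
    abandon_rate = e["abandoned_after_click"] / e["clicks"]
    print(f"Group {group}: CTR = {ctr:.1%}, abandoned after click = {abandon_rate:.1%}")

# A higher CTR in group B looks good in isolation, but if far more of those
# clicks end with the member leaving, the new UI may not be an improvement.
```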
In all cases, we also look at more general metrics designed to capture the joy and satisfaction Netflix delivers to our members. These include measures of member engagement with Netflix: does the idea we are testing help members choose Netflix as their way to relax and be entertained on a given night?
There is also a lot of statistics involved. How large a difference counts as significant? How many members do we need in a test to detect an effect of a given size? How do we analyze the data most efficiently? We will cover some of these details in later articles; here the focus is on high-level intuition.
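As a preview of the kind of calculation involved, here is a small, hand-rolled two-proportion z-test applied to made-up engagement counts. The real analysis pipeline is certainly more sophisticated than this sketch.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in proportions between groups A and B."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up counts: members in each group who streamed on a given night.
z, p = two_proportion_z_test(successes_a=2_200, n_a=5_000, successes_b=2_050, n_b=5_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```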
All else being equal
Because we created the control group ("A") and the treatment group ("B") through random assignment, we can be sure that, on average, individuals in the two groups are balanced across every dimension that might matter for the test. Random assignment ensures, for example, that average length of membership, content preferences, primary language choice, and so on do not differ meaningfully between the two groups. The only difference between the groups is the new experience we are testing, which ensures our estimate of the impact of that new experience is unbiased.
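The following sketch, using simulated member attributes (membership tenure and a made-up content-preference flag), shows the kind of balance you would expect to see after random assignment.

```python
import random
from statistics import mean

random.seed(0)

# Simulated member attributes; in a real test these would be observed covariates.
members = [
    {"tenure_months": random.randint(1, 120), "prefers_drama": random.random() < 0.4}
    for _ in range(10_000)
]

# Random assignment into two equal groups.
random.shuffle(members)
group_a, group_b = members[:5_000], members[5_000:]

# With randomization, pre-existing attributes should average out between groups.
for name, group in [("A", group_a), ("B", group_b)]:
    avg_tenure = mean(m["tenure_months"] for m in group)
    drama_share = mean(m["prefers_drama"] for m in group)
    print(f"Group {name}: mean tenure = {avg_tenure:.1f} months, "
          f"drama preference = {drama_share:.1%}")
```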
To understand why this matters, consider another way to make the decision: we could roll out the new upside-down box art experience to all Netflix members and watch whether any of our metrics changes appreciably. If there is a positive change, or no evidence of any meaningful change, we keep the new experience; if there is evidence of a negative change, we roll back to the previous product experience.
Suppose we do this, flipping the experience upside down on day 16 of the month. What would you do if you collected the following data?
Figure 3: Hypothetical data for an upside-down box art product experience released on day 16
The numbers look great: we launched a new product experience and member engagement is way up! But given that product experience B turns all of the box art in the UI upside down, is the new experience really better for members? How confident do you actually feel about that?
Do we really know that the new product experience caused the increase in engagement? Are there other possible explanations?
What if you knew that on the same day the (hypothetical) upside-down experience launched, Netflix also released a major hit, like a new season of Stranger Things [4] or Bridgerton [5], or a blockbuster film like Army of the Dead [6]? Now there is more than one plausible explanation for the rise in engagement: it could be the new product experience, it could be the hit new title, it could be a combination of the two, or it could be something else entirely. The point is that we do not know whether the new product experience caused the increase in engagement.
Suppose instead we run an A/B test of the upside-down product experience, with one group receiving the current product ("A") and the other receiving the upside-down product ("B"), and we collect the following data:
Figure 4: Hypothetical data from an A/B test of the new product experience
In this case we reach a different conclusion: the upside-down product experience generally produces lower engagement (no surprise!), and engagement in both groups rises when the blockbuster is released.
A/B testing lets us make causal statements. We randomly assigned members to groups A and B, and with everything else held the same between the two groups, we introduced the upside-down product experience only to Group B, so we can reasonably conclude (more on this next time) that the upside-down product caused the drop in engagement.
This is a deliberately extreme example, but in general there are always things we cannot control. If we roll an experience out to everyone and measure a single metric before and after the change, there can be relevant differences between the two time periods that prevent us from drawing causal conclusions. Maybe it is a new title, maybe it is a new partnership that makes Netflix available to more people; there is always something we do not know about. Running A/B tests, where possible, lets us establish cause and effect and make product changes confident that our members have voted for them through their actions.
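A toy simulation, with invented effect sizes, makes the contrast concrete: a pre/post comparison conflates the hit-title launch with the product change, while the concurrent control group isolates the product effect.

```python
import random

random.seed(7)

def nightly_engagement(upside_down: bool, hit_title_released: bool, n: int = 50_000) -> float:
    """Share of simulated members streaming on a given night (invented effect sizes)."""
    p = 0.40                                   # baseline probability of streaming
    p += -0.05 if upside_down else 0.0         # upside-down box art hurts engagement
    p += 0.08 if hit_title_released else 0.0   # a hit title lifts everyone
    return sum(random.random() < p for _ in range(n)) / n

# Naive pre/post rollout: the hit title launches on the same day as the UI change.
before = nightly_engagement(upside_down=False, hit_title_released=False)
after = nightly_engagement(upside_down=True, hit_title_released=True)
print(f"pre/post estimate of the change: {after - before:+.3f}")   # looks positive

# A/B test during the same launch: both groups see the hit, only B sees the change.
control = nightly_engagement(upside_down=False, hit_title_released=True)
treatment = nightly_engagement(upside_down=True, hit_title_released=True)
print(f"A/B estimate of the change:      {treatment - control:+.3f}")  # negative, as built in
```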
It all starts with an idea
An A/B test starts with an idea: a change to the UI, a personalization system to help members discover content, a new sign-up flow for prospective members, or anything else we believe, based on our experience at Netflix, could have a positive effect on members. Some of the ideas we test are incremental innovations, like improving the copy that appears in the Netflix product; others are more ambitious, like the test behind the "Top 10" list that Netflix now displays in the UI.
Like all product innovations at Netflix, the Top 10 started as an idea that we turned into a testable hypothesis. The core idea is that surfacing the titles that are popular in each country benefits our members in two ways. First, it satisfies the inherent desire to be part of a shared conversation, helping members connect with others by discussing popular shows. Second, it helps members find great content to watch.
Figure 5: An example Top 10 experience in the web UI
Next, we turn the idea into a testable hypothesis, a statement of the form: "If we change X, it will improve the member experience in a way that increases metric Y." For the Top 10 example, the hypothesis was: "Showing members a Top 10 experience will help them find something worth watching, increasing member joy and satisfaction." The primary decision metric for this test (and many others) is a measure of member engagement with Netflix: does the idea we are testing help members choose Netflix as their entertainment on a given night? Our research shows that this metric (details omitted) correlates with the probability of members retaining their subscriptions over the long term. We also run tests in other areas, such as the sign-up page experience or server-side infrastructure, where the primary decision metrics differ, but the principle is the same: in every test we need to measure whether we are delivering more long-term value to our members.
In addition to the primary decision metric for a test, we also consider a set of secondary metrics and how we expect them to respond to the product feature we are testing. The goal is to articulate the causal chain from how member behavior will change in response to the new product experience through to changes in our primary decision metric.
Laying out the causal chain between the specific product change and the primary decision metric, and monitoring the secondary metrics along that chain, helps us build confidence that any movement in the primary metric is the result of the causal chain we hypothesized, rather than an unexpected side effect of the new feature (or the result of a bug; more on that in a later article). For the Top 10 test, engagement is our primary decision metric, but we also look at viewing of the titles that appear in the Top 10 list, how much viewing is generated through that list versus the rest of the UI, and so on. If the Top 10 experience is genuinely good for members, we would expect the treatment group to show increased viewing of, and engagement with, the titles that appear in the Top 10 list.
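As a rough illustration, the sketch below computes one such secondary metric per group: the share of views that came through the Top 10 row. The handful of view events and the "top10_row"/"other" source labels are made up for the example.

```python
from collections import defaultdict

# A handful of made-up view events: (group, where the play originated).
view_events = [
    ("A", "other"), ("A", "other"), ("A", "other"), ("A", "other"),
    ("B", "top10_row"), ("B", "other"), ("B", "top10_row"), ("B", "other"),
]

views = defaultdict(lambda: {"top10_row": 0, "other": 0})
for group, source in view_events:
    views[group][source] += 1

# Secondary metric: what share of each group's views came through the Top 10 row?
for group in sorted(views):
    counts = views[group]
    total = counts["top10_row"] + counts["other"]
    share = counts["top10_row"] / total if total else 0.0
    print(f"Group {group}: {share:.0%} of views via the Top 10 row")
```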
Finally, because not every idea we test turns out to be right (and sometimes new features have bugs!), we also monitor "guardrail" metrics. The goal here is to limit unintended consequences and make sure a new product experience does not hurt the member experience in ways we did not anticipate. For example, we might compare customer service contacts between the control and treatment groups to see whether the new feature increases the contact rate, which could signal member confusion or dissatisfaction.
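A guardrail check of this kind might look something like the following sketch, which uses a one-sided comparison of contact rates between groups with invented counts and a hypothetical alpha threshold.

```python
from math import sqrt
from statistics import NormalDist

def contact_rate_guardrail(contacts_a, n_a, contacts_b, n_b, alpha=0.05):
    """Flag a breach if group B's customer-service contact rate is significantly higher."""
    p_a, p_b = contacts_a / n_a, contacts_b / n_b
    p_pool = (contacts_a + contacts_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)   # one-sided: is B worse than A?
    return {"rate_a": p_a, "rate_b": p_b, "p_value": p_value, "breach": p_value < alpha}

# Invented counts of members contacting customer service in each group.
print(contact_rate_guardrail(contacts_a=60, n_a=5_000, contacts_b=95, n_b=5_000))
```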
Conclusion
This article has laid the foundations of A/B testing: why running an A/B test is better than rolling out a feature and simply comparing metrics before and after the change, and how to turn an idea into a testable hypothesis. Next, we will discuss the basic statistical concepts involved in comparing metrics between the treatment and control groups.
References:
[1] What is an A/B Test?: Netflixtechblog.com/what-is-an-…
[2] Simple Random Sample: en.wikipedia.org/wiki/Sampli…
[3] Random Assignment: en.wikipedia.org/wiki/Random…
[4] Stranger Things: www.netflix.com/title/80057…
[5] Bridgerton: www.netflix.com/title/80232…
[6] Army of the Dead: www.netflix.com/title/81046…
Hello, my name is Yu Fan. I used to work in R&D at Motorola, and I now do technical work at Mavenir. I have long been interested in communications, networking, back-end architecture, cloud native, DevOps, CI/CD, blockchain, AI, and related technologies. My WeChat official account is DeepNoMind.