This is part of a series of articles about experimentation and decision making at Netflix. It introduces the importance of experimentation to Netflix, the most important experimentation tool, A/B testing, and the role of data science and culture in the experimentation and decision-making process. This is the third article in a seven-part series. Original: Interpreting A/B test results: False positives and statistical significance [1]
- Netflix’s decision making
- What is A/B testing?
- False positives and statistical significance of A/B test results 👈
- False negatives and statistical power of A/B test results
- Build confidence in your decision making
- Experimentation is a major focus of data science
- Culture of learning
This is the third article in a series on how Netflix makes decisions and improves its products using A/B testing. Missed the earlier articles? See Part 1 (Netflix's decision making) and Part 2 (What is A/B testing?). Later articles will go into detail on how Netflix runs experiments, how it has invested in infrastructure to support and scale experimentation, and the importance of its experimentation culture.
In Part 2 (What is A/B testing?), we discussed testing the Top 10 list on Netflix, where the primary decision metric measured member satisfaction with Netflix. If such a test shows a statistically significant improvement in the primary decision metric, the feature is likely to be rolled out to all of our members. But, given the test results, how do we know whether we made the right decision? It is important to acknowledge that no decision-making method can completely eliminate uncertainty and the possibility of error. Using a framework based on hypothesis generation, A/B testing, and statistical analysis, we can carefully quantify the uncertainty and understand the probability of making different kinds of mistakes.
There are two kinds of mistakes we can make when acting on test results. A false positive (also known as a Type I error) occurs when the test data indicate a meaningful difference between the control group and the test group when in fact there is none. It is like a medical test that comes back positive when you are in fact healthy. The other mistake is a false negative (also known as a Type II error), which happens when the data do not indicate a meaningful difference between the test group and the control group when in fact there is one. It is like a medical test that comes back negative when you do in fact have the disease you were tested for.
As another way to build intuition, consider what is jokingly called the real reason the Internet and machine learning exist: labeling whether or not an image shows a cat. For a given image, there are two possible decisions (apply the label "cat" or "not cat") and two possible truths (the image either contains a cat or it does not), giving four possible outcomes in total, as shown in Figure 1. The same applies to an A/B test: we make one of two possible decisions based on the data ("there is enough evidence to conclude that the Top 10 list satisfies members" or "there is insufficient evidence"), and there are two possible truths that we can never know with certainty ("the Top 10 list really does satisfy members" or "the Top 10 list does not satisfy members").
Figure 1: The four possible outcomes when labeling whether an image shows a "cat".
The uncomfortable truth about false positives and false negatives is that we cannot make both disappear at the same time. In fact, there is a tradeoff: designing an experiment with a very low false positive rate inevitably increases the false negative rate, and vice versa. In practice, our goal is to quantify, understand, and control these two sources of error.
Next, we’ll use simple examples to establish an intuitive sense of false positives and related statistical concepts, and we’ll do the same for false negatives in the next article in this series.
False Positives and Statistical Significance
With a good hypothesis and a clear understanding of the primary decision metric, it is time to turn to the statistical aspects of designing an A/B test. This process usually begins by deciding on an acceptable false positive rate. By convention, this false positive rate is usually set at 5%: for tests in which there is no meaningful difference between the test group and the control group, we will mistakenly conclude that there is a "statistically significant" difference 5% of the time. Tests run with this false positive rate are said to be run at a 5% significance level.
The 5% significance level convention is an uncomfortable one. By following it, we accept that, when the test and control experiences are in fact no different for our members, we will be wrong 5% of the time. In the cat-labeling analogy, it means that 5% of the time we will label a photo without a cat as "cat".
The false positive rate is closely related to the "statistical significance" of the observed difference in the metric between the test and control groups, which we measure with the p-value [2]. The p-value is the probability of seeing a result at least as extreme as the one observed in our A/B test, assuming there is truly no difference between the test and control groups. Students of statistics have been tripped up by statistical significance and p-values for over a century, and an intuitive way to understand them is through simple games of chance, where we can calculate and visualize all the relevant probabilities.
Figure 2: Thinking about simple games of chance, like this coin showing Julius Caesar, is a great way to build statistical intuition
Suppose we want to know whether a coin is fair, that is, whether the probability of heads is 0.5 (or 50%). This may sound like a contrived scenario, but it maps directly onto many business questions (including at Netflix), where the goal is to understand whether a new product experience affects some user action, be it clicking on a UI feature or retaining the Netflix service for another month. So we can build intuition for A/B testing through a simple coin game.
To determine whether a coin is fair, we run the following experiment: flip the coin 100 times and calculate the fraction of heads. Because of randomness, or "noise", even a perfectly fair coin would not be expected to yield exactly 50 heads and 50 tails. But how large a deviation from 50/50 is too large? When do we have enough evidence to reject the baseline assertion that the coin is, in fact, fair? Would you be willing to conclude the coin is unfair if 60 out of 100 flips were heads? What about 70? We need a way to align on a decision-making framework and understand the associated false positive rate.
To build intuition, let's run a thought exercise. First, we assume that the coin is fair; this is our "null hypothesis", which always represents the status quo or a statement of "no change". We then look in the data for compelling evidence against the null hypothesis. To decide what counts as compelling evidence, we calculate the probability of every possible outcome, assuming the null hypothesis is true. For the coin-flipping example, that is the probability of getting 0 heads, 1 head, 2 heads, and so on up to 100 heads, assuming the coin is fair. Skipping over the math, each possible outcome and its associated probability is shown by the black and blue bars in Figure 3 (ignore the colors for now).
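To make this concrete, here is a minimal sketch in Python (not code from the original article) that enumerates the null distribution plotted in Figure 3, using the binomial distribution for 100 flips of a fair coin:

```python
# A minimal sketch: the probability of each possible number of heads
# in 100 flips, assuming the null hypothesis (a fair coin) is true.
from scipy.stats import binom

n_flips = 100        # number of coin flips in the experiment
p_null = 0.5         # probability of heads under the null hypothesis

# Probability of observing exactly k heads, for k = 0, 1, ..., 100
null_distribution = {k: binom.pmf(k, n_flips, p_null) for k in range(n_flips + 1)}

# The single most likely outcome is 50 heads, but it is still not very likely:
print(f"P(exactly 50 heads) = {null_distribution[50]:.4f}")   # ~0.0796
```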
We can then compare the data we collected against this probability distribution of outcomes under the assumption that the coin is fair. Suppose we observe that 55% of the 100 flips are heads (the solid red line in Figure 3). To quantify whether this observation is compelling evidence that the coin is unfair, we add up the probabilities of all outcomes that are at least as unlikely as the one we observed. Here, since we have not assumed in advance that heads or tails is more likely, we sum the probability of more than 55% heads (to the right of the red line) and the probability of more than 55% tails (to the left of the red line).
This is the p-value: given the null hypothesis, the probability of seeing a result at least as extreme as the one we observed. In our example, where the null hypothesis is a fair coin and we observe heads on 55% of 100 flips, the p-value is about 0.32. The interpretation is as follows: if we repeated the experiment of flipping a coin 100 times and calculating the fraction of heads many times, then with a fair coin (the null hypothesis is true) about 32% of those experiments would yield at least 55% heads or at least 55% tails (a result at least as extreme as the one we actually observed).
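As a concrete illustration (again, not code from the original article), the following sketch uses the normal approximation to the binomial, one standard way to carry out this calculation, and reproduces a two-sided p-value of about 0.32:

```python
# A hedged sketch of the two-sided p-value calculation described above,
# using the normal approximation to the binomial distribution.
from math import sqrt
from scipy.stats import norm

n_flips = 100
observed_heads = 55
p_null = 0.5                      # fair-coin null hypothesis

observed_fraction = observed_heads / n_flips
standard_error = sqrt(p_null * (1 - p_null) / n_flips)   # 0.05
z = (observed_fraction - p_null) / standard_error        # 1.0

# Two-sided p-value: probability of a result at least this extreme in
# either direction, assuming the null hypothesis is true.
p_value = 2 * (1 - norm.cdf(abs(z)))
print(f"p-value = {p_value:.2f}")   # ~0.32
```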
Figure 3: The probability of each outcome when flipping a fair coin 100 times, expressed as the fraction of heads.
How do we use the p-value to decide whether there is statistically significant evidence that the coin is unfair, or that our new product experience is an improvement over the status quo? Recall the 5% false positive rate we agreed to accept at the outset: we conclude that there is a statistically significant effect if the p-value is less than 0.05. This has an intuitive interpretation: if our result would be highly unlikely under the assumption of a fair coin, we should reject that null hypothesis. In the example of observing 55 heads out of 100 flips, we calculated a p-value of about 0.32. Because this is greater than the 0.05 significance level, we conclude that there is not statistically significant evidence that the coin is unfair.
There are two conclusions we can draw from an experiment or A/B test: either we claim an effect ("the coin is unfair", "the Top 10 list increases member satisfaction"), or we claim there is insufficient evidence to conclude there is an effect ("we cannot conclude the coin is unfair", "we cannot conclude that the Top 10 list increases member satisfaction"). It is much like a jury trial, where the two possible outcomes are "guilty" or "not guilty", and "not guilty" is very different from "innocent". Likewise, this (frequentist [3]) approach to A/B testing does not allow us to conclude that there is no effect: we never conclude that the coin is fair, or that a new product feature has no impact on our members. We simply conclude that we have not gathered enough evidence to reject the null hypothesis of no difference. In the coin example above, we observed 55% heads in 100 flips and concluded that we did not have enough evidence to claim the coin was unfair. Crucially, we did not conclude that the coin was fair; after all, had we gathered more evidence, say by flipping the same coin 1,000 times, we might have found evidence compelling enough to reject the null hypothesis that it is fair.
Rejection Regions and Confidence Intervals
There are two other concepts closely related to p-values in A/B testing: the rejection region of a test and the confidence interval of an observation. We introduce both in this section, building on the coin example above.
Rejection regions. Another way to build a decision rule for a test is in terms of its "rejection region": the set of outcomes that would lead us to conclude that the coin is unfair. To calculate the rejection region, we once again assume the null hypothesis is true (the coin is fair) and then define the rejection region as the set of least likely outcomes whose probabilities sum to no more than 0.05. The rejection region consists of the most extreme outcomes given that the null hypothesis is true, that is, the outcomes providing the strongest evidence against it. If an observation falls in the rejection region, we conclude that there is statistically significant evidence that the coin is unfair and we "reject" the null hypothesis. For the simple coin experiment, the rejection region corresponds to observing fewer than 40% or more than 60% heads (the blue shaded bars in Figure 3). We call the boundaries of the rejection region (here, 40% and 60% heads) the critical values, or thresholds, of the test.
The rejection region is equivalent to the p-value; both lead to the same decision: the p-value is less than 0.05 if and only if the observation lies in the rejection region.
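As a rough illustration (not code from the original article), the following sketch computes the rejection region for the coin experiment by accumulating the least likely, symmetric outcomes until the 5% budget would be exceeded; it recovers the 40% and 60% thresholds mentioned above:

```python
# A minimal sketch of how the rejection region could be computed for the
# coin example: take the most extreme (least likely) symmetric outcomes
# whose probabilities sum to no more than 0.05 under the fair-coin null.
from scipy.stats import binom

n_flips, p_null, alpha = 100, 0.5, 0.05

# Grow the rejection region inward from the extremes (0 and 100 heads)
# until adding the next pair of outcomes would exceed the 5% budget.
lower, upper = 0, n_flips
total_prob = 0.0
while True:
    next_prob = binom.pmf(lower, n_flips, p_null) + binom.pmf(upper, n_flips, p_null)
    if total_prob + next_prob > alpha:
        break
    total_prob += next_prob
    lower += 1
    upper -= 1

print(f"Reject the null hypothesis if heads < {lower} or heads > {upper}")  # < 40 or > 60
print(f"False positive rate of this rule: {total_prob:.3f}")                # ~0.035
```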
Confidence intervals. So far, we have built decision rules by starting from the null hypothesis, which is always a statement of no change or equivalence ("the coin is fair" or "the product innovation does not affect member satisfaction"). We then define the possible outcomes under this null hypothesis and compare our observation to that distribution. To understand confidence intervals, it helps to flip the question around and start from the observation. We then run a thought exercise: given the observation, and holding the 5% false positive rate fixed, which null hypotheses would we not reject? In our coin-flipping example, observing 55 heads in 100 flips means we would not reject the null hypothesis of a fair coin. Nor would we reject null hypotheses stating that the probability of heads is 47.5%, 50%, or 60%. The full range of null hypothesis values we would not reject runs from about 45% to 65% (Figure 4).
That range of values is the confidence interval: the set of null hypothesis values that, given the data from the test, would not be rejected. Because we constructed this interval using tests at the 5% significance level, we have built a 95% confidence interval. The interpretation is that, under repeated experiments, the confidence interval will cover the true value (here, the actual probability of heads) 95% of the time.
The confidence interval and the p-value are equivalent, and both lead to the same decision: the p-value is less than 0.05 if and only if the 95% confidence interval does not cover the null value. In either framing, we reject the null hypothesis of no effect.
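As a minimal sketch (not from the original article), the following code computes an approximate 95% confidence interval for the coin example using the normal approximation, recovering roughly the 45% to 65% range mentioned above:

```python
# A hedged sketch of the 95% confidence interval for the coin example,
# using the normal approximation (a Wald interval). This is illustrative,
# not the exact procedure used to produce Figure 4.
from math import sqrt
from scipy.stats import norm

n_flips = 100
observed_heads = 55
observed_fraction = observed_heads / n_flips

z_critical = norm.ppf(0.975)   # ~1.96 for a 95% interval (5% significance level)
standard_error = sqrt(observed_fraction * (1 - observed_fraction) / n_flips)

lower = observed_fraction - z_critical * standard_error
upper = observed_fraction + z_critical * standard_error
print(f"95% CI for the probability of heads: ({lower:.2f}, {upper:.2f})")  # ~(0.45, 0.65)
```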
Figure 4: The confidence interval is built by mapping out the set of null hypothesis values that, given the observation, would not be rejected.
Conclusion
Through a series of coin-flipping thought exercises, we have built intuition around false positives, statistical significance and p-values, rejection regions, confidence intervals, and the two decisions we can make based on test data. These core concepts and intuitions map directly onto comparing test and control group data in an A/B test. We define a "null hypothesis" of no difference: the "B" experience does not change member satisfaction. We then run the same thought experiment: assuming there is no difference in member satisfaction, what are the possible outcomes for the measured difference between the test and control groups, and their associated probabilities? We can then compare the observation from the experiment to this distribution, just as in the coin example, compute a p-value, and draw a conclusion about the test. And just as in the coin example, we can define a rejection region and compute a confidence interval.
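To connect this back to A/B testing, here is a hedged sketch, with entirely made-up numbers, of how the same machinery applies to comparing a binary metric between a control group and a test group. This is illustrative only and is not Netflix's actual analysis code or metric:

```python
# A hedged sketch: a two-proportion z-test comparing a hypothetical binary
# metric (e.g. retention) between control and test groups. All numbers are
# invented for illustration.
from math import sqrt
from scipy.stats import norm

# Hypothetical results: members retained out of members allocated
control_successes, control_n = 8_800, 10_000
test_successes, test_n = 8_950, 10_000

p_control = control_successes / control_n
p_test = test_successes / test_n

# Pooled proportion under the null hypothesis of "no difference"
p_pooled = (control_successes + test_successes) / (control_n + test_n)
standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / control_n + 1 / test_n))

z = (p_test - p_control) / standard_error
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"Observed difference: {p_test - p_control:+.3f}, p-value: {p_value:.3f}")
# Reject the null hypothesis of "no difference" only if p_value < 0.05
```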
But false positives are just one of the two mistakes we can make when acting on test results. In the next article in this series, we will discuss the other type of error, false negatives, and the closely related concept of statistical power.
References:
[1] Interpreting A/B test results: false positives and statistical significance: netflixtechblog.com/interpretin…
[2] The ASA Statement on p-Values: Context, Process, and Purpose: www.tandfonline.com/doi/pdf/10….
[3] Frequentist inference: en.wikipedia.org/wiki/Freque…
Hello, my name is Yu Fan. I used to do R&D at Motorola and now work on technical projects at Mavenir. I have always been interested in communications, networking, back-end architecture, cloud native, DevOps, CI/CD, blockchain, AI, and other technologies. My official WeChat account is DeepNoMind.