This article was first published on the vivo Internet Technology WeChat public account. Link: mp.weixin.qq.com/s/mO5MdwG7a… Author: DuZhimin

More and more companies are experimenting with ABTest, either by building their own systems or by relying on third-party systems. So what basics do we need to know when we run an ABTest? How do you carry out an AB experiment step by step? This article walks through the process of an AB experiment to answer these questions.

1. Introduction

In the business development of Internet companies, user growth is the eternal theme: without growth, there is no development. In the early stage of a business, faster product iteration is therefore usually better. In a word: “whatever is fastest”.

However, once the business reaches a certain stage, the dividend of unchecked growth gradually fades and the room for user growth under existing strategies becomes less obvious. Planning the product iteration strategy sensibly then becomes particularly important, and judging whether a product strategy is effective usually requires letting the data speak. The results determine how long a product or strategy lives and how resources are allocated to it, since we do not want to waste resources on ineffective products or strategies.

So what tools or methods can ensure that a data-driven strategy is implemented effectively? Many companies do this through ABTest, building an experiment infrastructure platform to support it.

In 2019, we built vivo’s ABTest experiment platform (the Hawking experiment platform). So far, 14 business parties have connected to the platform and 40 experiments have been run. In communicating with the business parties, we found that our understanding of ABTest was not deep enough, so let’s go through the relevant knowledge together to understand ABTest better.

ABTest is usually used to compare the effect of setting different values of a variable in different versions of a product (such as a red button on one page and a blue button on another), where version A is the version currently in use and version B is the improved version. An experiment generally compares the experimental group and the control group to see whether certain indicators differ; more often, the aim is to see whether the experimental group performs better than the control group on a given indicator. In statistics, this kind of comparison is called a two-sample hypothesis test: the experimental group and the control group are the two samples. The null hypothesis H0 is that there is no significant difference between the experimental group and the control group; the alternative hypothesis H1 is that there is a significant difference between them.

Most of the time, we focus on proportion-type indicators, such as CTR, conversion rate, and retention rate. The characteristic of these values is that for a given user (each sample point in the sample) there are only two outcomes, “success” or “failure”; for the whole sample, the value is the percentage of users whose outcome was “success”. Take conversion rate as an example: a particular user either converts or does not. In statistics, hypothesis testing on proportion-type values is called a two-sample proportion hypothesis test.

Let’s explain it using an experiment on the account device login rate.

2. Preparation before experiment

1. Before we do the experiment, let’s answer the following questions:

1.1. What do you want to prove by your experiment?

A: I would like to improve the account device login rate by changing the color of the device login button.

1.2. What will your control group and experimental group look like?

A: The control group is what it looks like now. Please see the picture below. The login button has a blue background.

[Perfect Step 1: Determine the experimental group and control group]

1.3. How to avoid confounding factors?

(Confounding factors are individual differences among subjects that are not what you are trying to compare but that can still distort the analysis, for example city, age, or gender. During the experiment, their influence on the results should be avoided as far as possible.)

A: The question here is how to choose the samples for the control group and the experimental group so that the individual differences between the two groups are as small as possible. The Hawking platform already handles assigning users to each scheme, and it supports several strategies for dividing users (unique identifier hashing, specifying particular users, by user tag…). We are going to use unique identifier hashing, which in effect selects randomly from the requesting users. This is an effective way to avoid confounding factors, because anything that might act as a confounding factor ends up with roughly equal weight in the control and experimental groups.

The following figure shows the traffic-splitting strategies supported by the experiment platform:

Perfect Step 2: Eliminate confounding Factors
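To make the idea of unique identifier hashing concrete, here is a minimal sketch in Python. It is not the Hawking platform's implementation, which the article does not show; the function name `assign_group`, the use of SHA-256, the 10,000 buckets, and the experiment ID are all assumptions made for illustration.

```python
import hashlib


def assign_group(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hypothetical helper: hashing the experiment ID together with the user ID
    gives each experiment its own independent split of the same users.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 8 hex digits as an integer and reduce it to a bucket 0..9999.
    bucket = int(digest[:8], 16) % 10000
    return "treatment" if bucket < treatment_share * 10000 else "control"


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_group(uid, "login_button_color"))
```

Because the assignment depends only on the identifiers, the same user always sees the same scheme, and the split does not depend on city, age, or any other user attribute.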

2. Sample size

How many samples do you need for an A/B experiment? This is a question we all have to answer when we run experiments. (In fact, traffic for Internet applications is usually very large, so sample size is seldom a limiting factor. We still discuss it here because it involves other concepts that we need to understand.)

2.1. Why do we need to calculate sample size?

In theory, the larger the sample size, the better:

  • Intuitively, when the number of samples is small, the experiment is easily skewed by new sample points, so the results are unstable and it is hard to draw a definite conclusion. With more samples, the experiment has more “evidence” and its results are more reliable.

In practical operation, the sample size should be as small as possible, because:

  1. Limited traffic: Large companies have enough users to run dozens or even hundreds of experiments at the same time, but small companies have little traffic and many new products to develop. If the samples of different experiments must not overlap, the speed of product development is greatly reduced.

  2. High cost of trial and error: Suppose we run the experiment on 50% of users and, after a week, find that the total income of the experimental group has dropped by 20%. The experiment has cost the company 10% of its income for that week (50% × 20% = 10%). The cost of trial and error is too high.

2.2. Confidence level and detection efficiency (statistical power)

To understand these two concepts, let’s take a look at the basics of A/B experiments.

First, two assumptions of A/B testing:

The null hypothesis (also known as H0): the hypothesis that we hope to disprove with the experimental results. In our example, the null hypothesis can be expressed as “the orange button and the blue button have the same device login rate”.

The alternative hypothesis (also known as H1): the hypothesis that we hope to verify with the experimental results. In our example, it can be expressed as “the orange button and the blue button have different device login rates”.

The essence of A/B testing is to use experimental data to determine whether H0 is correct or not. Here are four things that can happen:

1. There is no difference in device login rate (H0 is correct), but the experimental analysis results show that there is a difference:

This kind of mistake is called a Type I error, and we denote its probability by α. Confidence level = 1 − α. A Type I error means that the new product does not actually improve the business, but we mistakenly conclude that it does. Such results not only waste the company’s resources but may also steer the product in the wrong direction.

So, when doing A/B testing, we want the probability of a Type I error to be as low as possible. In practice, we set an upper limit for α, usually 5%. In other words, we design the experiment so that the probability of a Type I error is no more than 5%.

2. The device login rate is different (H1 is correct), but the experimental analysis results show no difference:

We are wrong again. This kind of mistake is called a Type II error, and we denote its probability by β. We generally require β to be no more than 20%.

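3. The remaining two cases are the ones where the judgment is correct: there is no difference and the analysis shows none, or there is a difference and the analysis detects it. We call the probability of correctly detecting a real difference the detection efficiency (statistical power).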

The fundamental purpose of our experiment is to detect the difference in device login rate between the orange button and the blue button. If the detection efficiency is low, then even if the new product really works, the experiment will fail to detect it. In other words, the experiment is useless.

Detection efficiency is the conditional probability of detecting a difference given that H1 is correct, so detection efficiency = 1 − β. With β = 20%, the detection efficiency is 80%.

An important idea of A/B testing follows from these limits (β / α = 20% / 5% = 4): it is better to kill four good products than to put one bad product on the market.
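As a rough illustration of how detection efficiency depends on sample size, the sketch below estimates it for a two-sided test on two proportions using a normal approximation. The function name `approx_power` and the example rates (15% versus 15.75%) are assumptions for illustration, and the approximation ignores the very small chance of detecting a difference in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1: float, p2: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate detection efficiency (1 - beta) for a two-sided test.

    p1 and p2 are the true rates of the control and experimental groups;
    n_per_group is the number of users in each group.
    """
    z = NormalDist()
    # Standard error of the difference between the two observed rates.
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    return z.cdf(abs(p2 - p1) / se - z_crit)


if __name__ == "__main__":
    for n in (5_000, 20_000, 36_307):
        print(n, round(approx_power(0.15, 0.1575, n), 3))
```

With these assumed rates, a few thousand users per group gives a detection efficiency far below 80%, which is the situation where a real improvement would usually go unnoticed.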

2.3. Calculation formula of sample size

In most cases, we do not need to know the sample size formula in detail, but it is given here for reference.

In the formula, N is the sample size required for each group. P1 is called the base value: the current value of the key indicator the experiment is concerned with (the control group). P2 is called the target value: the level to which we hope to raise the indicator through the experiment. α and β are the probabilities of a Type I error and a Type II error, generally 0.05 and 0.2 respectively. Z is the quantile function of the standard normal distribution.

Because an ABTest usually has at least 2 groups, the total sample size required for the experiment is at least 2N.
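The sketch below shows one commonly used form of the sample size calculation for two proportions, N = (Z(1 − α/2) + Z(1 − β))² × (P1(1 − P1) + P2(1 − P2)) / (P1 − P2)². This may differ in detail from the formula in the original figure (for example, some versions use a pooled variance), so treat it as an assumption. The function name `sample_size_per_group` is hypothetical, and the example treats a 5% improvement on a 15% baseline as relative (15% to 15.75%), which is also an assumption.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1: float, p2: float, alpha: float = 0.05, beta: float = 0.2) -> int:
    """Minimum sample size N for each group (two-sided test, normal approximation)."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # quantile for the Type I error limit
    z_beta = z.inv_cdf(1 - beta)        # quantile for the Type II error limit
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    n = sample_size_per_group(0.15, 0.1575)
    print("per group:", n, "total for two groups:", 2 * n)
```

Note how quickly N grows as the expected improvement shrinks: the difference P1 − P2 is squared in the denominator, so halving the improvement roughly quadruples the required sample.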

2.4. I can’t calculate such a complicated formula. What should I do?

The Hawking platform now provides a small tool that calculates the sample size once you fill in a few numbers:

Description:

Current business ratio (baseline ratio): In the account device login rate experiment, for example, the baseline ratio is the current device login rate, such as 15%.

Minimum expected improvement (minimum detectable effect): In our experiment, we chose a minimum expected improvement of 5%. This means that if the orange button really does increase the device login rate by 5%, we want the experiment to be able to detect the difference with sufficient confidence.

Number of groups: A normal AB experiment has two groups, one control group and one experimental group.

Perfect Step 3: Calculate the minimum sample size

3. Determine indicators

As noted in the introduction, an experiment usually compares the experimental group and the control group on certain indicators, most often to see whether the experimental group performs better. Therefore, before running the experiment, we should first decide which indicators to compare. We focus mainly on proportion-type indicators, such as click-through rate, conversion rate, and retention rate. The significance analysis later in this article also deals with proportion-type indicators.

Perfect Step 4: Determine experimental indicators

4. Event tracking

Once we have decided which indicators to analyze, we need to design the event tracking and collect the relevant user behavior for later data analysis, so that we can draw experimental conclusions.

For ABTest, we need to know whether the current user is in the control group or the experimental group, so this information must be included in the tracking data.

At present, the Hawking experiment platform uses server-side event tracking, so the business side’s tracking data does not have to report whether the user is in the experimental group or the control group. Even so, we recommend that the business side also record which scheme the user is in, so that the data is more accurate and the analysis results are more reliable.

Perfect Step 5: Collect experimental data
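As an illustration of recording the user's scheme alongside each behavior event, here is a minimal sketch. The field names (`experiment_id`, `group`, and so on) and the helper `build_event` are hypothetical and do not describe the Hawking platform's actual reporting format.

```python
import json
import time

ALLOWED_GROUPS = ("control", "treatment")


def build_event(user_id: str, event_name: str, experiment_id: str, group: str) -> dict:
    """Build a tracking event that carries the user's experiment group."""
    if group not in ALLOWED_GROUPS:
        raise ValueError(f"unknown group: {group!r}")
    return {
        "user_id": user_id,
        "event": event_name,             # e.g. a click on the login button
        "experiment_id": experiment_id,  # which experiment the user is in
        "group": group,                  # which scheme the user saw
        "timestamp": int(time.time()),
    }


if __name__ == "__main__":
    event = build_event("user-1", "login_click", "login_button_color", "treatment")
    print(json.dumps(event))
```

With the group recorded on every event, indicators such as the login rate can be computed per group directly from the event data.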

3. Observation during the experiment

1. Observe whether the sample size is in line with expectations, for example whether traffic is split evenly between the experimental group and the control group. Under normal circumstances the split should not differ much. If the difference is large, we need to find out where the problem occurred.

2. Observe whether user behavior is being recorded correctly. In many experiments, we have found that the event tracking was wrong.
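One way to judge whether the traffic split mentioned in item 1 differs “too much” is a chi-square goodness-of-fit test on the two group sizes. The article does not say how the platform checks this, so the sketch below is only an assumption; the function name `split_balance_p_value` is hypothetical.

```python
from math import erfc, sqrt


def split_balance_p_value(n_control: int, n_treatment: int,
                          expected_treatment_share: float = 0.5) -> float:
    """P value of a chi-square test (1 degree of freedom) on the group sizes.

    A very small P value suggests that the observed split does not match the
    configured split and the assignment or tracking should be investigated.
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


if __name__ == "__main__":
    print(split_balance_p_value(10_000, 10_100))  # small imbalance
    print(split_balance_p_value(10_000, 10_600))  # large imbalance
```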

4. Post-experiment analysis

1. After running an ABTest, we need to analyze the data to determine the effect of the experiment. This means carrying out a significance analysis to see whether the difference between the groups is significant. If the results are not significant, they have no reference value.

2. “Significant difference” is a statistical term: a statistical assessment of the difference between sets of data. When drawing conclusions, P > 0.05 usually indicates that the difference is not significant, 0.01 < P < 0.05 indicates a significant difference, and P < 0.01 indicates an extremely significant difference.

3. A significant difference means that the data being compared come from two different populations. The difference may arise because the subjects themselves differ (for example, middle-aged people compared with the elderly), or because the experimental treatment changed the subjects’ behavior. The latter is what we hope for in an AB experiment, where we expect the experimental data to differ significantly.

4. The method for calculating the significance of proportion indicators is given below for your reference (independent-samples t-test). To calculate the P value, we first calculate the t value, using the following formula:


After calculating the t value, convert it into a P value using the t value and the degrees of freedom n = N1 + N2 − 2. In Excel, the formula is P = TDIST(t, n, 1), where the last argument, 1, requests a one-tailed P value.

5. Does the business side need to carry out such a complex significance calculation itself?

A: No, the Hawking platform already supports significance calculation of experimental indicators:


[Perfect Step 6: Index significance calculation]
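For readers who want to reproduce a calculation of this kind by hand, here is a minimal sketch of a pooled-variance independent-samples t statistic applied to 0/1 outcomes. It may not match the formula in the original figure or the platform's calculation. Python's standard library has no t distribution, so the P value below uses a normal approximation, which is close to the t distribution only when the degrees of freedom are large; that substitution, the function names, and the example counts are all assumptions.

```python
from math import sqrt
from statistics import NormalDist


def proportion_t_value(x_ctrl: int, n_ctrl: int, x_exp: int, n_exp: int):
    """Return (t, degrees of freedom) for success counts x out of n users."""
    p_c, p_e = x_ctrl / n_ctrl, x_exp / n_exp
    df = n_ctrl + n_exp - 2
    # Pooled sample variance of the 0/1 outcomes across both groups.
    pooled_var = (n_ctrl * p_c * (1 - p_c) + n_exp * p_e * (1 - p_e)) / df
    se = sqrt(pooled_var * (1 / n_ctrl + 1 / n_exp))
    return (p_e - p_c) / se, df


def one_tailed_p(t: float) -> float:
    """One-tailed P value via the normal approximation (large samples only)."""
    return 1 - NormalDist().cdf(abs(t))


def significance_label(p: float) -> str:
    """Apply the thresholds from item 2 (exact boundary values are unspecified there)."""
    if p < 0.01:
        return "extremely significant"
    if p < 0.05:
        return "significant"
    return "not significant"


if __name__ == "__main__":
    t, df = proportion_t_value(1500, 10_000, 1600, 10_000)
    p = one_tailed_p(t)
    print(round(t, 3), df, round(p, 4), significance_label(p))
```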

6. As item 3 shows, a significant difference does not necessarily mean that the experiment is effective; it may be caused by confounding factors. We therefore need to analyze the experimental samples further to determine whether the experiment was affected by confounding factors.

Perfect Step 7: Determine the root cause of significance

7. Finally, based on the analysis, we conclude whether the experiment is effective and, if so, how much improvement it brings to the business side.

Perfect Step 8: Give experimental results
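To state how much improvement an experiment brought, one option is to report the absolute and relative lift together with a confidence interval for the difference. The sketch below uses a normal approximation with unpooled variance; it is an illustration only, not the platform's method, and the function name `lift_summary` and the example counts are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def lift_summary(x_ctrl: int, n_ctrl: int, x_exp: int, n_exp: int,
                 confidence: float = 0.95) -> dict:
    """Absolute lift, relative lift, and a confidence interval for the absolute lift."""
    p_c, p_e = x_ctrl / n_ctrl, x_exp / n_exp
    diff = p_e - p_c
    # Standard error of the difference, keeping each group's own variance.
    se = sqrt(p_c * (1 - p_c) / n_ctrl + p_e * (1 - p_e) / n_exp)
    margin = NormalDist().inv_cdf(1 - (1 - confidence) / 2) * se
    return {
        "absolute_lift": diff,
        "relative_lift": diff / p_c,
        "ci_low": diff - margin,
        "ci_high": diff + margin,
    }


if __name__ == "__main__":
    summary = lift_summary(1500, 10_000, 1650, 10_000)
    for name, value in summary.items():
        print(name, round(value, 4))
```

An interval that lies entirely above zero is consistent with a real improvement; its width shows how precisely the size of the improvement is known.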

8. We hear that the Hawking experiment platform will support real-time viewing of traffic-split data and indicator data for each scheme. Is that so?

A: Yes, and it shouldn’t be long.

5. Summary

How to do a perfect ABTest?

1. Determine the control group and the experimental group. It is best to run a single-variable experiment, changing only one variable at a time.

2. When splitting traffic, try to exclude confounding factors; random splitting is generally used.

3. Check whether the traffic meets the minimum sample size. If it does not, the subsequent analysis cannot be carried out and the experimental results are not credible.

4. Determine the comparison indicators for the experiment, that is, what will be used to measure the difference between schemes.

5. Collect user behavior data accurately, which requires the event tracking to be correct.

6. Analyze the significance of the indicators. If the indicators are not significant, the experiment does not support a conclusion.

7. Determine the root cause of significance and rule out confounding factors as the reason for the significant result.

8. Finally, give the experimental conclusion: effective or ineffective.

Note: To reprint this article, please contact our WeChat account: Labs2020.