Bayesian AB Testing

by Maciej Kula on Saturday, 10 May 2014

Click here if you are looking for our interactive A/B testing inference machine.

Otherwise, read on!

A/B testing is an excellent tool for deciding whether or not to go ahead with rolling out an incremental feature.

To perform an A/B test, we divide users randomly into a test and control group, then serve the new feature to the test group while the control group continues to experience the current version of the product. If our randomization procedure is correct, we can attribute any difference in outcomes (for example, conversion rate) between the two groups to the change we are testing without having to account for other sources of variation affecting users’ behaviour.

Before acting on the results, we must understand the likelihood that any performance differences we observe are due merely to chance rather than to the change we are testing. For example, it is perfectly possible to obtain different heads/tails ratios between two fair coins if we only conduct a limited number of throws. In the same manner, it is possible for us to see a change between the A and B branches even though in truth the underlying user behaviour is the same.

At Lyst, we use Bayesian methods to do this. We think this helps us avoid some common pitfalls of statistical testing and makes our analysis easier to understand and communicate to non-technical audiences. Read on to see why (or jump straight to our interactive A/B testing inference machine here).

Bayesian inference

The essence of Bayesian methods consists in identifying our prior beliefs about what results are likely, and then updating those according to the data we collect.

For example, if our conversion rate is 5%, we may say that it’s reasonably likely that a change we want to test could improve that by 5 percentage points—but that it is most likely that the change will have no effect, and that it is entirely unlikely that the conversion rate will shoot up to 30% (after all, we are only making a small change).

As the data start coming in, we start updating our beliefs. If the incoming data points point to an improvement in the conversion rate, we start moving our estimate of the effect from the prior upwards; the more data we collect, the more confident we are in it and the further we can move away from our prior. The end result is what is called the posterior—a probability distribution describing the likely effect of our treatment.

For example, we may want to use Bayesian inference to analyze the results of an A/B test where the result variable is conversion rate. We know that our current conversion rate is around 5%, and that very high conversion rates (above 20%) are extremely unlikely. To formalize this belief we could say that possible conversion rates are described by the Beta distribution with parameters \(\alpha=3\) and \(\beta=50\):
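A quick sketch of what that prior implies, using numpy; the parameter values come from the text, while everything else (seed, sample size) is our own choice:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta_ = 3, 50                 # the Beta prior described above
draws = rng.beta(alpha, beta_, size=200_000)

print(f"prior mean:    {alpha / (alpha + beta_):.3f}")  # close to the 5% baseline
print(f"P(rate > 20%): {(draws > 0.20).mean():.5f}")    # very high rates are implausible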

We can then simulate observations from a distribution with a 20% conversion rate and see our estimate shift from the prior to our posterior as we gather an increasing number of observations:
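For this model the updating is simple conjugate arithmetic: a Beta(\(\alpha\), \(\beta\)) prior observing \(s\) successes and \(f\) failures yields a Beta(\(\alpha + s\), \(\beta + f\)) posterior. A sketch of the simulation (the 20% true rate and the Beta(3, 50) prior follow the text; the sample sizes and seed are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta_ = 3.0, 50.0             # prior from the text
true_rate = 0.20                     # the simulated "true" conversion rate
observations = rng.random(2000) < true_rate

for n in (10, 100, 2000):
    successes = observations[:n].sum()
    # Conjugate update: the posterior is Beta(alpha + s, beta + f),
    # whose mean is (alpha + s) / (alpha + beta + n).
    posterior_mean = (alpha + successes) / (alpha + beta_ + n)
    print(f"n={n:5d}  posterior mean: {posterior_mean:.3f}")
```

With few observations the estimate stays near the prior mean (about 0.057); as data accumulate it moves towards the true 20% rate.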

Intuitively, the updating process will more readily accept estimates consistent with the prior (if we believe ex ante that a 10% conversion rate is perfectly likely, our posterior will move there readily after a small number of observations), but will require more data to accept estimates that are less probable according to the prior.

This is all very straightforward: we start with a prior belief, and then update it in line with the incoming data. Most people, however, use a different approach to inference.

Significance testing

The more popular alternative to Bayesian inference is classical significance testing. Significance testing proceeds by formulating a so-called null hypothesis—a proposition expressing the belief that the observed difference in outcomes is merely due to chance (while in reality the treatment has no effect)—and then checking if the data we gathered provide sufficient support to reject it.

One way of going about this is as follows. If the null hypothesis is true, we know that our data should behave in a certain way. In particular, we can formulate a statistic (a number derived from our data using a formula) which—assuming the null hypothesis is true—follows a known distribution (according to our favourite variety of the Central Limit Theorem).

For example, if we are measuring the difference in conversion rates in the two groups (and the null hypothesis is true), then the following statistic

\begin{equation} z = \frac{p_T - p_C}{\sqrt{p\left(1 - p\right)\left(\frac{1}{n_T} + \frac{1}{n_C}\right)}} \end{equation}

will be approximately normally distributed as the number of observations grows large (where \(p_T\) is the observed conversion rate in the test group, \(p_C\) the observed conversion rate in the control group, \(p\) is the conversion rate in the test and control group combined, and \(n_T\) and \(n_C\) are the numbers of observations in the test and control groups).

What this means is that—if the null hypothesis is true—our \(z\)-statistic will follow the standard normal distribution. Now, if our calculated \(z\)-statistic falls far out in the tails of the standard normal distribution, we have reason to believe that the null hypothesis is not true—in other words, that we can reject the null hypothesis.

We usually say that if the value of the \(z\)-statistic falls outside of the range where 95% of the values from a standard normal distribution fall, we reject the null hypothesis at the 5% significance level; if it falls outside the 99% range, we reject the null at the 1% level. If it does not fall outside that range, we fail to reject the null at that level.

One implication of this is that if the null were true, the value of our statistic would still fall outside the 95% interval 5% of the time (by definition), and so we would erroneously reject the null 5% of the time. This is what is usually meant by saying that a test is significant at the 5% level: we rejected the null based on our data (but this would happen 5% of the time even if the null were actually true).
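The mechanics above fit in a few lines of code. A sketch with invented counts (60/1000 conversions in test, 50/1000 in control):

```python
import math

def z_statistic(conv_t, n_t, conv_c, n_c):
    """Two-proportion z-statistic, exactly as in the formula above."""
    p_t = conv_t / n_t
    p_c = conv_c / n_c
    p = (conv_t + conv_c) / (n_t + n_c)   # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n_t + 1 / n_c))
    return (p_t - p_c) / se

# Hypothetical counts: 60/1000 conversions in test, 50/1000 in control.
print(z_statistic(60, 1000, 50, 1000))
```

Here \(z \approx 0.98\), well inside the 95% range, so we would fail to reject the null.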

Why we prefer Bayesian methods

We prefer Bayesian methods for two main reasons. Firstly, our end result is a probability distribution, rather than a point estimate. Instead of having to think in terms of \(p\)-values, we can think directly in terms of the distribution of possible effects of our treatment. For example, if only 2% of the values of the posterior distribution lie below 0.05, we have 98% confidence that the conversion rate is above 0.05; if 70% of the values lie above 0.1, we have 70% confidence that the conversion rate is above 0.1. This makes the results of the analysis much easier to understand and communicate, not to mention integrate into more complex objective functions.
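Those probability statements fall straight out of the posterior. A minimal sketch using an invented Beta posterior; the 0.05 and 0.1 thresholds mirror the text, while the posterior parameters are made up:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical posterior over the conversion rate after an experiment.
posterior = rng.beta(80, 920, size=200_000)

print(f"P(rate > 0.05): {(posterior > 0.05).mean():.3f}")
print(f"P(rate > 0.10): {(posterior > 0.10).mean():.3f}")
```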

Secondly, using an informative prior allows us to alleviate many of the issues that plague classical significance testing. In particular, Bayesian methods are robust to two important problems that crop up often in A/B testing.

  1. Repeated testing

    The first such problem is repeated testing: running and re-running our analysis as the data come in. We may have good reasons to do so: we would prefer to upgrade our users to the better version as soon as possible (if we know that version B is much better, we should switch over as soon as we can and not waste weeks gathering data).

    However, doing so severely weakens the guarantees we get from significance testing. Let’s suppose that we use the conventional 5% significance level, and so we expect to see a false positive result in 5 out of every 100 tests, and that we repeatedly run the test on successive batches of data from our test, stopping the test at the first sign of significance.

    The first time we run the test, there is a \(0.05\) probability of getting a false positive; the probability that we do not get a false positive is \(0.95\). On the second test, the probability of avoiding a false positive on that test alone is still \(0.95\)—but the probability that we have avoided a false positive in both of the tests we have run is now \(0.95 * 0.95 = 0.9025\). On the third test, this becomes \(0.95 * 0.95 * 0.95 \approx 0.857\)—\(0.95^n\) on the \(n\)-th test. Supposing we are really impatient and peek ten times, the probability that we do get a false positive is \(1 - 0.95^{10} \approx 0.40\): a very far cry from the 5% probability we expected.

    Usually, we would not actually run our tests on independent samples of data from the test, but rather run repeated tests on a growing dataset as the data comes in. There, the problem is less pronounced, but still exists: a quick numerical simulation suggests that running 10 such tests at the 5% significance level would give us a false positive rate of around 20%. The problem gets worse as we increase the frequency of testing, until, at the limit, running a test with every incoming datapoint will give us an extremely high false positive rate. This might be a particularly big problem for analytics providers offering real-time significance values as part of their analytics dashboards.

  2. Low base rate problem

    The second problem is significance testing where the chance of succeeding is low. It is reasonable to suppose that among all the possible changes we could test very few measurably improve our results: we could easily write hundreds of variations of one particular piece of text on the website (‘Don’t miss out, click now!’, ‘Click now to access exclusive discounts’), but very few of them will actually lead to marked changes in user behaviour.

    To illustrate, suppose we believe that 1 in 20 of the variations we test is a true improvement. If we test 100 changes, we should expect that 5 of them will be successful: suppose we identify all of those. But if we are testing at the 5% significance level, we will erroneously find \(0.05 * (100 - 5) = 4.75\) additional tests to be successes—just by chance. This means that roughly half of all the tests we declared to be successful were false positives: not a great result.

    This is despite the fact that there is nothing wrong with our statistical method: it’s just that there are far more failures than successes, and a large number of null rejections will be false positives.
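Both effects above are easy to reproduce numerically. Below is a minimal sketch of the repeated-testing problem: the analytic compounding for independent batches, and a Monte Carlo estimate for peeking at a growing A/A dataset. All rates, sample sizes, and trial counts are our own invented choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# Peeking at ten independent batches at the 5% level:
print(f"P(false positive, 10 independent tests): {1 - 0.95 ** 10:.3f}")

def peeking_false_positive_rate(n_trials=2000, n_max=1000, n_checks=10, p=0.05):
    """Simulate an A/A test (no true difference) and run the z-test at
    evenly spaced checkpoints, stopping at the first 'significant' result.
    Returns the fraction of trials that ever look significant."""
    checkpoints = np.linspace(n_max // n_checks, n_max, n_checks, dtype=int)
    false_positives = 0
    for _ in range(n_trials):
        a = rng.random(n_max) < p
        b = rng.random(n_max) < p
        for n in checkpoints:
            s_a, s_b = a[:n].sum(), b[:n].sum()
            pooled = (s_a + s_b) / (2 * n)
            if pooled == 0 or pooled == 1:   # no variation yet; skip
                continue
            se = np.sqrt(pooled * (1 - pooled) * (2 / n))
            z = (s_a / n - s_b / n) / se
            if abs(z) > 1.96:                # 'significant' at the 5% level
                false_positives += 1
                break
    return false_positives / n_trials

print(f"false positive rate with peeking: {peeking_false_positive_rate():.3f}")
```

Even though every trial draws both groups from the same distribution, a substantial fraction of them look significant at some checkpoint.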

(See also this white paper for an excellent alternative treatment of these problems.)

Both of these problems are much less pronounced when we use an informative prior. Intuitively, this is because a Bayesian posterior (or its mean) is far more stable than the corresponding frequentist (significance testing) estimate. Especially at small sample sizes, the posterior is close to the prior.

We can see this by way of numerical simulations. To do so, we draw two samples from a Bernoulli distribution (yes/no, tails/heads), compute the \(p\) parameter (probability of heads) estimates for each sample, and then take their difference. We repeat this after every observation until we accumulate 500 observations. Ideally, the estimates should remain close to 0 at all times: after all, we are drawing values from the same distribution. In reality, this is what we obtain:

For small sample sizes, frequentist estimates are extremely variable; Bayesian estimates remain close to the prior.
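The simulation described above can be sketched as follows; the Beta(3, 50) prior is the one introduced earlier, while the sample size, base rate, and replication count are our own choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_abs_difference(n_obs=50, p=0.05, alpha=3.0, beta_=50.0):
    """Track the estimated difference between two identical Bernoulli
    streams and return the largest absolute excursion from zero for
    both the frequentist and the Bayesian (posterior mean) estimate."""
    a = (rng.random(n_obs) < p).cumsum()
    b = (rng.random(n_obs) < p).cumsum()
    n = np.arange(1, n_obs + 1)
    freq_diff = a / n - b / n
    bayes_diff = (alpha + a) / (alpha + beta_ + n) - (alpha + b) / (alpha + beta_ + n)
    return np.abs(freq_diff).max(), np.abs(bayes_diff).max()

results = np.array([max_abs_difference() for _ in range(200)])
print(f"average worst-case frequentist error: {results[:, 0].mean():.3f}")
print(f"average worst-case Bayesian error:    {results[:, 1].mean():.3f}")
```

Note that the two estimates share the same numerator (the difference in success counts); the prior only enlarges the denominator, which is exactly the shrinkage at work.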

Interestingly, we see a similar story for \(p\)-values. In the frequentist plot below, we draw the \(p\)-value for the null hypothesis that the two samples come from the same distribution. In the Bayesian equivalent we plot a pseudo-\(p\)-value, computed as double the posterior probability that the difference between the two samples lies on the less likely side of zero (this is very foreign to the Bayesian way of thinking, and we only do it to produce something comparable to \(p\)-values). Whenever the plotted value goes below the 0.05 line, we obtain statistical significance (in this case, something we can interpret as a false positive).

The \(p\)-values are extremely volatile. Many lines cross the 0.05 line, indicating the number of false positives we get. In contrast, we have a much smaller chance of false positives with an informative prior. (Another interesting story about these plots is that false positives in small samples are far more pernicious, because they suggest large effect sizes—as the number of samples grows larger the effect size estimates get smaller, so a false positive will have less potential to negatively affect our decision-making.)
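The pseudo-\(p\)-value can be computed by Monte Carlo from the two Beta posteriors. A sketch, assuming the Beta(3, 50) prior from earlier; the counts below are invented:

```python
import numpy as np

rng = np.random.default_rng(7)

def pseudo_p_value(s_a, f_a, s_b, f_b, alpha=3.0, beta_=50.0, draws=100_000):
    """Double the posterior probability that the difference between
    the two groups lies on the less likely side of zero."""
    post_a = rng.beta(alpha + s_a, beta_ + f_a, size=draws)
    post_b = rng.beta(alpha + s_b, beta_ + f_b, size=draws)
    p_positive = (post_a - post_b > 0).mean()
    return 2 * min(p_positive, 1 - p_positive)

# Identical data in both groups: the pseudo-p-value should be close to 1.
print(pseudo_p_value(50, 950, 50, 950))
```

As the two posteriors separate, the pseudo-\(p\)-value shrinks towards zero, mimicking the behaviour of a classical \(p\)-value.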

It is worth noting that there is nothing magical about Bayesian methods. The advantages described above are entirely due to using an informative prior. If instead we used a flat (or uninformative) prior—where every possible value of our parameters is equally likely—all the problems would come back.

It is true that there are many methods of correcting these problems in hypothesis testing. We could, for instance, use some sort of multiple comparison adjustment (like the Bonferroni correction). We could arbitrarily adjust our significance level. Or try to marry frequentist approaches with some sort of prior on the estimates, like replacing our sample estimates of \(p\) by

\begin{equation} p = \frac{\alpha + \mathrm{no.\ successes}}{\alpha + \beta + \mathrm{no.\ successes} + \mathrm{no.\ failures}} \end{equation}

(where \(\alpha\) and \(\beta\) correspond to the Beta distribution parameters) and use that to calculate the \(z\)-statistic.
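A sketch of that hybrid: the shrunk estimate is simply the posterior mean under the Beta prior, pulling the raw proportion towards the prior mean (the counts below are invented):

```python
def shrunk_rate(successes, failures, alpha=3.0, beta_=50.0):
    """Posterior-mean estimate under a Beta(alpha, beta) prior:
    the raw proportion pulled towards the prior mean."""
    return (alpha + successes) / (alpha + beta_ + successes + failures)

# Hypothetical counts: the raw rates are 10/100 and 5/100.
print(shrunk_rate(10, 90))   # pulled from 0.10 towards the prior mean of ~0.057
print(shrunk_rate(5, 95))    # pulled slightly up from 0.05
```

These shrunk rates could then be plugged into the \(z\)-statistic in place of the raw sample proportions.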

But none of them seem as elegant and straightforward as specifying a reasonable prior—and none of them provide as good a description of uncertainty inherent in the analysis. They are also far less flexible: more advanced Bayesian methods (such as Markov Chain Monte Carlo methods) allow straightforward inference using even very complex models.

You too should use Bayesian methods

Bayesian inference is robust and fun. You too should try it out.

For a small taste, head over to our interactive Bayesian inference machine here.

If you would like to learn more straight away, this is a fantastic introduction to the topic, using Python and PyMC.

Follow the discussion on Hacker News.