AB Testing With Continuous Outcomes (And Horribly Misspecified Priors)
As a follow up to our previous post on Bayesian A/B testing, and by way of light entertainment, we decided to share one amusing mistake we’ve made when analyzing a spate of recent A/B tests.
This is our second post on Bayesian A/B testing. If you are interested in the topic, have a look at our previous post, or at Bayesian Methods For Hackers for a great and accessible treatment of the subject (and more!).
The data
Unlike in our previous blog post, where we looked at analyzing binary (conversion/no conversion) outcomes from an A/B test, in this case we would like to look at continuous outcomes. Let’s assume that we have carried out a properly randomized A/B test, and are interested in the effect of the treatment (the change being tested) on a continuous outcome, such as time on site or order value.
We can often model such data as log-normally distributed: all observations are greater than zero, and while the bulk of the data lie in a small range, there is also a fairly long right tail of extreme observations. For example, the data might look like this:
For analytical convenience we can transform log-normally distributed data into normally distributed data by taking the natural logarithm of the observations, giving us something like the following:
A normal distribution like this is described using two parameters: the mean \(\mu\) and the standard deviation \(\sigma\). While the relationship between the mean of the transformed data and the mean of the original data is slightly tricky, for simplicity we assume that all we care about here is estimating the difference between the \(\mu\) parameters of our test and control groups. (For those interested, this is a nice source for a more in-depth treatment).
To keep things simple, we’ll generate some synthetic data and pretend these are observations from the test; that way, we know the true values of the parameters and can evaluate our statistical procedure.
Let’s assume we have a 10% treatment group (where the mean is 1.2) and a 90% control group (with a mean of 1.0), giving us a 20% positive treatment effect in our log outcome.
Analysis with uniform priors
At first glance, it looks like we could model our data as a mixture of two normal distributions: we can imagine that there are two distributions, the test and control distributions, and our overall dataset is a mixture of the two:
More formally, letting \(\tau\) be 1 if an observation is in the test group, and 0 if in control, we have
\begin{equation} y \sim \mathcal{N}(\tau \mu_t + (1 - \tau)\mu_c, \sigma^2) \end{equation}
What this means is that we model the data as normally distributed with mean \(\mu_t\) if the datum is in the test group and \(\mu_c\) if in control group, and with standard deviation \(\sigma\).
To complete the model, we also need to specify our prior beliefs about what the parameters could be: a probability distribution that summarizes our beliefs about what the parameters are before we see any data.
The model will then combine the information in the prior and the information in the data together into a posterior distribution of what the parameters might be: a probability distribution describing the possible values of the parameters after we have looked at the data (but still given our priors).
As a first step, we’ll put uniform priors on the means, and an inverse-gamma prior on the standard deviation.
\begin{align}
\mu_t, \mu_c &\sim \mathcal{U}(0, 100) \
\sigma &\sim \text{InverseGamma}(1.1, 1.1)
\end{align}
This means that, a priori, we believe that the mean parameters \(\mu\) are equally likely to lie anywhere between 0 and 100. What this amount to is saying that ‘We don’t have any opinion about what the means are—except that they no less than 0 and no more than 100’.
(Roughly, this amounts to what is called an ‘uninformative prior’: a starting point for the analysis where we intentionally specify a vague prior that encodes less information we actually have about the problem. For a nice summary of priors in general you could go here is, and this is a useful discussion of the dangers of vague/flat/uninformative priors.)
Let’s set it up using PyMC.
Note that we also set up a treatment_effect
variable, which will directly give us posteriors for the quantity of interest, the difference between the test and the control mean (the treatment effect).
When we run the function above, we will obtain a samples from the posterior distribution of \(\mu_t\), \(\mu_c\), and \(\sigma\): likely values of the parameters given our data and our prior beliefs.
In this simple example, we can look at the means of the to get our estimates. (Of course, we should in general examine the entire posterior and posterior predictive distribution.)
Generating 5000 data points and fitting the model gives us the following posterior statistics:
Control Mean | Sigma | Treatment Effect | Treatment Mean | |
---|---|---|---|---|
Count | 7500.000000 | 7500.000000 | 7500.000000 | 7500.000000 |
Mean | 0.994957 | 0.400233 | 0.194541 | 1.189498 |
Std | 0.006070 | 0.004057 | 0.016286 | 0.015642 |
Min | 0.966622 | 0.386997 | 0.135829 | 1.128991 |
25% | 0.990763 | 0.397576 | 0.184093 | 1.179193 |
50% | 0.994837 | 0.400233 | 0.194276 | 1.189007 |
75% | 0.999000 | 0.402970 | 0.205679 | 1.199526 |
Max | 1.018842 | 0.413570 | 0.260544 | 1.243863 |
Excellent! We get accurate estimates of our quantities of interest. In this sense, this simple model gives us a credible result. (For brevity, I omit plotting the trace and posterior distributions.)
Adding informative priors
One of the great things about Bayesian methods is how easy it is to integrate informative priors. In general, using uniform priors is probably not a good idea: we often have a good idea of what, say, our average order value is, and we would like to incorporate that in our analysis.
In this case, let’s suppose we have good prior evidence (say, based on months of historical data) that the mean is somewhere around 0.8. This is lower than we observe in our sample: perhaps that’s because the test data were gathered in a period which for whatever reason was good for us (sale season!).
Still, we have good reason to believe that the mean should be centred around 0.8, not be much lower, and not be much higher. We can and express it in the model by putting a normal prior on the means: \begin{equation} \mu_t, \mu_c \sim \mathcal{N}(0.8, 1.0) \end{equation} In this case, we make our prior distribution fairly wide (standard deviation of 1). This will make utterly incredible estimates (such as a mean of 80) unlikely, but will not be so strong as to unduly influence the estimates in a reasonable range. In Python,
After running the model we get the following posterior statistics:
Control Mean | Sigma | Treatment Effect | Treatment Mean | |
---|---|---|---|---|
Count | 7500.000000 | 7500.000000 | 7500.000000 | 7500.000000 |
Mean | 0.995093 | 0.400223 | 0.193969 | 1.189063 |
Std | 0.005739 | 0.003945 | 0.019562 | 0.018818 |
Min | 0.974322 | 0.385248 | 0.127149 | 1.125384 |
25% | 0.991231 | 0.397655 | 0.182318 | 1.177901 |
50% | 0.995040 | 0.400231 | 0.195515 | 1.190218 |
75% | 0.998835 | 0.402898 | 0.207246 | 1.201171 |
Max | 1.015863 | 0.413722 | 0.250959 | 1.245565 |
Not much of a difference—with a weak prior, the influence of the prior is low, and we recover estimates very close to the model with uniform priors.
Strongly informative priors
Often, however, we’d like to use much stronger priors. It is not unreasonable to believe that very few A/B tests are either wildly successful or utterly catastrophic: most just don’t matter very much. We can formalize this by using a (much) stronger prior. In this case, we use a much narrower prior for the means: \begin{equation} \mu_t, \mu_c \sim \mathcal{N}(0.8, 0.01) \end{equation} What we hope this would give us is a much more conservative estimate of the treatment effect: the stronger the prior, the smaller the treatment effect estimates.
But running the model as specified above gives us a nasty surprise:
Control Mean | Sigma | Treatment Effect | Treatment Mean | |
---|---|---|---|---|
Count | 7500.000000 | 7500.000000 | 7500.000000 | 7500.000000 |
Mean | 0.941591 | 0.414281 | -0.057265 | 0.884326 |
Std | 0.005565 | 0.004222 | 0.010284 | 0.008971 |
Min | 0.919392 | 0.400419 | -0.093331 | 0.855266 |
25% | 0.938124 | 0.411471 | -0.064082 | 0.877947 |
50% | 0.941667 | 0.414246 | -0.057549 | 0.883966 |
75% | 0.945203 | 0.417116 | -0.050763 | 0.890239 |
Max | 0.960176 | 0.428461 | -0.021777 | 0.910394 |
The estimated treatment effect is negative. What gives?
Rubbish in, rubbish out
It seems that our inference is deeply flawed if we specify strong priors: the entire posterior of the treatment effect lies below zero, even though we know that the true treatment effect is positive. Why is this?
It seems that there are several factors at play:
- We have misspecified our priors. We specified a strong prior on the mean parameters when what we really wanted is to impose a strong prior on the treatment effect.
Ordinarily, that might not be so bad for our inference. In this case, however, we have two additional factors:
- The prior mean of the \(\mu\) distribution is lower than the sample mean.
- The treatment and control group are imbalanced: the treatment group is much smaller.
It looks like there is sufficient data in the control group to partially overcome the effect of the prior and shift the posterior mean towards one. Because the test group is much smaller, the influence of the prior is stronger, and the posterior moves towards the true value by a smaller amount. This results in a negative estimate for the treatment effect.
Additional testing reveals that (as suspected) the effect is reversed if the prior mean is higher than the sample means, and the problem disappears when groups assignments are balanced. As everywhere else, rubbish in, rubbish out: in this case, a misspecified prior leads to bad inference.
All this is, of course, blindingly obvious in hindsight, but not impossible to run into in practice. (Which is to say, I run into this when looking at one of our recent A/B tests. In this case, the treatment effect was overestimated—fortunately, I came to my senses before we made any decisions based on the spurious result.)
Fixing the model
Fortunately, it is quite easy to fix the model specification. Instead of estimating treatment and control means and then deriving a posterior for the treatment effect, we can estimate (and put suitable priors on) the base mean and the treatment effect directly. Denoting the treatment effect as \(\beta\) and its standard deviation as \(\sigma_\beta\):
\begin{align}
\mu_c &\sim \mathcal{N}(0.8, 1.0) \
\beta &\sim \mathcal{N}(0.0, \sigma_\beta^2) \
y &\sim \mathcal{N}(\mu_c + \tau\beta, \sigma^2)
\end{align}
In Python,
Running the model with weak-ish priors (\(\sigma_\beta = 1\)) gives:
Control Mean | Sigma | Treatment Effect | Treatment Mean | |
---|---|---|---|---|
Count | 7500.000000 | 7500.000000 | 7500.000000 | 7500.000000 |
Mean | 0.995448 | 0.400310 | 0.190892 | 1.186340 |
Std | 0.005714 | 0.004024 | 0.020015 | 0.019013 |
Min | 0.976170 | 0.387477 | 0.129810 | 1.116335 |
25% | 0.991638 | 0.397521 | 0.176442 | 1.172825 |
50% | 0.995335 | 0.400330 | 0.189525 | 1.185146 |
75% | 0.999170 | 0.403079 | 0.204043 | 1.199024 |
Max | 1.017452 | 0.416353 | 0.254601 | 1.249787 |
Running the model with a more conservative prior (\(\sigma_\beta = 0.01\)) gives a substantially smaller (but still positive) estimate of the treatment effect:
Control Mean | Sigma | Treatment Effect | Treatment Mean | |
---|---|---|---|---|
count | 7500.000000 | 7500.000000 | 7500.000000 | 7500.000000 |
mean | 1.009942 | 0.402598 | 0.040981 | 1.050923 |
std | 0.005537 | 0.004091 | 0.008809 | 0.009919 |
min | 0.990586 | 0.387995 | 0.010731 | 1.014541 |
25% | 1.006405 | 0.399807 | 0.035244 | 1.044171 |
50% | 1.010280 | 0.402453 | 0.040800 | 1.050996 |
75% | 1.013555 | 0.405243 | 0.046558 | 1.057476 |
max | 1.028570 | 0.418111 | 0.074561 | 1.089143 |
This is what we were after. Specifying an informative prior on the treatment effect directly manages to accurately express our prior beliefs. (Note that this is just Bayesian regression of our variable of interest on an intercept and an indicator variable for group assignment.)
Takeaway points
Again, all of this is very obvious: no method is magical, mistakes are easy to make, and there is no substitute for care (and thorough model checking).
If you find this interesting (or believe us simpletons in need of an education), we’re hiring!
Discuss this post on Hacker News