# How a Simple Bayesian Test Could Have Rescued a Famous Clinical Trial

One of the features that we have recently added to JASP is a Bayesian “A/B test”, that is, a test for the equality of two binomial proportions. This test is especially popular in the analysis of clinical trial data, where the proportion of medical successes (or failures) from a treatment group is contrasted against those from a control group.

The first Bayesian A/B test (also implemented in JASP, under “Frequencies -> Bayesian Contingency Tables”) was developed by Jeffreys (1935), but it is arguably not the most elegant attempt; for instance, Jeffreys’s test assumes that knowledge of one proportion does not affect knowledge of the other proportion (i.e., the proportions are assigned independent prior distributions). A Bayesian A/B test more in line with Jeffreys’s own statistical philosophy was proposed by Kass & Vaidyanathan (1992), henceforth “KV”. The KV test is a Bayesian logistic regression with “condition” (i.e., treatment vs. control) coded as a dummy predictor. Last year we created an R package, abtest, that implements the KV test (Gronau et al., 2019), and the functionality of this package has now been made available in JASP.

The Bayesian advantages of the KV test are discussed in the abtest paper. Here we’ll just showcase the use of the A/B test in JASP with a concrete example.

## A Famous Clinical Trial

In a must-read article, Robinson (2019) provides a vivid example of a statistical conundrum surrounding a famous clinical trial. As stated by Robinson, “The following example discusses a conclusion of great practical importance which was reported to be “statistically significant” but subsequently found not to be “statistically significant.””

“The data from Rossouw et al. (2002, Table 2) shown in Table 1 arose from a longitudinal study of the effects of hormone replacement therapy (HRT). Fisher’s exact test on this 2 × 2 table gives a two-tailed p-value of 0.0369. This is less than 0.05, so the data were considered to provide “statistically significant” evidence that HRT increases the rate of coronary heart disease (CHD). The finding was worldwide news. (…) The study was terminated on the grounds that it would be unethical to further expose participants to the apparently increased risk of CHD. Some additional data arrived after the decision were [sic] made to terminate the study, as discussed in Manson et al. (2003). CHD was observed in 24 additional subjects on HRT and 25 additional subjects on placebo. The final (2003) version of the data is shown in Table 2. For this version of the data, the two-tailed p-value is 0.0771, so the evidence that HRT increases the rate of CHD was no longer “statistically significant.”” (Robinson, 2019, p. 243)
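The two p-values in the quotation can be checked with a short, dependency-free sketch of Fisher’s exact test. Note that the cell counts used here are an assumption: they are the counts we recall from the cited trial reports (164 of 8506 CHD cases on HRT and 122 of 8102 on placebo in 2002) and are not stated in the text above; only the 24 and 25 additional cases come from the quotation. The function name and the tie tolerance are likewise illustrative choices.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = log_choose(row1 + row2, col1)

    def log_pmf(x):
        return log_choose(row1, x) + log_choose(row2, col1 - x) - log_denominator

    log_observed = log_pmf(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        exp(lp)
        for lp in (log_pmf(x) for x in range(lowest, highest + 1))
        if lp <= log_observed + 1e-7  # small tolerance for floating-point ties
    )


# ASSUMED counts (CHD, no CHD); first row is HRT, second row is placebo.
p_2002 = fisher_exact_two_sided(164, 8342, 122, 7980)
# 2003 version: 24 additional CHD cases on HRT, 25 on placebo (from the quotation).
p_2003 = fisher_exact_two_sided(164 + 24, 8342 - 24, 122 + 25, 7980 - 25)
print(f"2002: p = {p_2002:.4f}; 2003: p = {p_2003:.4f}")
```

If the assumed counts are right, the first p-value should fall just below .05 and the second just above it, which is the flip that Robinson describes.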

The Robinson tables are here:

[Tables 1 and 2 of Robinson (2019): the 2002 and 2003 versions of the HRT data]

## Bayesian Reanalysis of the 2002 Data that Prompted Trial Termination

First we open the data file containing four numbers in JASP:

[Screenshot: the data file opened in JASP]

We then go to “Frequencies” and select “Bayesian A/B test”. After dragging each of the four variables into the specification boxes, the input panel looks as follows:

[Screenshot: input panel of the Bayesian A/B test]

Before we discuss the output, note that Group 1 is the placebo group, Group 2 is the HRT group, and “successes” are defined as the absence of CHD. If HRT increases the risk of CHD, the proportion of successes in the placebo group should be higher than that in the HRT group. For a rough impression of the data we can tick “Descriptives” and obtain the following table:

[Table: descriptives for the 2002 data]

It is evident that (luckily) the incidence of CHD is very low. The proportion of participants without CHD is 0.9849 in the placebo group and 0.9807 in the HRT group, so there are fewer successes in the HRT group (i.e., more cases of CHD). Is this difference of 0.42 percentage points statistically compelling? The p-value tells us that the test statistic, or a more extreme one, would not be very likely if the null hypothesis were true; this does not, however, imply that the data provide evidence against H0. In JASP, the default Bayesian A/B analysis yields the following table output:

[Table: model comparison for the 2002 data]

This table shows the predictive performance of three competing hypotheses. The first row shows the null hypothesis. It was assigned a prior model probability of 0.50, and the data have not caused that to change much: the posterior model probability is 0.4989. The second row shows the one-sided alternative hypothesis that predicts the effect of HRT to be harmful (i.e., fewer successes in the HRT group). This hypothesis is qualitatively in line with the data, and its prior probability of 0.25 is increased almost by a factor of two, to 0.4915. The third row shows the complementary alternative hypothesis, the one that predicts the effect of HRT to be helpful. This possibility is contradicted by the data, and the prior model probability of 0.25 is decreased to a posterior model probability of 0.0096.

In sum, this analysis shows that the data provide only weak evidence for the hypothesis that HRT is harmful (versus the null hypothesis): the data are only 1.97 times more likely under the “HRT causes CHD” hypothesis than under the null hypothesis. These data do not provide good reason to stop the trial, and the conclusion that these data warrant the rejection of the null hypothesis is premature. This is yet another demonstration of the phenomenon that p-just-below-.05 findings are evidentially weak.
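The Bayes factor of 1.97 can be recovered from the prior and posterior model probabilities reported above, because a Bayes factor is the ratio of posterior odds to prior odds. The following minimal sketch does the arithmetic; the function name is ours and is not part of JASP or abtest.

```python
def bayes_factor(prior_a, posterior_a, prior_b, posterior_b):
    """Bayes factor for model A over model B: posterior odds divided by prior odds."""
    posterior_odds = posterior_a / posterior_b
    prior_odds = prior_a / prior_b
    return posterior_odds / prior_odds


# Model probabilities for the 2002 data, as reported in the text.
bf_harmful_vs_null = bayes_factor(
    prior_a=0.25, posterior_a=0.4915,  # "HRT is harmful"
    prior_b=0.50, posterior_b=0.4989,  # null hypothesis
)
print(round(bf_harmful_vs_null, 2))  # 1.97
```

Because the prior odds are divided out, the same Bayes factor would result from any other choice of prior model probabilities.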

Now if we choose to ignore the null hypothesis and assess the probability that the effect is harmful rather than helpful, the conclusion is very different. This can already be intuited by plotting the prior and posterior distribution of the log odds ratio under a two-sided alternative hypothesis (one tick in the JASP GUI):

[Figure: prior and posterior distribution of the log odds ratio for the 2002 data]

The easiest way to obtain the Bayes factor for the direction of the effect (given that it exists!) is to change the prior model specification under “Advanced Options”, like this:

[Screenshot: prior model probabilities under “Advanced Options”]

This then yields the following output table:

[Table: model comparison for the 2002 data with the null hypothesis excluded]

The table indicates that the data are almost 50 times more likely under the hypothesis that HRT is harmful rather than helpful. This is strong evidence, but it is emphatically not the answer to the clinical question of interest.

In conclusion: the data provide some evidence against the null hypothesis, but it is “not worth more than a bare comment” (Jeffreys, 1939; see also this cartoon). From this Bayesian perspective, the trial should not have been stopped.

## But What About the Prior?

The above analyses used two different “priors”. The first one is the prior on the model plausibility. As demonstrated above, this can be used to exclude particular models (by setting the prior plausibility to zero). The prior model plausibility does not, however, affect the evidence, that is, the relative predictive performance of the rival models.

The second prior involves the expectations of effect size under the alternative hypotheses. As explained in the abtest paper, the default setting is to assign the log odds ratio a standard normal distribution. This distribution can be inspected by ticking the box “Prior”, which yields:

[Figure: default prior distribution of the log odds ratio]

In order to assist intuition, this prior distribution may also be shown on a different scale. JASP offers a range of different options:

[Screenshot: options for displaying the prior distribution]

For instance, selecting “p1&p2” yields a heatmap:

[Figure: heatmap of the joint prior distribution of p1 and p2]

This representation makes it clear that in the KV test (and unlike the original Jeffreys 1935 test), the two proportions are dependent.
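The dependence between the two proportions can also be seen by simulating from the prior. The sketch below assumes the parameterization that we understand the abtest paper to use: a grand mean β and a log odds ratio ψ on the logit scale, with logit(p1) = β − ψ/2 and logit(p2) = β + ψ/2. The standard normal prior on β and the parameter names are assumptions made for illustration; only the standard normal prior on the log odds ratio is stated in the text.

```python
import math
import random


def inverse_logit(x):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_prior_proportions(n_draws, mu_psi=0.0, sigma_psi=1.0,
                             mu_beta=0.0, sigma_beta=1.0, seed=2019):
    """Draw (p1, p2) pairs implied by normal priors on beta and psi."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_draws):
        beta = rng.gauss(mu_beta, sigma_beta)  # ASSUMED prior on the grand mean
        psi = rng.gauss(mu_psi, sigma_psi)     # prior on the log odds ratio
        pairs.append((inverse_logit(beta - psi / 2.0),
                      inverse_logit(beta + psi / 2.0)))
    return pairs


def pearson_correlation(pairs):
    """Plain Pearson correlation between the two coordinates."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


prior_pairs = sample_prior_proportions(20000)
prior_correlation = pearson_correlation(prior_pairs)
print(f"Prior correlation between p1 and p2: {prior_correlation:.2f}")
```

Because both proportions share the same β, the sampled pairs are positively correlated; under independent priors on p1 and p2 the correlation would be close to zero.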

The default analysis specifies a standard normal prior distribution on the log odds ratio scale because it seems roughly consistent with the size of effects that are often reported in the medical literature. Of course, JASP allows users to change this prior distribution by adjusting the mean and standard deviation of the prior on the log odds ratio:

[Screenshot: settings for the mean and standard deviation of the prior on the log odds ratio]

Changing the prior expectation (as formalized by the prior distribution) is often sensible when applying the A/B test to problems in industry; for instance, the effect of website changes on conversion rates is often known to be very modest. In the next version of JASP we will provide the option to conduct an automatic robustness analysis, where parameters mu and sigma are independently varied across a user-specified range (for an example see this blog post).
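To give a feel for what such a robustness analysis does, here is a rough sketch that recomputes the “harmful versus null” Bayes factor for a few values of sigma. It uses brute-force grid integration, which is not the method used by abtest or JASP, and it rests on several assumptions: the logit parameterization with grand mean β and log odds ratio ψ, a standard normal prior on β, and the 2002 counts as we recall them from the trial report (they are not given in the text above). All function and variable names are ours.

```python
from math import erf, exp, log, log1p, pi, sqrt


def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * log(2.0 * pi) - log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def log_likelihood(beta, psi, y1, n1, y2, n2):
    """Binomial log-likelihood (constants dropped), assuming
    logit(p1) = beta - psi / 2 and logit(p2) = beta + psi / 2."""
    total = 0.0
    for eta, y, n in ((beta - psi / 2.0, y1, n1), (beta + psi / 2.0, y2, n2)):
        total += -y * log1p(exp(-eta)) - (n - y) * log1p(exp(eta))
    return total


def bf_harmful_vs_null(y1, n1, y2, n2, mu_psi=0.0, sigma_psi=1.0,
                       mu_beta=0.0, sigma_beta=1.0, n_beta=200, n_psi=300):
    """Grid approximation of the Bayes factor for psi < 0 versus psi = 0."""
    logit1, logit2 = log(y1 / (n1 - y1)), log(y2 / (n2 - y2))
    beta_hat, psi_hat = 0.5 * (logit1 + logit2), logit2 - logit1
    # Subtract the maximum log-likelihood to avoid numerical underflow.
    offset = log_likelihood(beta_hat, psi_hat, y1, n1, y2, n2)

    step_beta, step_psi = 1.0 / n_beta, 1.5 / n_psi
    # Midpoint grids: beta within 0.5 of its estimate, psi in (-1.5, 0).
    betas = [beta_hat - 0.5 + (i + 0.5) * step_beta for i in range(n_beta)]
    psis = [-(j + 0.5) * step_psi for j in range(n_psi)]
    # The one-sided prior is the normal prior truncated to psi < 0.
    log_mass = log(0.5 * (1.0 + erf((0.0 - mu_psi) / (sigma_psi * sqrt(2.0)))))
    log_prior_psi = [log_normal_pdf(p, mu_psi, sigma_psi) - log_mass for p in psis]

    marginal_null = 0.0
    marginal_harmful = 0.0
    for beta in betas:
        log_prior_beta = log_normal_pdf(beta, mu_beta, sigma_beta)
        marginal_null += exp(
            log_likelihood(beta, 0.0, y1, n1, y2, n2) - offset + log_prior_beta)
        for psi, lp_psi in zip(psis, log_prior_psi):
            log_term = (log_likelihood(beta, psi, y1, n1, y2, n2)
                        - offset + log_prior_beta + lp_psi)
            marginal_harmful += exp(log_term) * step_psi
    # The beta step size appears in both marginals and cancels in the ratio.
    return marginal_harmful / marginal_null


# ASSUMED 2002 counts; "successes" are participants without CHD.
placebo_ok, placebo_n = 7980, 8102  # Group 1
hrt_ok, hrt_n = 8342, 8506          # Group 2
robustness = {
    sigma: bf_harmful_vs_null(placebo_ok, placebo_n, hrt_ok, hrt_n, sigma_psi=sigma)
    for sigma in (0.5, 1.0, 2.0)
}
for sigma, bf in robustness.items():
    print(f"sigma = {sigma}: BF(harmful vs. null) = {bf:.2f}")
```

The grid limits are hard-coded for a large trial with a small effect, so the sketch should not be reused for other data without checking that the grids cover the bulk of the posterior. For real analyses, the abtest package or JASP is the appropriate tool.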

## Bayesian Reanalysis of the Final 2003 Trial Data

The 2003 trial data were only a little different from the 2002 trial data. In JASP we first load the 2003 data:

[Screenshot: the 2003 data file opened in JASP]

We drag the variables into their specification boxes, and obtain the following descriptive information:

[Table: descriptives for the 2003 data]

The placebo group still boasts a higher proportion of “successes” (non-CHD cases). Its advantage is reduced ever so slightly, from the earlier 0.42 percentage points to 0.40 percentage points, a minuscule change. Yet, using a threshold alpha value of .05, in the 2002 version of the data the result was statistically significant and prompted the trial to be terminated; in the 2003 version of the data, which is only very slightly different, the result is not statistically significant, and, presumably, the trial would not have been stopped at that point. One problem here (but by no means the only one) is that a continuous p-value is discretized in order to drive one of two radically different “decisions”: reject H0 or maintain H0. If the p-value is near the decision threshold, it is more prudent to withhold judgment. We agree with Robinson (2019, p. 246) who writes: “In my opinion, inference (as opposed to merely choosing) must always allow the additional option of stating that the evidence is not sufficiently strong for a reliable choice to be made between the hypotheses.”

The similarity between the 2002 data and the 2003 data is underscored by plotting the prior and posterior distribution for the log odds ratio under a two-sided alternative hypothesis:

[Figure: prior and posterior distribution of the log odds ratio for the 2003 data]

The model comparison table shows the following result:

[Table: model comparison for the 2003 data]

The first row shows that the data have made the null hypothesis somewhat more plausible than it was before; the second row shows that the same is true for the alternative hypothesis that states the effect of HRT to be harmful. The Bayes factor in favor of the “HRT is harmful” hypothesis over the null hypothesis is 1.0512, which means that the data did virtually nothing to change one’s reasonable beliefs about these two rival models.

However, just as before, one may ignore the null hypothesis and ask the question “if HRT has an effect, is it helpful or harmful?” Setting the prior model probability for the null hypothesis to 0 produces the following output table:

[Table: model comparison for the 2003 data with the null hypothesis excluded]

This shows that the data are more than 26 times more likely under the hypothesis that HRT is harmful rather than helpful. This echoes the result from the 2002 data.
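To see what a Bayes factor of this size means in terms of model probabilities, the sketch below converts Bayes factors into posterior model probabilities for a given set of prior model probabilities. It assumes that, once the null hypothesis is excluded, the prior mass is split equally between “harmful” and “helpful”, and it uses 26 as a lower bound because the text reports “more than 26”. The function name is ours.

```python
def posterior_model_probabilities(prior_probs, bayes_factors):
    """Combine prior model probabilities with Bayes factors.

    Both arguments are dicts keyed by model name; every Bayes factor is
    expressed against the same reference model.
    """
    weights = {model: prior_probs[model] * bayes_factors[model]
               for model in prior_probs}
    total = sum(weights.values())
    return {model: weight / total for model, weight in weights.items()}


# Null excluded; Bayes factors are relative to the "helpful" hypothesis.
posterior = posterior_model_probabilities(
    prior_probs={"harmful": 0.5, "helpful": 0.5},     # ASSUMED equal split
    bayes_factors={"harmful": 26.0, "helpful": 1.0},  # lower bound from the text
)
print(round(posterior["harmful"], 3))  # 0.963
```

Changing the prior model probabilities changes the posterior model probabilities but leaves the Bayes factor itself untouched.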

## Sequential Analysis

We now analyze the 2002 and 2003 data in sequence. The input data look like this:

[Screenshot: input data for the sequential analysis]

We then conduct the same analysis as before, but now select the option “Sequential analysis”. This results in the following graph:

[Figure: sequential analysis of the posterior model probabilities]

The probability wheels on top of the figure visualize the prior and posterior model probabilities. In the figure itself, the leftmost dots, situated at n=0, indicate the prior model probabilities; the rightmost dots indicate the posterior model probabilities. This particular experimental design is actually not ideal for demonstrating the usefulness of the sequential analysis, as the total sample size (which determines the x-axis) remains the same between the 2002 and 2003 versions (a fixed number of people are monitored and they either develop CHD or not). This peculiarity is why the rightmost dots are on top of one another instead of side by side.

## Conclusions

A default Bayesian A/B test of the famous HRT clinical trial, executed in JASP within mere seconds, showed that:

1. The evidence against the null hypothesis was never worth more than a bare comment, neither in the 2002 version (where p was .0369, statistically significant) nor in the 2003 version (where p was .0771, statistically not significant); the experiment therefore appears to have been stopped without a good statistical reason.
2. For both the 2002 and 2003 versions of the data there is strong evidence that the effect of HRT, should it exist, is harmful rather than helpful. This analysis presupposes that the null hypothesis can be ignored, something that is inconsistent with both the data (i.e., the null hypothesis predicts the data relatively well) and the scientific question of interest (i.e., the experiment was designed to convince a skeptical audience that there may or may not be an effect; the skeptical audience will entertain the null hypothesis, or a very similar peri-null hypothesis).
3. In the Bayesian framework, new data can continually update knowledge, without the need for advance planning — the incoming data mechanically transform the prior distribution to a posterior distribution and a corresponding Bayes factor, as uniquely dictated by Bayes’ theorem (see also Wagenmakers et al., 2019).
4. The above analysis (which can be further enriched by robustness analysis and informed prior distributions) provides information that a p-value does not. The fact that the medical community largely ignores the Bayesian approach presents a missed opportunity for doctors and a potential health risk for patients.

#### References

Gronau, Q. F., Raj, A., & Wagenmakers, E.-J. (2019). Informed Bayesian inference for the A/B test. Manuscript submitted for publication.

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203-222.

Jeffreys, H. (1939). Theory of Probability. Oxford: Oxford University Press.

Kass, R. E., & Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. Journal of the Royal Statistical Society, Series B, 54, 129-144.

Manson, J. E., Hsia, J., Johnson, K. C., Rossouw, J. E., Assaf, A. R., Lasser, N. L., Trevisan, M., Black, H. R., Heckert, S. R., Detrano, R., Strickland, O. L., Wong, N. D., Crouse, J. R., Stein, E., & Cushman, M. (2003). Estrogen plus progestin and the risk of coronary heart disease. The New England Journal of Medicine, 349, 523-534.

Robinson, G. K. (2019). What properties might statistical inferences reasonably be expected to have?—Crisis and resolution in statistical inference. The American Statistician, 73, 243-252.
Open access: https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1415971

Rossouw, J. E., Anderson, G. L., Prentice, R. L., LaCroix, A. Z., Kooperberg, C., Stefanick, M. L., Jackson, R. D., Beresford, S. A. A., Howard, B. V., Johnson, K. C., Kotchen, J. M., & Ockene, J. (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the Women’s Health Initiative randomized controlled trial. Journal of the American Medical Association, 288, 321-333.

Wagenmakers, E.-J., Gronau, Q. F., & Vandekerckhove, J. (2019). Five Bayesian intuitions for the stopping rule principle. Manuscript submitted for publication.

#### Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam. EJ guides the development of JASP.

#### Quentin Gronau

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam. At JASP, he is responsible for the t-tests and the binomial test.

#### Akash Raj

Akash is a software developer at JASP. He is responsible for the implementation of UI elements.