An Introduction to Jeffreys’s Bayes Factors With the SumStats Module in JASP: Part 1

In this blog post we elaborate on the ideas behind Harold Jeffreys’s Bayes factor and illustrate this test with the Summary Statistics module in JASP.

In a previous blog post we discussed the estimation problem, where the goal was to infer, from the observed data, the magnitude of the population effect. Before studying the size of an effect, however, we arguably first need to investigate whether an effect actually exists. Here we address the existence problem with a hypothesis test and we emphasize the difference between testing and estimation.

The outline of this blog post is as follows: Firstly, we discuss a hypothesis proposed in a recent study relating fungal infections to Alzheimer’s disease. This hypothesis is then operationalized within a statistical model, and we discuss Bayesian model learning in general, before we return to the Alzheimer’s example. This is followed by a comparison of the Bayes factor to other methods of inference, and the blog post concludes with a short summary.

Running example

To address the existence problem we compare two hypotheses. An example of a hypothesis is

    \[\mathcal{H}_{0}: \text{All Alzheimer's patients have fungal infections in their brains.}\]

This hypothesis was proposed by Pisa, Alonso, Rábano, Rodal, and Carrasco (2015), who conducted an experiment to assess the evidence for this null hypothesis. The alternative hypothesis is denoted by \mathcal{H}_{1} and implies that not all Alzheimer’s patients have fungal infections. Note that the hypotheses are statements about the population consisting of all (i.e., past, present, and future) patients suffering from Alzheimer’s disease. As we will never have this population at hand, we cannot check every patient for fungal material, and we can never be 100% certain that the hypothesis \mathcal{H}_{0} holds true.

Statistical models: Reasoning from the general population to all possible outcomes

Due to sampling error, statements about the population such as \mathcal{H}_{1} do not transfer directly to the relatively small samples of patients. To account for this error we assume a statistical model for the data. The data consist of the number of successes S and the number of failures F=n-S, where S is the number of Alzheimer’s patients in whom fungal material is found in a sample of size n. The standard model for this situation is the binomial distribution, which depends on the sample size n and the binomial rate parameter \theta.
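
Concretely, the binomial model assigns to the outcome S = s the chance

    \[\text{Bin}(s \mid n, \theta) = \binom{n}{s} \theta^{s} (1 - \theta)^{n - s},\]

so each value of \theta generates its own predictions about the possible outcomes.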

Linking the hypotheses to the experiment

This model allows us to make predictions from the hypothesized general population about all possible outcomes. For instance, when n = 10 the potential outcomes are S = 0, 1, 2, \ldots, 8, 9, 10 successes. Within the binomial model, the null hypothesis is operationalized as

    \[\mathcal{M}_{0}: S \text{ is distributed according to a binomial: Bin(n,} \theta =1),\]

which implies that if the hypothesis \mathcal{H}_{0} that is concerned with the population holds true, then every Alzheimer’s patient in the sample has fungal material in his or her brain. In other words, before the experiment is run and if the null holds, then the outcomes S=0, 1, 2, \ldots, 8, 9 are predicted to occur with 0% chance, while S=10 is predicted to occur with 100% chance.

On the other hand, the alternative hypothesis is operationalized as

    \[\mathcal{M}_{1}: S \text{ is distributed according to a binomial: Bin(n, } \theta ) \text{ where } \theta \in (0, 1)\]

by which we mean that there is some fixed, but unknown, proportion \theta of the Alzheimer’s population that has fungal material in their brains. For instance, if n=10 and \theta=0.5, then there is a 24.6% chance to observe S=5, a 20.5% chance to observe S=6, and an 11.7% chance to observe S=7 successes.

In fact, there’s a small chance of 0.098% to observe S=0 successes. Similarly, if \theta=0.7 then there’s only a 10.3% chance to see 5 successes, a 20% chance to see 6 successes, and a 26.7% chance to see 7 successes. Likewise there’s a small, but non-zero, chance to see S=0 successes out of n=10, see the figure below.
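
These chances can be verified with a few lines of Python; the following is a minimal sketch using scipy, not part of JASP:

```python
# Verify the binomial predictions quoted above.
from scipy.stats import binom

n = 10
for theta in (0.5, 0.7):
    for s in (0, 5, 6, 7):
        print(f"theta = {theta}: P(S = {s}) = {binom.pmf(s, n, theta):.4f}")
# theta = 0.5: P(S=5) = 0.2461, P(S=6) = 0.2051, P(S=7) = 0.1172,
#              and P(S=0) = 0.0010 (the 0.098% chance mentioned above)
# theta = 0.7: P(S=5) = 0.1029, P(S=6) = 0.2001, P(S=7) = 0.2668
```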

The alternative hypothesis implies that any \theta strictly between zero and one is possible. Note that we used the model to reason from the hypothesized general population to all possible outcomes, denoted with a capital S before the experiment is run. Furthermore, the predictions are stochastic, meaning that they are expressed as chances.

Bayesian statistics: Reasoning from a particular observation to the general population

Before the data are observed we also set a prior model probability P(\mathcal{M}_{0}), a number between zero and one, that expresses our belief about the null hypothesis. For instance, a skeptic might choose P(\mathcal{M}_{0}) = 0.25 and, therefore, has more confidence in the alternative, as P(\mathcal{M}_{1})=1 - P(\mathcal{M}_{0}) = 0.75. Formalizing the beliefs about the hypotheses as prior model probabilities allows us to use Bayes’ rule to compute the posterior model probabilities P( \mathcal{M}_{0} \mid d) and P( \mathcal{M}_{1} \mid d), that is, the updated probabilities for the two models after observing the data d, which consist of a particular realized number of successes s from a sample of size n. This leads to the crucial equation

    \[\underbrace{ \frac{ P( \mathcal{M}_{0} \mid d) }{ P( \mathcal{M}_{1} \mid d)}}_{\text{Posterior model odds}} = \text{BF}_{01}(d) \underbrace{ \frac{ P( \mathcal{M}_{0}) }{ P( \mathcal{M}_{1})}}_{\text{Prior model odds}} ,\]

where \text{BF}_{01}(d) refers to the Bayes factor in favor of the null over the alternative provided by the data d. As we explain below, the Bayes factor quantifies the relative predictive performance of the rival hypotheses.

Bayesian model learning

For instance, the skeptic’s prior beliefs of P(\mathcal{M}_{0}) = 0.25 and P(\mathcal{M}_{1}) = 0.75 imply prior model odds of one-to-three, that is, P(\mathcal{M}_{0}) / P(\mathcal{M}_{1}) = 1/3. This means that a Bayes factor of \text{BF}_{01}(d) > 3 is needed to convince the skeptic that the fungi hypothesis is more likely to be true than not. With the prior model odds and a Bayes factor at hand, the posterior model odds follow from a simple multiplication. As we only entertain two models, if the posterior model odds are of the form numerator / denominator, then the posterior model probabilities can be retrieved as

    \[P(\mathcal{M}_{0} \mid d ) = \frac{ \text{numerator} }{ \text{numerator} + \text{denominator}},\]

    \[P(\mathcal{M}_{1} \mid d ) = \frac{ \text{denominator} }{ \text{numerator} + \text{denominator}}.\]

For instance, if \text{BF}_{01}(d) = 2.5 then the posterior model odds are two-and-a-half-to-three, which leads to P(\mathcal{M}_{0} \mid d) = 2.5/(2.5+3)=0.45 and P(\mathcal{M}_{1} \mid d) = 3/(2.5+3)=0.55. If, instead, \text{BF}_{01}(d) = 12 then the posterior model odds are four-to-one, resulting in P(\mathcal{M}_{0} \mid d) = 4/(4+1)=0.8 and P(\mathcal{M}_{1} \mid d) = 1/(4+1) = 0.2.
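
In code, this updating rule is a one-liner; the sketch below (plain Python, not a JASP feature) reproduces the two examples:

```python
# Update prior model odds with a Bayes factor and convert the result
# to posterior model probabilities.
def posterior_probs(bf01, prior_odds):
    """Return (P(M0 | d), P(M1 | d)) from BF01 and prior odds P(M0)/P(M1)."""
    posterior_odds = bf01 * prior_odds
    p0 = posterior_odds / (posterior_odds + 1)
    return p0, 1 - p0

print(posterior_probs(2.5, 1/3))   # ~ (0.45, 0.55)
print(posterior_probs(12, 1/3))    # (0.8, 0.2)
```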

Note that if P(\mathcal{M}_{0})=1, we automatically have P(\mathcal{M}_{1})=0, which implies that we have no doubt that the null holds true. For these prior model probabilities we automatically get P(\mathcal{M}_{0} \mid d) =1 and P(\mathcal{M}_{1} \mid d)=0 regardless of the data that are observed. Hence, setting P(\mathcal{M}_{0})=1 (or P(\mathcal{M}_{0})=0) implies that we are not open to learning from data, and we therefore assume that P(\mathcal{M}_{0}) is strictly between zero and one.

It’s worth mentioning that Bayes factors are not the same as posterior model odds, and that Bayes factors do not depend on the prior model probabilities P(\mathcal{M}_{0}) and P(\mathcal{M}_{1}). At the time of writing, there’s no general option yet to specify prior model probabilities in JASP, but the program does provide Bayes factors. One of the advantages of the Bayes factor is that we can focus on the evidence provided by the data, and let everybody update their own prior model probabilities.

Bayes factor as the relative predictive adequacy of one model over the other

We are trying to update our knowledge (i.e., the prior model odds) by considering the predictive performance of the rival hypotheses in light of the observed data. The relative predictive performance of these hypotheses is known as the Bayes factor. In this scenario, it is defined as follows

    \[\text{BF}_{01}(s, n) = \frac{ \text{Bin}(s \mid n, 1)}{ \int \text{Bin}(s \mid n, \theta) \pi (\theta) \text{d} \theta },\]

where \text{Bin}( s \mid n, 1) in the numerator is a number, retrieved by evaluating the binomial likelihood under the null hypothesis \mathcal{H}_{0}: \theta=1 at the observed number of successes s and sample size n. Similarly, \text{Bin}( s \mid n, \theta) in the denominator is a number for each possible \theta between zero and one within the alternative model.

In JASP, this Bayes factor is equivalently expressed in terms of the observed number of successes s and failures f

    \[\text{BF}_{01}(s, f) = \frac{ \text{Bin}(s \mid s+f, 1)}{ \int \text{Bin}(s \mid s+f, \theta) \pi (\theta) \text{d} \theta },\]

since the sum of the successes and failures equals the sample size n.

The function \pi(\theta) is also referred to as a prior, and could be thought of as the experimenter’s uncertainty about the unknown \theta within the model \mathcal{M}_{1}, which we distinguish from the experimenter’s between-model uncertainty expressed by P(\mathcal{M}_{0}) and P(\mathcal{M}_{1}).

According to my reading of Jeffreys, in his Bayes factor construction \pi(\theta) does not reflect prior knowledge; instead, \pi(\theta) can best be viewed as a weighting function necessary to convert the likelihood function of the alternative model into a single number. More specifically, for \mathcal{M}_{1} the integral sign \int implies that at each possible \theta in (0, 1) the likelihood \text{Bin}(s \mid s+f, \theta) is multiplied with the weight \pi(\theta) and, subsequently, summed. As such, \int \text{Bin}(s \mid s+f, \theta) \pi (\theta) \text{d} \theta is a weighted average, i.e., the marginal likelihood, and leads to a number. As a result, the Bayes factor is a ratio of two numbers and therefore a number itself. For the binomial model, we typically use for \pi(\theta) the beta distribution with hyperparameters a and b. In JASP, a = 1 and b = 1 are set by default, but a user can change these values.
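
For the beta weighting function this marginal likelihood is available in closed form as the beta-binomial distribution, so the Bayes factor can be computed in a few lines. The following is an illustrative sketch in Python with scipy, not JASP’s actual implementation:

```python
# Bayes factor BF01 for H0: theta = 1 against the beta-weighted alternative.
from scipy.stats import betabinom

def bf01(s, f, a=1, b=1):
    n = s + f
    numerator = 1.0 if f == 0 else 0.0        # Bin(s | n, theta = 1)
    denominator = betabinom.pmf(s, n, a, b)   # the marginal likelihood
    return numerator / denominator

print(bf01(10, 0))           # 11.0, the value JASP reports below
print(bf01(10, 0, a=0.125))  # 81.0, see the section on hyperparameters below
```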

Calculating the default Bayes factor using the Summary Stats module

To get a better understanding of \pi(\theta), we first activate the Summary Stats module in JASP by clicking the “+” sign, and then go to “Frequencies” followed by “Bayesian Binomial Test” and tick “Prior and posterior” and “Additional info” under “Plots”. The default beta prior for \theta with a = 1 and b = 1 is shown to be the uniform distribution.

To see Bayesian model learning in action, we first change the test value to 1 corresponding to the null hypothesis we would like to test, and enter the observed number of successes s=10 and f=n-s=0 failures as reported in Pisa et al. (2015, p. 5).1

Note that the results of the analysis appear immediately. After selecting \text{BF}_{01} in the user interface, JASP reports a Bayes factor of \text{BF}_{01}(s=10, f=0) = 11, which provides direct evidence in favor of the experimenters’ working hypothesis (the null) over the alternative and leads to the following plot.

In this figure the Bayes factor is visualized using the so-called Savage-Dickey density ratio, which states that under certain conditions

    \[\text{BF}_{01}(d) = \frac{ \pi ( \theta_{0} \mid d)}{\pi(\theta_{0})} .\]

This equation implies that \text{BF}_{01}(d) is equal to the posterior density of the parameter \theta within the alternative model divided by its prior density, both evaluated at the test point, in this case, \theta_{0}=1. Hence, for the test the focus in these prior and posterior plots should be on the relative heights of the dots. In this case, the posterior evaluated at \theta=1 has height 11, whereas the prior evaluated at \theta=1 has height 1.
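
A quick numerical check of this ratio (a sketch assuming the beta prior from above, under which the posterior of \theta within \mathcal{M}_{1} is a Beta(s+a, f+b) distribution):

```python
# Savage-Dickey density ratio at the test point theta0 = 1.
from scipy.stats import beta

s, f, a, b = 10, 0, 1, 1
theta0 = 1 - 1e-9  # evaluate just inside (0, 1) to stay off the boundary
print(beta.pdf(theta0, s + a, f + b) / beta.pdf(theta0, a, b))  # ~ 11.0
```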

To highlight that the Bayes factor does not equal the posterior model probabilities, note that for the skeptic a Bayes factor of \text{BF}_{01}(s=10, f=0) = 11 implies that her prior model odds of one-to-three are updated to eleven-to-three resulting in P(\mathcal{M}_{0} \mid d ) = 11/(11+3) = 0.79 and P(\mathcal{M}_{1} \mid d) = 3/(11+3)=0.21. Hence the skeptic cannot rule out either hypothesis after observing the data, but she can conclude that the null hypothesis is now more likely than it was before.

How the hyperparameters of the beta distribution influence the Bayes factor

The analysis based on a = b = 1 has some interesting properties: with this setting, and as long as we see no failures, each additional success increases the evidence for the null over the alternative by one. For instance, entering s = 11 successes in JASP instead of s=10, we immediately get \text{BF}_{01}(s=11, f=0) = 12.

In general, with b=1 and for s=n we get \text{BF}_{01}(s=n, f=0) = 1/a \times n + 1. For instance, if a=0.125 then each additional successful observation increases the evidence by 1/a = 8, as we then get \text{BF}_{01}(s=10, f=0) = 81 and \text{BF}_{01}(s=11, f=0)= 89.

The hyperparameter b controls the growth rate of the evidence with respect to n as long as s=n, thus, f=0. Mathematically, we say that \text{BF}_{01}(s=n, f=0) is of the order n^{b}, which implies that the Bayes factor \text{BF}_{01}(s=n, f=0) grows quickly when b > 1, slowly when b < 1, and linearly when b=1. The following plot depicts the three cases.

Each curve in this plot represents Bayes factors \text{BF}_{01}(s=n, f=0) with s=n as a function of n based on a=1. For the dotted brown curve we have b=0.5 and observe a very slow increase. The solid blue curve shows the case with b=1, and the dashed green curve is based on b = 1.5, resulting in a Bayes factor \text{BF}_{01}(s=n, f=0) that grows so quickly that it runs off the chart. Entering the last case in JASP shows that with s=n and n=100 we get \text{BF}_{01}(s=100, f=0) = 766.40.
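
The curves can be reproduced with the closed-form marginal likelihood; a small sketch along the lines of the bf01 function above:

```python
# Evidence growth when only successes are observed (s = n), with a = 1:
# BF01(s = n, f = 0) = 1 / BetaBinom(n | n, a, b).
from scipy.stats import betabinom

for b_hyper in (0.5, 1.0, 1.5):
    bfs = [1 / betabinom.pmf(n, n, 1, b_hyper) for n in (10, 100)]
    print(b_hyper, [round(bf, 1) for bf in bfs])
# b = 0.5 grows slowly, b = 1 linearly, and b = 1.5 gives ~ 766.4 at n = 100,
# matching the JASP output quoted above.
```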

Regardless of the chosen a and b, each curve increases, which implies that observing only successes s=n from a larger sample provides more evidence for the null than from a smaller sample. Nonetheless, for any sample of size n consisting of only successes s=n, the Bayes factor \text{BF}_{01}(s=n, f=0) remains finite and, thus, P(\mathcal{M}_{1} \mid s=n, f=0) > 0, which implies that the alternative is not ruled out with certainty. For instance, with s=n and n=1,000,000 the Bayes factor is \text{BF}_{01}(s=1,000,000, f=0)=1,000,001 and the skeptic’s posterior model probability for the alternative is then P(\mathcal{M}_{1} \mid s=1,000,000, f=0) = 0.000003. This number may be small, but it is not zero.
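
In code (a toy check with a = b = 1, for which \text{BF}_{01}(s=n, f=0) = n + 1):

```python
# Even a million straight successes leave some posterior probability
# on the alternative for the skeptic with prior model odds of one-to-three.
n = 1_000_000
bf = n + 1                       # BF01(s = n, f = 0) with a = b = 1
posterior_odds = bf * (1 / 3)    # multiply by the skeptic's prior odds
print(1 / (posterior_odds + 1))  # P(M1 | d) ~ 0.000003: small, but not zero
```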

The black swan

The Bayes factor follows the logic of inductive reasoning: as long as only successes are observed, the evidence for the null keeps increasing. At the same time, no matter how many successes we have already observed, the alternative hypothesis can never be ruled out with certainty, i.e., P(\mathcal{M}_{1} \mid s=n, f=0) > 0. A posterior model probability of P(\mathcal{M}_{1} \mid s=n, f=0) = 0, thus, P(\mathcal{M}_{0} \mid s=n, f=0) = 1, would amount to arguing that “because all swans we have seen so far are white, we have proven that all other swans must also be white”. However, the general statement that “all swans are white” is logically false as soon as one black swan is observed. Similarly, the observation of a single Alzheimer’s patient without fungal material in the brain decisively falsifies the null hypothesis \mathcal{H}_{0} that all Alzheimer’s patients have fungal material in their brain.

Suppose now that the experiment is continued, and we observe an Alzheimer’s patient without fungal material, resulting in s=10 successes and f=n-s=1 failure. Entering these observations into JASP shows that \text{BF}_{01}(s=10, f=1) = 0, which implies that P(\mathcal{M}_{0} \mid s=10, f=1) = 0 as long as P(\mathcal{M}_{0}) < 1. In other words, the observation of one single failure utterly destroys the null hypothesis \mathcal{H}_{0} : \theta = 1, as it should.
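
The same collapse follows from the Bayes factor’s definition: one failure makes the numerator \text{Bin}(s \mid n, 1) exactly zero. A minimal check, mirroring the bf01 sketch above:

```python
# One observed failure gives the null likelihood Bin(10 | 11, theta = 1) = 0,
# so the Bayes factor collapses to zero regardless of the weighting function.
from scipy.stats import betabinom

s, f = 10, 1
numerator = 1.0 if f == 0 else 0.0                # zero: a black swan occurred
print(numerator / betabinom.pmf(s, s + f, 1, 1))  # BF01 = 0.0
```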

The Bayes factor compared to the posterior of the parameter

The null hypothesis, however, will not be destroyed if we mistreat the testing problem as one of estimation. As discussed in a previous blog post, when estimating the effect, we can focus on the posterior median of \theta, which serves as a best guess for the magnitude of the effect, while the 95% credible interval can be used as a measure of uncertainty about this best guess.

For instance, with s=n, thus, f=0, and n=10, we get a posterior median of \tilde{\theta}=0.939 and a 95% credible interval of [0.715, 0.998]. Note that if we change the test value to, say, \theta = 0.5, the 95% credible interval remains the same, which is a first hint that \pi(\theta \mid d) does not take the null hypothesis into account. The introduction of the black swan only leads to a shift of \pi(\theta \mid d), with a posterior median at \tilde{\theta}=0.864 and a 95% credible interval of [0.615, 0.979]. Recall that with the test value set to \theta=1, we have infinite evidence against the null, i.e., a Bayes factor of \text{BF}_{01}(s=10, f=1)=0, or equivalently, \text{BF}_{10}(s=10, f=1)=\infty. By default JASP then does not produce the prior and posterior plot. To have JASP display the credible interval nonetheless, we changed the test value to \theta=0.9999, but any other test value would provide the same 95% credible interval. (Please ignore the reported Bayes factor at the top of this plot, which is the result of testing the null \theta = 0.9999 instead of \theta = 1.)
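
These summaries follow directly from the beta posterior; a short sketch assuming the default a = b = 1 prior:

```python
# Posterior median and 95% credible interval under M1: Beta(s + 1, f + 1).
from scipy.stats import beta

for s, f in ((10, 0), (10, 1)):
    posterior = beta(s + 1, f + 1)
    lower, upper = posterior.ppf([0.025, 0.975])
    print(s, f, round(posterior.median(), 3), round(lower, 3), round(upper, 3))
# s=10, f=0: median 0.939, interval [0.715, 0.998]
# s=10, f=1: median 0.864, interval [0.615, 0.979]
```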

The difference in behavior between the Bayes factor and \pi(\theta \mid d) is best explained by the different questions these two quantities address. The Bayes factor takes the null seriously, addresses the existence problem, and updates the prior model odds to posterior model odds. In contrast, \pi(\theta \mid d) quantifies the uncertainty about the magnitude of the unknown parameter \theta, which by assumption is not fixed at one. In other words, a precondition of \pi(\theta) and \pi(\theta \mid d) is that the null is false and the alternative is true.

The Bayes factor compared to the p-value

The p-value also destroys the null hypothesis once a black swan is observed; however, it does not gradually accumulate evidence in favor of the null, nor does it communicate uncertainty, whenever only successes are observed, i.e., when s=n for increasing n.

For a sample of size n=11 resulting in s=10 successes, thus, f=1, we get p = 0 as displayed in the far right column of the main table of the JASP output screen. Hence, a single failure will cause the null hypothesis to be rejected at any positive level of \alpha, no matter how small.

On the other hand, when s=n and n=10, thus, f=0, we get p=1, but this is also the case for any other n. Hence, for the null hypothesis under test, the p-value is indifferent to the size of the sample, which provides a concrete demonstration of the fact that the p-value does not quantify learning or evidence here. As p=1 is larger than any chosen \alpha < 1, and \alpha =0.05 in particular, the decision is then to not reject the null. This sounds like good news, as the experimenters wanted to provide evidence for the null here. Unfortunately, not rejecting the null does not imply that we can accept the null hypothesis as true.

It is worth mentioning that the p-value is not the posterior probability of the null model. Instead, the p-value is calculated under the assumption that the null is true and gives the chance of the observed value of the statistic, in this case S, together with the more extreme, but not observed, potential outcomes of S. For the null hypothesis \mathcal{H}_{0}: \theta = 1 and s=10 out of n=10, the more extreme, but not observed, outcomes are S=0, 1, 2, \ldots, 9 and the p-value is the collective chance of observing S=0, 1, 2, \ldots, 9, 10. This is equivalent to summing the heights of the bars in the following bar plot:

On the other hand, whenever s \neq n, for instance, s=9, the p-value is the sum of the chances of observing S=0,1,2, \ldots, 8 and S=9, which are all zero whenever \mathcal{H}_{0}: \theta =1 holds. Summing all these zeroes again yields zero, which is why we get p=0.
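
The same calculation in code (a sketch of this one-sided binomial p-value):

```python
# p-value under H0: theta = 1, i.e., the chance of the observed count s
# together with all more extreme (smaller) counts.
from scipy.stats import binom

def p_value(s, n, theta0=1.0):
    return binom.cdf(s, n, theta0)  # P(S <= s) under the null

print(p_value(10, 10))  # 1.0: under theta = 1 all mass sits on S = n
print(p_value(10, 11))  # 0.0: a single failure has zero chance under the null
```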

It is interesting to note that to calculate \pi(\theta \mid d) the null is presumed to be false and not taken into account at all, whereas to calculate a p-value the null is assumed to be true and the alternative is not taken into account at all. For the calculation of a Bayes factor, in contrast, both the null and the alternative model are taken into account.

The p-value and the Bayes factor are both methods for testing, but it is good to realize that they have different purposes. The procedure to reject the null when p < \alpha focuses on all-or-none decision making, whereas the goal of a Bayes factor \text{BF}_{01}(d) is to quantify the graded evidence provided by the data in favor of the null over the alternative. With a more graded assessment the substantive experts, and not the statistician, can make the decision to accept or reject the null hypothesis, if such a decision is required.

In practice, we always make decisions under uncertainty, and for transparent reporting we should also communicate the uncertainties with which we make these decisions, for instance, using the Bayes factor and the posterior model probabilities. The decision itself might have profound consequences. For instance, acceptance of the null might imply further research on the fungi hypothesis and the development of new techniques for synthesizing specific anti-fungal antibodies, whereas a rejection might lead to the funding of another stream of research, which might require a new brain scanner. Both these policies come with costs. By acknowledging the model uncertainties in our forecasts (i.e., by reporting not only the decision to accept or reject), we can better assess the risks of the two policies.

Conclusion

It needs to be emphasized that this blog post is written without the benefit of any specialist knowledge of Alzheimer’s disease, and that we focused on the art and science of learning from data. We oversimplified the results of Pisa et al. (2015) to simple counts, while the study itself was more involved and much better reasoned.2

More convincing than plain statistical evidence would be the discovery of the mechanism by which the fungi cause Alzheimer’s disease. The authors seem to be well aware of this fact and also focus on this aspect of the research.

Nonetheless, we believe that the analysis based on the binomial model provides some statistical insight. Firstly, we mentioned the limits of inductive reasoning, which we believe affect all statistical methods, whether Bayesian or frequentist. The black swan example illustrated that a general claim can never be proven true with certainty from a finite number of observations; likewise, a causal claim cannot be established from observational data alone. In future blog posts, we elaborate on other ideas of Jeffreys regarding the Bayes factor.

Thanks to Eric-Jan Wagenmakers, Tim Draws, Maarten Marsman, and Alexander Etz for their comments on an earlier draft that helped me to write this blog post.



Footnotes

1 This is an oversimplification, as Pisa et al. (2015) is a replication of their own work, in which they first investigated one Alzheimer’s brain before considering ten others. Moreover, to further simplify the analysis, we ignore the fact that Pisa et al. (2015) also studied the brains of controls.

2 In fact, Pisa et al. (2015) is a replication study of their previous finding, which might be interesting to study using replication Bayes factors.

References

Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford, UK: Oxford University Press.

Ly, A., Raj, A., Etz, A., Marsman, M., Gronau, Q. F., & Wagenmakers, E.-J. (2018). Bayesian Reanalyses from Summary Statistics: A Guide for Academic Consumers. Advances in Methods and Practices in Psychological Science, 1(3), 367-374.

Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016a). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology, 72, 19-32.

Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016b). An evaluation of alternative methods for testing hypotheses, from the perspective of Harold Jeffreys. Journal of Mathematical Psychology, 72, 43-55. This is a reply to Christian Robert’s comment, and to Suyog H. Chandramouli and Richard M. Shiffrin’s comment, on our previous paper in which we summarize the ideas of Harold Jeffreys on hypothesis testing.

Pisa, D., Alonso, R., Rábano, A., Rodal, I., & Carrasco, L. (2015). Different brain regions are infected with fungi in Alzheimer’s disease. Scientific Reports, 5, 15015.

About the author

Alexander Ly

Alexander Ly is the CTO of JASP and responsible for guiding JASP’s scientific and technological strategy as well as the development of some Bayesian tests.