In this blog post we elaborate on the ideas behind Harold Jeffreys’s Bayes factor and illustrate this test with the Summary Statistics module in JASP.
In a previous blog post we discussed the estimation problem, where the goal was to infer, from the observed data, the magnitude of the population effect. Before studying the size of an effect, however, we arguably first need to investigate whether an effect actually exists. Here we address the existence problem with a hypothesis test and we emphasize the difference between testing and estimation.
The outline of this blog post is as follows: Firstly, we discuss a hypothesis proposed in a recent study relating fungal infections to Alzheimer’s disease. This hypothesis is then operationalized within a statistical model, and we discuss Bayesian model learning in general, before we return to the Alzheimer’s example. This is followed by a comparison of the Bayes factor to other methods of inference, and the blog post concludes with a short summary.
Running example
To address the existence problem we compare two hypotheses. An example of a hypothesis is
$$H_0: \text{All Alzheimer’s patients have fungal infections in their brains.}$$
This hypothesis was proposed by Pisa, Alonso, Rábano, Rodal and Carrasco (2015), who conducted an experiment to assess the evidence for this null hypothesis. The alternative hypothesis is denoted by $H_1$ and implies that not all Alzheimer’s patients have fungal infections. Note that the hypotheses are statements about the population consisting of all (i.e., past, present, and future) patients suffering from Alzheimer’s disease. As we will never have this population at hand, we cannot check every patient for fungal material, and we can never be 100% certain that the hypothesis $H_0$ holds true.
Statistical models: Reasoning from the general population to all possible outcomes
Due to sampling error, statements about the population such as $H_0$ do not transfer directly to the relatively small samples of patients. To account for this error we assume a statistical model for the data. The data will consist of the number of successes $s$ and the number of failures $f$, where $s$ is the number of Alzheimer’s patients in whom fungal material is found in a sample of size $n$. The standard model for this situation is known as the binomial distribution, which depends on the sample size $n$ and the binomial rate parameter $\theta$.
Linking the hypotheses to the experiment
This model allows us to make predictions from the hypothesized general population about all possible outcomes. For instance, when $n = 10$ the potential outcomes are $S = 0, 1, \ldots, 10$ successes. Within the binomial model, the null hypothesis is operationalized as
$$H_0: \theta = 1,$$
which implies that if the hypothesis about the population holds true, then every Alzheimer’s patient in the sample has fungal material in his or her brain. In other words, before the experiment is run and if the null holds, the outcomes $S = 0, 1, \ldots, 9$ are predicted to occur with 0% chance, while $S = 10$ is predicted to occur with 100% chance.
On the other hand, the alternative hypothesis is operationalized as
$$H_1: \theta \in (0, 1),$$
by which we mean that there is some fixed, but unknown, proportion of the Alzheimer population that has fungal material in their brains. For instance, if $\theta = 0.5$ and $n = 10$, then there is a 24.6% chance to observe $S = 5$, a 20.5% chance to observe $S = 6$, and an 11.7% chance to observe $S = 7$ successes.
In fact, there’s a small chance of 0.098% to observe $S = 10$ successes. Similarly, if $\theta = 0.7$ then there’s only a 10.3% chance to see 5 successes, a 20% chance to see 6 successes, and a 26.7% chance to see 7 successes. Likewise there’s a small, but non-zero, chance to see $S = 10$ successes out of $n = 10$; see the figure below.
The alternative hypothesis implies that any $\theta$ strictly between zero and one is possible. Note that we used the model to reason from the hypothesized general population to all possible outcomes, denoted with a capital $S$, before the experiment is run. Furthermore, the predictions are stochastic, meaning that they are expressed as chances.
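The predictions quoted above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library; the helper name `binomial_pmf` is our own and is not part of JASP.

```python
from math import comb


def binomial_pmf(k: int, n: int, theta: float) -> float:
    """Chance of observing k successes in n trials when the rate is theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


n = 10
for theta in (0.5, 0.7, 1.0):
    # Predicted chance (in %) of each potential outcome S = 0, 1, ..., n.
    chances = [round(100 * binomial_pmf(k, n, theta), 3) for k in range(n + 1)]
    print(f"theta = {theta}: {chances}")
```

Setting `theta = 1.0` mimics the null hypothesis: every outcome except ten successes is predicted with 0% chance.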
Bayesian statistics: Reasoning from a particular observation to the general population
Before the data are observed we also set a prior model probability $P(H_0)$, a number between zero and one, that expresses our belief about the null hypothesis. For instance, a skeptic might choose $P(H_0) = 0.25$ and, therefore, has more confidence in the alternative, as $P(H_1) = 1 - P(H_0) = 0.75$. Formalizing the beliefs about the hypotheses as prior model probabilities allows us to use Bayes’ rule to compute posterior model probabilities $P(H_0 \mid d)$ and $P(H_1 \mid d)$ that represent the updated probabilities for the two models after observing the data $d$, consisting of a particular realized number of successes $s$ of the experiment based on a sample of size $n$. This leads to the crucial equation
$$\frac{P(H_0 \mid d)}{P(H_1 \mid d)} = \text{BF}_{01}(d) \times \frac{P(H_0)}{P(H_1)},$$
where $\text{BF}_{01}(d)$ refers to the Bayes factor in favor of the null over the alternative provided by the data $d$. As we explain below, the Bayes factor quantifies the relative predictive performance of the rival hypotheses.
Bayesian model learning
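For instance, the skeptic’s prior beliefs of $P(H_0) = 0.25$ and $P(H_1) = 0.75$ imply prior model odds of one-to-three, that is, $P(H_0) / P(H_1) = 1/3$. This means that a Bayes factor of $\text{BF}_{01}(d) > 3$ is needed to convince the skeptic that the fungi hypothesis is more likely to be true than not. With the prior model odds and a Bayes factor at hand, the posterior model odds follow from a simple multiplication. As we only entertain two models, if the posterior model odds are of the form numerator / denominator, then the posterior model probabilities can be retrieved as
$$P(H_0 \mid d) = \frac{\text{numerator}}{\text{numerator} + \text{denominator}} \quad \text{and} \quad P(H_1 \mid d) = \frac{\text{denominator}}{\text{numerator} + \text{denominator}}.$$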
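For instance, if $\text{BF}_{01}(d) = 2.5$ then the posterior model odds will be two-and-a-half-to-three, which leads to $P(H_0 \mid d) = 2.5/5.5 \approx 0.45$ and $P(H_1 \mid d) = 3/5.5 \approx 0.55$. If instead $\text{BF}_{01}(d) = 12$, then the posterior model odds will be four-to-one, resulting in $P(H_0 \mid d) = 0.8$ and $P(H_1 \mid d) = 0.2$.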
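The update from prior model odds to posterior model probabilities is simple enough to sketch in code. The function below is a hypothetical helper of our own (it is not a JASP function); it takes the prior probability of the null and a Bayes factor in favor of the null, and assumes that only two models are entertained.

```python
from math import isinf


def posterior_model_probs(prior_h0: float, bf01: float) -> tuple[float, float]:
    """Return (P(H0 | d), P(H1 | d)) from P(H0) and the Bayes factor BF01."""
    if not 0.0 < prior_h0 < 1.0:
        raise ValueError("P(H0) must lie strictly between zero and one")
    prior_odds = prior_h0 / (1.0 - prior_h0)
    posterior_odds = bf01 * prior_odds  # posterior odds = BF01 x prior odds
    if isinf(posterior_odds):
        return 1.0, 0.0
    p_h0 = posterior_odds / (1.0 + posterior_odds)
    return p_h0, 1.0 - p_h0


# The skeptic from the text: prior model odds of one-to-three.
for bf01 in (2.5, 3.0, 12.0):
    print(bf01, posterior_model_probs(0.25, bf01))
```

A Bayes factor of exactly 3 leaves the skeptic at fifty-fifty, which is why a larger value is needed before the null becomes the more probable hypothesis.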
Note that if $P(H_0) = 1$, we automatically have $P(H_1) = 0$, which implies that we have no doubt that the null holds true. For these prior model probabilities we automatically get $P(H_0 \mid d) = 1$ and $P(H_1 \mid d) = 0$ regardless of the data that are observed. Hence, setting $P(H_0) = 1$ (or $P(H_0) = 0$) implies that we are not open to learning from data, and we therefore assume that $P(H_0)$ is strictly between zero and one.
It’s worth mentioning that Bayes factors are not the same as posterior model odds, and that Bayes factors do not depend on the prior model probabilities $P(H_0)$ and $P(H_1)$. At the time of writing, there’s no general option yet to specify prior model probabilities in JASP, but the program does provide Bayes factors. One of the advantages of the Bayes factor is that we can focus on the evidence provided by the data, and let everybody update their own model priors.
Bayes factor as the relative predictive adequacy of one model over the other
We are trying to update our knowledge (i.e., the prior model odds) by considering the predictive performance of the rival hypotheses in light of the observed data. The relative predictive performance of these hypotheses is known as the Bayes factor. In this scenario, it is defined as follows
$$\text{BF}_{01}(d) = \frac{p(d \mid H_0)}{p(d \mid H_1)} = \frac{f(s \mid n, \theta = 1)}{\int_0^1 f(s \mid n, \theta) \, \pi(\theta) \, \mathrm{d}\theta},$$
where $f(s \mid n, \theta = 1)$ in the numerator is a number, which is retrieved by evaluating the binomial likelihood at the null hypothesis $\theta = 1$ given the observed number of successful counts $s$ and sample size $n$. Similarly, $f(s \mid n, \theta)$ in the denominator is a number for each possible $\theta$ between zero and one within the alternative model.
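In JASP, this Bayes factor is equivalently expressed in terms of the observed number of successes and failures
$$\text{BF}_{01}(s, f) = \frac{\left. \theta^{s} (1 - \theta)^{f} \right|_{\theta = 1}}{\int_0^1 \theta^{s} (1 - \theta)^{f} \, \pi(\theta) \, \mathrm{d}\theta},$$
since the binomial coefficient cancels and the sum of the successes and failures equals the size of the sample, $n = s + f$.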
The function $\pi(\theta)$ is also referred to as a prior, and could be thought of as the experimenter’s uncertainty about the unknown $\theta$ within the model $H_1$, which we distinguish from the experimenter’s between-model uncertainty expressed by $P(H_0)$ and $P(H_1)$.
According to my reading of Jeffreys, $\pi(\theta)$ in his Bayes factor construction does not reflect prior knowledge; instead, $\pi(\theta)$ can best be viewed as a weighting function necessary to convert the likelihood function of the alternative model into a single number. More specifically, for $p(d \mid H_1)$ the integral sign $\int_0^1 \ldots \, \mathrm{d}\theta$ implies that at each possible $\theta$ in $(0, 1)$ the likelihood $f(s \mid n, \theta)$ is multiplied with the weight $\pi(\theta)$ and, subsequently, summed. As such, $p(d \mid H_1)$ is a weighted average, i.e., the marginal likelihood, and leads to a number. As a result, the Bayes factor is a ratio of two numbers and therefore a number itself. For the binomial model, we typically use for $\pi(\theta)$ the beta distribution with hyperparameters $a$ and $b$. In JASP $a = 1$ and $b = 1$ are set by default, but a user can change these values.
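To make the "weighted average" reading concrete, the sketch below computes the marginal likelihood under the alternative in two ways: through the closed-form ratio of beta functions, and as an explicit weighted sum over a fine grid of rate values. All function names are our own illustrative choices, not JASP internals, and the binomial coefficient is dropped because it cancels in the ratio.

```python
from math import exp, lgamma, log


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_pdf(theta: float, a: float, b: float) -> float:
    """Density of the beta(a, b) weighting function for 0 < theta < 1."""
    return exp((a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta(a, b))


def marginal_closed_form(s: int, f: int, a: float = 1.0, b: float = 1.0) -> float:
    """Integral of theta^s (1 - theta)^f times the beta(a, b) weight."""
    return exp(log_beta(a + s, b + f) - log_beta(a, b))


def marginal_midpoint(s: int, f: int, a: float = 1.0, b: float = 1.0,
                      grid: int = 20_000) -> float:
    """The same integral, approximated as an explicit weighted sum."""
    width = 1.0 / grid
    total = 0.0
    for i in range(grid):
        theta = (i + 0.5) * width  # midpoint of the i-th grid cell
        total += theta**s * (1 - theta) ** f * beta_pdf(theta, a, b) * width
    return total


def bf01(s: int, f: int, a: float = 1.0, b: float = 1.0) -> float:
    """Bayes factor for the point null theta = 1 against the weighted alternative."""
    null_kernel = 1.0**s * 0.0**f  # theta^s (1 - theta)^f at theta = 1
    return null_kernel / marginal_closed_form(s, f, a, b)


print(marginal_closed_form(10, 0), marginal_midpoint(10, 0))
print(bf01(10, 0))
```

The two marginal likelihoods should agree up to the discretization error of the grid, and with the default $a = b = 1$ and ten successes out of ten the resulting Bayes factor should be very close to 11.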
Calculating the default Bayes factor using the Summary Stats module
To get a better understanding of $\pi(\theta)$, we first activate the Summary Statistics module in JASP by clicking the “+” sign, and then go to “Frequencies” followed by “Bayesian Binomial Test” and tick “Prior and posterior” and “Additional info” under “Plots”. The default beta prior for $\theta$ with $a = 1$ and $b = 1$ is shown to be the uniform distribution.
To see Bayesian model learning in action, we first change the test value to $1$, corresponding to the null hypothesis we would like to test, and enter the observed number of $s = 10$ successes and $f = 0$ failures as reported in Pisa et al. (2015, p. 5).1
Note that the results of the analysis appear immediately, and after selecting $\text{BF}_{01}$ in the user interface, JASP reports a Bayes factor of $\text{BF}_{01}(d) = 11$ in favor of the null over the alternative, which provides direct evidence in favor of the experimenters’ working hypothesis over the alternative and leads to the following plot.
In this figure the Bayes factor is visualized using the so-called Savage-Dickey density ratio, which states that under certain conditions
$$\text{BF}_{01}(d) = \frac{\pi(\theta = 1 \mid d)}{\pi(\theta = 1)}.$$
This equation implies that $\text{BF}_{01}(d)$ is equal to the posterior for the parameter $\theta$ within the alternative model divided by the prior at the test point, in this case, $\theta = 1$. Hence, for the test the focus in these prior and posterior plots should be on the relative heights of the dots. In this case, the posterior evaluated at $\theta = 1$ is at height 11, whereas the prior evaluated at $\theta = 1$ is at height 1.
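The two dot heights can be checked directly. The sketch below assumes the default uniform beta(1, 1) prior, so that the posterior after $s$ successes and $f$ failures is a beta(1 + s, 1 + f) distribution; `beta_pdf` is our own helper, and evaluating it at the boundary $\theta = 1$ only works here because the second beta parameter equals one.

```python
from math import exp, lgamma


def beta_pdf(theta: float, a: float, b: float) -> float:
    """Beta(a, b) density; at theta = 1 this is finite and positive only if b == 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm) * theta ** (a - 1) * (1 - theta) ** (b - 1)


s, f = 10, 0      # observed successes and failures
a, b = 1.0, 1.0   # default hyperparameters (uniform prior)

prior_height = beta_pdf(1.0, a, b)              # prior density at the test value
posterior_height = beta_pdf(1.0, a + s, b + f)  # posterior density at the test value
savage_dickey_ratio = posterior_height / prior_height

print(prior_height, posterior_height, savage_dickey_ratio)
```

Under these assumptions the ratio of the two heights recovers the Bayes factor reported for the data, up to floating-point rounding.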
To highlight that the Bayes factor does not equal the posterior model probabilities, note that for the skeptic a Bayes factor of $\text{BF}_{01}(d) = 11$ implies that her prior model odds of one-to-three are updated to eleven-to-three, resulting in $P(H_0 \mid d) = 11/14 \approx 0.79$ and $P(H_1 \mid d) = 3/14 \approx 0.21$. Hence the skeptic cannot rule out either hypothesis after observing the data, but she can conclude that the null hypothesis is now more likely than it was before.
How the hyperparameters of the beta distribution influence the Bayes factor
The analysis based on $a = b = 1$ has some interesting properties: with this setting, and as long as we see no failures, each additional success increases the evidence for the null over the alternative by one. For instance, entering $s = 11$ successes in JASP instead of $s = 10$, we immediately get $\text{BF}_{01}(d) = 12$.
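In general, with $f = 0$ and for $b = 1$ we get $\text{BF}_{01}(d) = (n + a)/a$. For instance, if $a = 1/8$ then each additional successful observation increases the evidence by 8, as we then get $\text{BF}_{01}(d) = 81$ for $n = 10$ and $\text{BF}_{01}(d) = 89$ for $n = 11$.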
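For readers who want to see where the linear growth comes from, here is a short derivation. It assumes a beta$(a, b)$ weighting function with beta function $B(a, b)$ and a sample consisting of successes only ($s = n$, $f = 0$), so that the likelihood at the test value $\theta = 1$ equals one.

$$
\begin{aligned}
p(d \mid H_1) &= \int_0^1 \theta^{n} \, \frac{\theta^{a-1} (1-\theta)^{b-1}}{B(a, b)} \, \mathrm{d}\theta = \frac{B(a + n, b)}{B(a, b)}, \\
\text{BF}_{01}(d) &= \frac{1}{p(d \mid H_1)} = \frac{B(a, b)}{B(a + n, b)} = \frac{\Gamma(a) \, \Gamma(a + b + n)}{\Gamma(a + b) \, \Gamma(a + n)}, \\
\text{BF}_{01}(d) &= \frac{a + n}{a} = 1 + \frac{n}{a} \quad \text{when } b = 1.
\end{aligned}
$$

With $a = b = 1$ this reduces to $n + 1$, so each additional success adds one to the Bayes factor. More generally, for $b = 1$ each additional success adds $1/a$, so an increment of 8 per success corresponds to $a = 1/8$.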
The hyperparameter $b$ controls the growth rate of the evidence with respect to $n$ as long as $s = n$, thus, $f = 0$. Mathematically, we say that $\text{BF}_{01}(d)$ is of the order $n^b$, which implies that if $b > 1$ the Bayes factor $\text{BF}_{01}(d)$ grows quickly, while for $b < 1$ it grows slowly, and for $b = 1$ linearly. The following plot depicts the three cases.
Each curve in this plot represents Bayes factors with
as a function of
based on
. For the dotted brown curve we have
and observe a very slow increase. The solid blue curve shows the case with
, and the dashed green curve is based on
resulting in a Bayes factor
that grows so quickly that it runs off the chart. Entering the last case in JASP shows that with
and
we get
.
Regardless of the chosen $a$ and $b$, each curve increases, which implies that observing only successes, $s = n$, from a larger sample provides more evidence for the null than from a smaller sample. For any sample of size $n$ consisting of only successes, $s = n$, the Bayes factor $\text{BF}_{01}(d)$ remains finite and, thus, $P(H_1 \mid d) > 0$, which implies that the alternative is not ruled out with certainty. For instance, with
and
the Bayes factor is
and the skeptic’s posterior model probability for the alternative remains then
. This number may be small, but it is not zero.
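The growth behavior can be explored numerically. The sketch below evaluates the Bayes factor for samples consisting of successes only, using the closed form $B(a, b) / B(a + n, b)$ for a beta$(a, b)$ weighting function. The hyperparameter values in the loop are illustrative choices of ours and are not necessarily the ones used in the plot; the skeptic's prior model odds of one-to-three are taken from the text.

```python
from math import exp, lgamma


def bf01_only_successes(n: int, a: float = 1.0, b: float = 1.0) -> float:
    """BF01 for s = n successes and f = 0 failures: B(a, b) / B(a + n, b)."""
    log_bf = lgamma(a) + lgamma(a + b + n) - lgamma(a + b) - lgamma(a + n)
    return exp(log_bf)


def skeptic_posterior_h1(bf01: float) -> float:
    """P(H1 | d) for prior model odds of one-to-three in favor of H1."""
    posterior_odds_h0 = bf01 / 3.0
    return 1.0 / (1.0 + posterior_odds_h0)


for b in (0.5, 1.0, 2.0):  # illustrative: slow, linear, and fast growth
    row = [round(bf01_only_successes(n, a=1.0, b=b), 2) for n in (10, 100, 1000)]
    print(f"b = {b}: {row}")

# However large the Bayes factor, the alternative keeps some posterior probability.
print(skeptic_posterior_h1(bf01_only_successes(1000)))
```

Because the Bayes factor is finite for every finite sample, the posterior probability of the alternative shrinks with more successes but never reaches zero.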
The black swan
The Bayes factor follows the rule of inductive reasoning; as long as only successes are observed, the evidence for the null keeps increasing. At the same time, no matter how many successes we have already observed, the alternative hypothesis can never be ruled out with certainty, i.e., $P(H_1 \mid d) > 0$. A posterior model probability of $P(H_1 \mid d) = 0$, thus, $P(H_0 \mid d) = 1$, implies that we argue that “because all swans we have seen so far are white, we have proven that all other swans must also be white”. However, the general statement that “all swans are white” is logically false as soon as one black swan is observed. Similarly, the observation of a single Alzheimer’s patient without fungal material in the brain decisively falsifies the null hypothesis $H_0$ that all Alzheimer’s patients have fungal material in their brain.
Suppose now that the experiment is continued, and we observe an Alzheimer’s patient without fungal material, resulting in $s = 10$ successes and $f = 1$ failure. Entering these observations into JASP shows that $\text{BF}_{01}(d) = 0$, which implies that $P(H_0 \mid d) = 0$ as long as $P(H_0) < 1$. In other words, the observation of one single failure utterly destroys the null hypothesis $H_0$, as it should.
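The reason a single failure is decisive is that the likelihood at the test value $\theta = 1$ contains the factor $(1 - \theta)^f = 0^f$. The sketch below illustrates this under the assumption of a uniform beta(1, 1) weighting function; the helper names are ours, and the sample of ten successes followed by one failure is used purely as an example.

```python
from math import exp, lgamma


def null_kernel(s: int, f: int) -> float:
    """theta^s (1 - theta)^f at the test value theta = 1 (0.0 ** 0 is 1.0 in Python)."""
    return 1.0**s * 0.0**f


def marginal_uniform_prior(s: int, f: int) -> float:
    """Integral of theta^s (1 - theta)^f over (0, 1), i.e. B(s + 1, f + 1)."""
    return exp(lgamma(s + 1) + lgamma(f + 1) - lgamma(s + f + 2))


def posterior_prob_h0(prior_h0: float, bf01: float) -> float:
    """P(H0 | d) from P(H0) < 1 and the Bayes factor BF01."""
    posterior_odds = bf01 * prior_h0 / (1.0 - prior_h0)
    return posterior_odds / (1.0 + posterior_odds)


bf01_black_swan = null_kernel(10, 1) / marginal_uniform_prior(10, 1)
print(bf01_black_swan)

# Even a strong prior belief in the null does not survive a single failure.
for prior_h0 in (0.25, 0.5, 0.999):
    print(prior_h0, posterior_prob_h0(prior_h0, bf01_black_swan))
```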
The Bayes factor compared to the posterior of the parameter
The null hypothesis, however, will not be destroyed if we mistreat the testing problem as one of estimation. As discussed in a previous blog post, when estimating the effect, we can focus on the posterior median of $\theta$, which serves as a best guess for the magnitude of the effect, and the 95%-credible interval can be used as a measure of uncertainty about this best guess for $\theta$.
For instance, with , thus,
, and
, we get a posterior median of
and a 95%-credible interval of
. Note that if we change the test value to, say,
, the 95%-credible interval remains the same, which is a first hint that the posterior $\pi(\theta \mid d)$ does not take the null hypothesis into account. The introduction of the black swan only leads to a shift of $\pi(\theta \mid d)$
with a posterior median at
and a 95%-credible interval of
. Recall that with the test value set to $1$, we have infinite evidence against the null, i.e., Bayes factor $\text{BF}_{10}(d) = \infty$, or equivalently, $\text{BF}_{01}(d) = 0$. By default JASP then does not produce the prior and posterior plot. To have JASP display the credible interval nonetheless, we changed the test value to
, but any other test value would provide the same 95%-credible interval. (Please ignore the reported Bayes factor at the top of this plot, which is the result of the Bayes factor with the null being
instead of
.)
The difference in behavior between the Bayes factor and $\pi(\theta \mid d)$ can best be explained by the different questions these two quantities try to address. The Bayes factor takes the null seriously, addresses the existence problem, and updates the prior model odds to posterior odds. In contrast, $\pi(\theta \mid d)$ quantifies the uncertainty about the magnitude of the unknown parameter $\theta$, which by assumption is not fixed at one. In other words, a pre-condition of $\pi(\theta)$ and $\pi(\theta \mid d)$ is the null being false, and the alternative being true.
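The estimation summaries can be sketched without any reference to a test value, which is exactly the point made above. The code below assumes a uniform beta(1, 1) prior, so that the posterior is beta(1 + s, 1 + f), and it assumes that the 95%-credible interval is the central (equal-tailed) one. For integer beta parameters the cumulative distribution function can be written as a binomial tail sum, and quantiles are found by bisection; all helper names are our own.

```python
from math import comb


def beta_cdf_int(x: float, alpha: int, beta: int) -> float:
    """CDF of a beta(alpha, beta) distribution with integer parameters."""
    m = alpha + beta - 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(alpha, m + 1))


def beta_quantile_int(p: float, alpha: int, beta: int, iterations: int = 100) -> float:
    """Quantile by bisection; the CDF is increasing on (0, 1)."""
    lower, upper = 0.0, 1.0
    for _ in range(iterations):
        middle = (lower + upper) / 2
        if beta_cdf_int(middle, alpha, beta) < p:
            lower = middle
        else:
            upper = middle
    return (lower + upper) / 2


def posterior_summary(s: int, f: int) -> tuple[float, float, float]:
    """Posterior median and central 95% interval under a uniform prior."""
    alpha, beta = 1 + s, 1 + f
    return (
        beta_quantile_int(0.5, alpha, beta),
        beta_quantile_int(0.025, alpha, beta),
        beta_quantile_int(0.975, alpha, beta),
    )


print(posterior_summary(10, 0))  # only successes
print(posterior_summary(10, 1))  # after one failure: a shift, not a collapse
```

Note that `posterior_summary` has no argument for the test value: the null hypothesis plays no role in these quantities.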
The Bayes factor compared to the p-value
The $p$-value also destroys the null hypothesis once a black swan case is observed; however, it does not gradually accumulate evidence in favor of the null, or communicate uncertainty, whenever only successes are observed, i.e., when $s = n$ for $n$ increasing.
For a sample of size $n = 11$ resulting in $s = 10$ successes, thus, $f = 1$, we get $p = 0$, as displayed in the far right column of the main table of the JASP output screen. Hence, a single failure will cause the null hypothesis to be rejected at any positive level of $\alpha$, no matter how small.
On the other hand, when $n = 10$ and $s = 10$, thus, $f = 0$, we get $p = 1$, but this is also the case for any other $n$. Hence, for the null hypothesis under test, the $p$-value is indifferent to the size of the sample, which provides a concrete demonstration of the fact that the $p$-value does not quantify learning or evidence here. As $p = 1$ is larger than any chosen $\alpha$, and $\alpha = 0.05$ in particular, the decision is then to not reject the null. This sounds like good news, as the experimenters wanted to provide evidence for the null here. Unfortunately, not rejecting the null does not imply that we can accept the null hypothesis as true.
It is worth mentioning that the $p$-value is not the posterior probability of the null model. Instead, the $p$-value is calculated under the assumption that the null is true and provides the chance of the observed value of the statistic, in this case $s$, together with the more extreme, but not observed, potential outcomes of $S$
. For the null hypothesis
and
out of
, the more extreme, but not observed, outcomes are
and the
-value is the collective chance of observing
. This is equivalent to summing up the height of the bars from the following bar plot:
On the other hand, whenever , for instance,
then the
-value is the sum of the chances of observing
and
, which are all zero whenever
. Summing all these zeroes again yields zero, which is why we get $p = 0$.
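The $p$-value calculation described above can be sketched as follows. We assume here one common definition of the two-sided exact binomial $p$-value, namely the total chance under the null of all outcomes that are no more probable than the observed one; whether JASP uses precisely this rule is an assumption on our part, and the function names are ours.

```python
from math import comb


def binomial_pmf(k: int, n: int, theta: float) -> float:
    """Chance of k successes in n trials when the rate is theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def exact_p_value(s: int, n: int, theta0: float) -> float:
    """Sum the chances of the observed outcome and all equally or less probable ones."""
    observed = binomial_pmf(s, n, theta0)
    tolerance = 1e-7  # guards against floating-point ties
    total = 0.0
    for k in range(n + 1):
        chance = binomial_pmf(k, n, theta0)
        if chance <= observed * (1 + tolerance):
            total += chance
    return min(1.0, total)


# Test value 1: only successes, for increasing sample sizes.
print([exact_p_value(n, n, 1.0) for n in (10, 100, 1000)])
# Test value 1: a single failure.
print(exact_p_value(10, 11, 1.0))
```

With the test value fixed at one, the result is the same for every sample consisting of successes only, and it drops to zero as soon as a single failure is entered.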
It is interesting to note that to calculate $\pi(\theta \mid d)$ the null is presumed to be false and not taken into account at all, whereas to calculate a $p$-value the null is assumed to be true and the alternative is not taken into account at all. For the calculation of a Bayes factor, in contrast, both the null and the alternative model are taken into account.
The $p$-value and the Bayes factor are both methods for testing, but it is good to realize that they have different purposes. The procedure to reject the null when $p < \alpha$ focuses on all-or-none decision making, whereas the goal of a Bayes factor $\text{BF}_{01}(d)$ is to quantify the graded evidence provided by the data in favor of the null over the alternative. With a more graded assessment the substantive experts, and not the statistician, can make the decision to accept or reject the null hypothesis, if such a decision is required. In practice, we always make decisions under uncertainty, and for transparent reporting, we should also communicate the uncertainties with which we make these decisions, for instance, using the Bayes factor and the posterior model probabilities. The decision itself might have profound consequences. For instance, acceptance of the null might imply further research on the fungi hypothesis and the development of new synthesis techniques of specific anti-fungal antibodies, whereas a rejection might lead to the funding of another stream of research, which might require a new brain scanner. Both these policies come with costs. By acknowledging the model uncertainties and taking them into account in our forecasts (i.e., not only the decision to accept or reject), we can better assess the risks of the two policies.
Conclusion
It needs to be emphasized that this blog post is written without the benefit of any knowledge of Alzheimer’s disease, and that we focused on the art and science of learning from data. We oversimplified the results of Pisa et al. (2015) to simple counts, while the study itself was more involved and much better reasoned.2
More convincing than plain statistical evidence would be to discover the mechanism by which the fungi cause Alzheimer’s disease. The authors seem to be well aware of this fact and also focus on this aspect of the research.
Nonetheless, we believe that the analysis based on the binomial model provides some statistical insight. Firstly, we mentioned the limits of inductive reasoning, which we believe affect all statistical methods, whether Bayesian or frequentist. The black swan example illustrated that it is impossible to prove a causal claim from observational data alone. In future blog posts, we elaborate on other ideas of Jeffreys regarding the Bayes factor.
Thanks to Eric-Jan Wagenmakers, Tim Draws, Maarten Marsman, and Alexander Etz for their comments on an earlier draft that helped me to write this blog post.
Footnotes
1 This is an oversimplification, as Pisa et al. (2015) is a replication of their own work in which they first investigated one Alzheimer’s brain before considering ten others. Moreover, to further simplify the analysis, we ignore the fact that Pisa et al. (2015) also studied the brains of controls.
2 In fact, Pisa et al. (2015) is a replication study of their previous finding, which might be interesting to study using replication Bayes factors.
References
Jeffreys, H. (1961). The Theory of Probability. Oxford University Press, Oxford, UK, 3rd edition.
Ly, A., Raj, A., Etz, A., Marsman, M., Gronau, Q. F., & Wagenmakers, E.-J. (2018). Bayesian Reanalyses from Summary Statistics: A Guide for Academic Consumers. Advances in Methods and Practices in Psychological Science, 1(3), 367-374.
Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016a). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology, 72, 19-32.
Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016b). An evaluation of alternative methods for testing hypotheses, from the perspective of Harold Jeffreys. Journal of Mathematical Psychology, 72, 43-55. This is a reply to Christian Robert’s comment, and to Suyog H. Chandramouli and Richard M. Shiffrin’s comment, on our previous paper in which we summarize the ideas of Harold Jeffreys on hypothesis testing.
Pisa, D., Alonso, R., Rábano, A., Rodal, I., & Carrasco, L. (2015). Different brain regions are infected with fungi in Alzheimer’s disease. Scientific reports, 5, 15015.