Teaching Bayesian Estimation with the Summary Stats Module

Background

The JASP Summary Stats module allows practitioners to complement frequentist analyses with a Bayesian alternative, easily and efficiently. As the name suggests, the Summary Stats module only requires commonly reported summary statistics such as the observed t-value and the sample size N. The ability to conduct Bayesian analyses from summary statistics is particularly useful when the raw data are not publicly available. For instance, you might read a published article and see the frequentist summary “r(13)=.56, p<.05”. How much evidence against the null hypothesis do the data provide? How robust is that evidence to changes in the prior scale? Which parameter ranges are more credible than others? The Bayesian reanalyses from the Summary Stats module allow researchers (and also editors, reviewers, readers, and reporters) to answer these questions with just a few keystrokes and mouse clicks. The paper that shows the Summary Stats module in action has just been accepted for publication in Advances in Methods and Practices in Psychological Science; a preprint is available here.

In this post we focus on the educational aspects and elaborate on the type of additional information a Bayesian reanalysis can provide. Specifically, the goal of this first blog post is to introduce the basic notions of Bayesian estimation and to show how the posterior distribution can be used to extract an informative best guess for the effect size, along with a measure of uncertainty about this best guess. In the next blog post, we elaborate on the basics of Bayes factors and discuss the idea of maximal evidence. Finally, in a subsequent blog post we elaborate on replication Bayes factors and how they can be computed in JASP.

The ideas conveyed here are illustrated using the seminal Festinger and Carlsmith (1959) study on cognitive dissonance. You can follow along and replicate all analyses in this blog post by activating the Summary Stats module in JASP, via the + icon next to the Common tab at the top of the JASP window. Alternatively, you can download the full analysis as an annotated JASP file from our OSF folder.

The Festinger & Carlsmith (1959) Cognitive Dissonance Study

The Summary Stats module greatly facilitates a Bayesian reanalysis of classical results. For instance, in the landmark publication Festinger and Carlsmith (1959, hereafter FC) outlined a theory to account for cognitive dissonance, a phenomenon they described as follows: “If a person is induced to do or say something which is contrary to his private opinion, there will be a tendency for him to change his opinion so as to bring it into correspondence with what he has done or said” (p. 209). Early experiments on cognitive dissonance (e.g., Kelman, 1953) induced participants to make a statement contrary to their personal opinion for the chance to gain a reward. It was hypothesized that greater rewards would produce a greater change in opinion, but the data showed the reverse: the smaller the reward, the greater the change in opinion. FC proposed a theory that could account for this behavioral pattern, which they subsequently put to the test in an ingenious experiment.

FC’s experiment included control, high reward, and low reward conditions, each with twenty participants. All participants performed a boring task for one hour, after which they were asked to take a survey and answer questions about, amongst other things, their enjoyment of the study. Where the conditions differ is in what happens after completing the boring task, but before completing the survey. In the reward conditions, participants were asked to interact with a confederate by telling them that the experiment was interesting and fun; for this they received either twenty dollars (high reward) or one dollar (low reward). In the control condition participants went straight to the post-interview and did not interact with the confederate. According to FC, the crucial test of their theory lies in comparing the post-interview enjoyment ratings from the low versus high reward conditions, where the low reward condition is predicted to have higher enjoyment ratings. In line with their theory’s prediction, FC found a higher mean enjoyment rating in the low reward group than in the high reward group, t(38)=2.22, p=.032, and this was taken as support for their theoretical position. In other words, based on the p-value, the null hypothesis was rejected and FC concluded that the effect of cognitive dissonance exists. A secondary question involves the magnitude of the effect, but no effect size estimate is reported in the original paper. However, this can be easily computed from the t-value and group sizes, giving a Cohen’s d of d=0.702, since

(1)   \begin{align*} d = t / \sqrt{N_{\delta}} \end{align*}

where N_{\delta} = 1/(1/N_{1} + 1/N_{2}) is known as the effective sample size. In this case, N_{\delta} = 10.
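Readers who want to verify this arithmetic can do so with a few lines of Python (a minimal sketch; the variable names are ours):

```python
import math

# Reported values from Festinger & Carlsmith (1959): t(38) = 2.22, n = 20 per group
t_value = 2.22
n1 = n2 = 20

# Effective sample size: N_delta = 1 / (1/N1 + 1/N2)
n_eff = 1 / (1 / n1 + 1 / n2)   # = 10

# Cohen's d recovered from the t-value: d = t / sqrt(N_delta)
cohens_d = t_value / math.sqrt(n_eff)
print(round(cohens_d, 3))  # 0.702
```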

Bayesian Reanalysis: Bayesian Estimation, Priors, and Posteriors

We wish to conduct a Bayesian reanalysis of the FC result, but the raw data from this study are no longer available. However, the Summary Stats module in JASP affords a comprehensive Bayesian reanalysis using only the test statistic reported in the original paper.

Entering the reported t-value and the sample sizes for the two groups, and ticking the “Prior and Posterior” and “Additional info” options:

yields the following result:

The dotted line represents the default (testing) prior for effect size \delta under \mathcal{H}_{1}: a zero-centered Cauchy distribution, here with a default scale of 0.707.

One way to interpret this prior is that under \mathcal{H}_{1} –that is, presuming that in the population the effect \delta is present– the expectation is that the effect \delta is most likely to be small, although the possibility that it is large is not ruled out.
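This interpretation can be made concrete: a zero-centered Cauchy distribution places exactly half of its mass within one scale parameter of zero, while its heavy tails still reserve appreciable mass for large effects. A quick check using SciPy (a sketch for illustration, not part of JASP):

```python
from scipy import stats

# With the default scale of 0.707, half the prior mass lies in (-0.707, 0.707):
# small-to-medium effects are deemed most plausible a priori.
scale = 0.707
p_within_one_scale = 2 * stats.cauchy.cdf(scale, loc=0, scale=scale) - 1
# p_within_one_scale is approximately 0.5

# The tails are heavy, however: roughly 10% of the prior mass
# lies beyond |delta| > 4.5, so very large effects are not ruled out.
p_far_out = 2 * (1 - stats.cauchy.cdf(4.5, loc=0, scale=scale))
```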

The solid line is the posterior distribution for effect size in the population, that is, the knowledge about effect size \delta obtained after updating the prior distribution using the observed data, and assuming that \mathcal{H}_{1} holds. This posterior distribution of \delta has a median of 0.571 (note that the prior distribution has shrunk the sample value of d=0.702 toward zero) and a relatively wide 95% credible interval that ranges from -0.032 to 1.197.1 The credible interval informs us that 95% of the posterior mass lies in the interval from -0.032 to 1.197; clearly, the effect has not been estimated with much precision.

In general, the posterior distribution quantifies all that we know about effect size \delta, given that \mathcal{H}_{1} holds and the effect exists. In other words, the posterior of \delta addresses questions of the type: “Under the presumption that the population effect size \delta exists: (a) What is the magnitude of \delta? (b) Since this estimate of \delta is based on a small sample of the population, how uncertain are we about the estimate of \delta?” The first question (a) is answered here with the posterior median, which serves as a best guess for \delta (i.e., \delta = 0.571), while the 95% credible interval serves as a measure of uncertainty about this best guess and addresses (b) (i.e., p(\delta \in [-0.032, 1.197] | \text{data}, \mathcal{H}_{1}) = .95).
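For readers curious about the mechanics, the posterior can be approximated outside JASP with a simple grid computation: the likelihood of the observed t-value given \delta is a noncentral t distribution with noncentrality parameter \delta \sqrt{N_{\delta}}, which is multiplied by the Cauchy prior and normalized. This is a sketch of the standard model, not JASP's exact algorithm, so the final digits may differ slightly from those reported above:

```python
import numpy as np
from scipy import stats

# Grid approximation of the posterior for delta given an observed t-value.
# Sketch of the standard noncentral-t / Cauchy-prior model; not JASP's
# exact routine, so expect small numerical differences.
t_obs, n1, n2, prior_scale = 2.22, 20, 20, 0.707
df = n1 + n2 - 2
n_eff = 1 / (1 / n1 + 1 / n2)

grid = np.linspace(-4, 4, 8001)
dx = grid[1] - grid[0]

# Likelihood of t_obs for each candidate delta: noncentral t with
# noncentrality parameter delta * sqrt(N_delta)
likelihood = stats.nct.pdf(t_obs, df, grid * np.sqrt(n_eff))
prior = stats.cauchy.pdf(grid, loc=0, scale=prior_scale)

posterior = likelihood * prior
posterior /= posterior.sum() * dx          # normalize on the grid

cdf = np.cumsum(posterior) * dx
median = grid[np.searchsorted(cdf, 0.50)]
ci_low = grid[np.searchsorted(cdf, 0.025)]
ci_high = grid[np.searchsorted(cdf, 0.975)]
# median, ci_low, ci_high come out near 0.57, -0.03, and 1.20
```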

Bayesian Estimation: The Role of the Prior Scale

So far, we have only considered the estimation problem based on a zero-centered Cauchy prior with scale 0.707. In general, the prior becomes more peaked as the scale decreases. This places more mass near zero and therefore produces greater shrinkage towards zero. To change the scale of the prior, we go to the “Prior” tab and replace 0.707 by, say, 0.1. This yields:

The smaller prior scale shrinks the posterior median further towards zero: it decreases from 0.571 to 0.193. Note, however, that the length of the credible interval is about the same as with a prior scale of 0.707. On the other hand, if we change the prior scale to 2, we get the following:

Now the shrinkage towards zero decreases and the posterior median increases from 0.571 to 0.675. Again, the length of the credible interval is not much affected by the change in prior scale. In general, the wider the prior, the more closely the posterior median resembles the sample Cohen’s d.
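The shrinkage pattern described above can be reproduced numerically with the same grid approximation idea (a self-contained sketch, not JASP's exact routine):

```python
import numpy as np
from scipy import stats

# Posterior median of delta for several Cauchy prior scales (grid sketch).
# Smaller scales pull the median towards zero; larger scales leave it
# closer to the sample Cohen's d of 0.702.
t_obs, n1, n2 = 2.22, 20, 20
df = n1 + n2 - 2
n_eff = 1 / (1 / n1 + 1 / n2)
grid = np.linspace(-4, 4, 8001)
dx = grid[1] - grid[0]

medians = {}
for scale in (0.1, 0.707, 2.0):
    post = (stats.nct.pdf(t_obs, df, grid * np.sqrt(n_eff))
            * stats.cauchy.pdf(grid, scale=scale))
    post /= post.sum() * dx                       # normalize on the grid
    medians[scale] = grid[np.searchsorted(np.cumsum(post) * dx, 0.5)]

# Expect medians of roughly 0.19, 0.57, and 0.68 for scales 0.1, 0.707, 2
```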

Large Sample Behavior

The influence of the prior washes out as the sample size increases. Let’s double the number of participants in each group (i.e., N_{1} = N_{2}=40) but retain the same Cohen’s d of d=0.702. The t-value then becomes t(78)=3.14. Setting the Cauchy scale back to 0.707 and entering these values into the Summary Stats module yields:

Observe that the posterior median increases from 0.571 to 0.633, which is closer to the Cohen’s d of d=0.702, and the length of the 95% credible interval decreases from 1.229 to 0.89.
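The t-value used above follows directly from Equation (1): keeping Cohen's d fixed while doubling each group doubles the effective sample size N_{\delta}. A quick check:

```python
import math

# Same Cohen's d, doubled groups: N1 = N2 = 40, so N_delta = 20
d = 0.702
n1 = n2 = 40
n_eff = 1 / (1 / n1 + 1 / n2)        # = 20

# Invert Equation (1): t = d * sqrt(N_delta)
t_new = d * math.sqrt(n_eff)
print(round(t_new, 2))  # 3.14
```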

In general, the shrinkage towards zero (a consequence of the zero-centered Cauchy prior) lessens as the sample size increases, and with large enough samples the posterior median becomes indistinguishable from Cohen’s d. Similarly, the credible interval becomes narrower as the sample size increases. Hence, as more data are collected, the posterior becomes more peaked and we gain more certainty about the point estimate. This large-sample behavior implies that the influence of the prior vanishes as the sample size grows. Priors other than the Cauchy show the same large-sample behavior as long as they assign prior mass to the possible values of \delta.

Intermediate Conclusion

We discussed the problem of estimating the population effect size \delta under the presumption that it exists, and showed how one can summarize a posterior distribution using a point estimate (i.e., the posterior median) and a measure of uncertainty about this estimate (i.e., the credible interval). Furthermore, we showed how the prior scale influences how much the posterior median is shrunk towards zero. Finally, we illustrated that with large enough samples the influence of the prior washes out, so the choice of prior no longer matters.

In the next blog post we explain the pizza plots at the top of the figure and elaborate on the Bayes factor.




Footnotes

1 Your credible interval and posterior median might vary slightly, due to the random sampling method used to calculate them. A future version of JASP will use a more stable computational method, resulting in less variability.

References

Ly, A., Raj, A., Marsman, M., Etz, A., & Wagenmakers, E.-J. (in press). Bayesian reanalyses from summary statistics: A guide for academic consumers. Advances in Methods and Practices in Psychological Science.

About the author

Alexander Ly

Alexander Ly is the CTO of JASP and responsible for guiding JASP’s scientific and technological strategy as well as the development of some Bayesian tests.