This post is a teaser for van den Bergh, D., Wagenmakers, E., & Aust, F. (2022). Bayesian Repeated-Measures ANOVA: An Updated Methodology Implemented in JASP. Preprint available on PsyArXiv: https://psyarxiv.com/fb8zn/
In JASP 0.16.3 we changed the default model for the Bayesian repeated-measures ANOVA. It is important to understand this change, as it may affect your results, typically bringing them more in line with those of the frequentist ANOVA. The older method is still available in JASP, but it is no longer the default. In this post we explain what the new default means and how it differs from both previous JASP versions and the status quo.
Repeated-measures Analysis of Variance (ANOVA) is among the most widely used statistical tests in psychological science. It tests the influence of one or more experimental manipulations on a continuous outcome. For example, in 1935 John Stroop asked participants to name the color in which a word was printed. He observed that participants responded faster when a word’s meaning and its print color were congruent (the word “blue” printed in blue), yet slower when meaning and color were incongruent (the word “blue” printed in red). This classic phenomenon is now known as the Stroop effect. To test whether the reaction times in the congruent and incongruent conditions differ significantly, we can use a repeated-measures ANOVA.
Given that ANOVA is used so often, you might expect the theory and implementations to be worked out entirely, but the opposite is true! With the advance of Bayesian statistics in recent decades, a Bayesian ANOVA was born (Rouder et al., 2012), so there are now (at least) two ways to conduct an ANOVA: classically, with p-values, or in a Bayesian way, with Bayes factors. This statistical diversity is a good thing, as it provides multiple angles on the same data. However, if the classical and Bayesian results contradict each other, we have a problem. And as you might expect, quite a few people have reported on the JASP forum that the classical and Bayesian repeated-measures ANOVA gave them contradictory results.
Here we dive into an example where the classical and Bayesian ANOVA contradict each other, identify the source of the discrepancy, and propose a solution that brings the classical and Bayesian results into closer agreement. We’ll use a data set on the Stroop effect kindly shared with us by Hershman et al. (2022). The data are publicly available in the JASP data library.
Hershman and colleagues studied how the Stroop effect is affected by breaks. As in Stroop’s experiment, participants responded to congruent and incongruent words. Additionally, they saw colored nonwords, such as “XXXX”, that made up a neutral condition. Furthermore, there were “break” trials where a black square was shown, indicating that participants did not need to respond. Altogether, this experiment uses a 3 (Congruency: congruent vs. neutral vs. incongruent) by 2 (Preceding trial: Break vs. Stroop task) repeated-measures design. Here we analyze the data after aggregating trials for each person.
Before analyzing the data, it is good practice to visualize the descriptive statistics. Here we do so with a raincloud plot:
Dots represent each participant’s average response times for trials preceded by breaks and trials preceded by Stroop trials (each participant contributes two dots per congruency condition). Response times in the incongruent condition appear slightly slower than in the congruent condition, but the differences are not self-evident. Here is another raincloud plot for Preceding trial:
Again, the differences don’t speak for themselves. We’ll need to use statistics to shed light on the data!
The classical results are shown below.
The main effect of Preceding trial and the interaction effect Preceding trial ✻ Congruency are not significant (p > .05), while the main effect of Congruency is significant (p < .001). In addition, the JASP table footnote informs us that the assumption of sphericity is violated. If we enable sphericity corrections, the conclusions remain unchanged.
Now let’s look at the Bayesian results in JASP 0.16.2 and before.
The best two models are Preceding trial + Congruency and the model with only Preceding trial. If we summarize across all models we obtain the following table:
Strikingly, there is evidence in favor of including Preceding trial, inconclusive evidence for Congruency, and evidence against the interaction effect (BFexcl = 1 / 0.310 ≅ 3.226). These results are completely at odds with the classical analysis.
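As an aside, the exclusion Bayes factor is simply the reciprocal of the inclusion Bayes factor. A minimal R sketch of that arithmetic, using the inclusion Bayes factor of 0.310 reported above:

```r
# BFincl quantifies the evidence for including the interaction effect,
# averaged across all candidate models; BFexcl is its reciprocal.
bf_incl <- 0.310        # inclusion BF for Preceding trial * Congruency
bf_excl <- 1 / bf_incl  # evidence against including the interaction
round(bf_excl, 3)       # 3.226
```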
So what drives the difference? You would be forgiven for thinking this is another example where the use of prior distributions and Bayes factors suggests different conclusions than classical p-values. But in this case, the two approaches actually fit different models, and this is what leads to the conflicting conclusions. In the classical ANOVA, the full model contains random slopes for all but the highest-order repeated-measures interaction; the Bayesian ANOVA did not include these random slopes. You might wonder: what happens if we add these random slopes to the Bayesian ANOVA? Well, this is exactly what we changed in JASP 0.16.3. The Bayesian results are now in agreement with the classical results:
The “Analysis of Effects” table shows that if we average across the models we now find overwhelming evidence in favor of including Congruency and indecisive evidence for the other two effects:
Putting it all together, we obtain the following results.
From version 0.16.3 onward, JASP includes random slopes for all but the highest-order repeated-measures interaction by default. It is possible to replicate the behavior of older versions by checking “Legacy results”, which can be found under “Additional Options”.
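In R model-formula notation, the old and new defaults correspond roughly to the sketch below (the variable names `rt`, `congruency`, `precede`, and `subject` are hypothetical stand-ins for the Stroop design):

```r
# Old default (JASP <= 0.16.2): subject enters only as a random intercept
rt ~ congruency * precede + subject

# New default (JASP >= 0.16.3): random slopes for every repeated-measures
# effect except the highest-order interaction
rt ~ congruency * precede + subject +
  subject:congruency + subject:precede
```

The classical repeated-measures ANOVA implicitly uses the second structure, which is why matching it brings the Bayesian results into line.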
JASP inherited the omission of random slopes from the R package BayesFactor (Morey & Rouder, 2022), which powers its Bayesian ANOVAs. We are in touch with the maintainer of the BayesFactor package about implementing a similar patch so that R users also avoid this discrepancy. Until then, we recommend avoiding the function anovaBF() and instead using lmBF() or generalTestBF() to add the random slopes manually.
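As a sketch of that workaround, the snippet below compares a full model with random slopes against a random-effects-only null model. It assumes an aggregated data frame `stroopData` with columns `rt`, `congruency`, `precede`, and `subject` (all hypothetical names; the factors must be coded as R factors):

```r
library(BayesFactor)

# Full model: fixed effects plus random intercept and random slopes
bf_full <- lmBF(
  rt ~ congruency * precede + subject +
    subject:congruency + subject:precede,
  data        = stroopData,  # hypothetical aggregated data frame
  whichRandom = c("subject", "subject:congruency", "subject:precede")
)

# Null model: the same random-effects structure, without fixed effects
bf_null <- lmBF(
  rt ~ subject + subject:congruency + subject:precede,
  data        = stroopData,
  whichRandom = c("subject", "subject:congruency", "subject:precede")
)

bf_full / bf_null  # Bayes factor for the full set of fixed effects
```

generalTestBF() takes the same formula and whichRandom arguments and enumerates the submodels automatically, which is convenient when you want the full model-comparison table rather than a single Bayes factor.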
Given the example above, a natural question is: are my results with JASP (or BayesFactor) wrong? There is, unfortunately, no way to tell for sure without reanalyzing the data. However, note that the discrepancy can only arise if there are two or more repeated-measures factors; with a single repeated-measures factor the results are unchanged. Even with two or more repeated-measures factors, if the random slopes are relatively small then the conclusions shouldn’t change. But the only way to know for sure is to reanalyze the data while including random slopes.
Note that the legacy analysis is not “incorrect” per se; if the assumption of an absence of individual differences is reasonable, the legacy analysis may be more appropriate than the new default. However, we believe that the new default is more widely applicable and connects better to what users think they are getting when they run the analysis. We hope this blog post and the associated preprint help provide a better understanding of the models at hand. We suspect that this is not the last post on this topic, so please let us know if there is anything you would like to see explained further! Finally, we thank Mattan S. Ben-Shachar for a spirited Twitter discussion that prompted us to finish this post.
Hershman, R., Dadon, G., Kisel, A., & Henik, A. (2022). Resting Stroop task: Evidence of task conflict in trials with no required response. Unpublished manuscript.
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643–662. https://doi.org/10.1037/h0054651
Rouder, J. N., Morey, R. D., Speckman, P. L., & Province, J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5), 356–374. https://doi.org/10.1016/j.jmp.2012.08.001
Morey, R. D. & Rouder, J. N. (2022). BayesFactor: Computation of Bayes Factors for Common Designs [Computer software]. R package version 0.9.12-4.4. Retrieved from https://CRAN.R-project.org/package=BayesFactor