The Wonderful World of Marginal Means

This post was inspired by a conversation I had with Henrik Singmann, maintainer of the glorious afex package.

The latest iteration of JASP, version 0.12, features a much sought-after functionality in ANOVAs: specifying custom contrasts! This development sparked a lively discussion with some team members about the available options when following up on a detected main effect in an ANOVA. When an ANOVA leads to the conclusion that the group means differ, one can investigate this difference in JASP through post hoc tests, contrast analysis, or even marginal means. The question then arises: how do these analyses differ from each other? Statistically speaking – not that much. Scientifically speaking, however, they do. In this post, I aim to explain these similarities and differences by using custom contrasts in combination with a mixed ANOVA example from Andy Field: Looks & Personality. The JASP file can be found on the OSF.

This fictional data set, “Looks or Personality”, provides preference scores of eight men (I excluded the first two rows for illustrative purposes) and ten women for nine speed-dating partners who varied in their Attractiveness (attractive, average, ugly) and Charisma (high, some, none). This means we have two within-subjects factors (Attractiveness and Charisma), and one between-subjects factor (Gender). The main analysis gives us quite overwhelming evidence that the average preference scores differ for the three levels of Charisma (see the descriptives plot below for an illustration).

To follow up on this main effect, we can use either post hoc tests or contrast analyses. In JASP, even though these sit under separate menus, the underlying analysis code uses the emmeans package in both cases. This wonderful package, developed by Russell Lenth, is used by JASP throughout the frequentist (RM) ANOVA analyses for estimating the marginal means of the levels of each factor or combination of factors. These means are then reported directly and/or tested against 0 (under the marginal means menu), compared to all other marginal means (under the post hoc tests menu), or compared to specific marginal means (under the contrasts menu).

Marginal means

For starters, what are marginal means? Often, marginal means are equal to the descriptive means. However, in some cases, for instance in the case of unbalanced designs or inclusion of other variables in the model, the two differ. This is because the descriptive means are based solely on the observed data, whereas the marginal means are estimated based on the statistical model.

In order to illustrate this, here is the table with the marginal means for Charisma:

For a RM ANOVA model where Gender is included in the model.

For a RM ANOVA model where Gender is excluded from the model.

If we had computed the mean for the “High Charisma” group based on the descriptives, the result would have been (10*89.6 + 8*88.25 + 10*88.4 + 8*82.625 + 10*86.7 + 8*56) / 54 = 82.63. So, when gender is included in the model, the marginal means estimate tries to estimate what the mean would have been, had there been a balanced design (i.e., an equal number of men and women). Here it does not make a huge difference, but it is definitely something to think about and take into account. As can be seen from the standard errors and the widths of the confidence intervals of the estimates, including gender in the model also makes for a more precise estimate (i.e., lower standard error and narrower confidence interval).
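To make the difference between the two kinds of means concrete, here is a minimal Python sketch of the arithmetic. It reuses the cell means from the calculation above; the cells weighted by 10 belong to the women and those weighted by 8 to the men, and the three cells per gender are presumably the three levels of Attractiveness. The "balanced" mean simply gives both genders equal weight, which is my understanding of how emmeans weights by default. It illustrates the idea and is not a reproduction of the JASP output.

```python
# Cell means for "High Charisma", taken from the calculation above.
women_cells = [89.6, 88.4, 86.7]     # each based on n = 10 women
men_cells = [88.25, 82.625, 56.0]    # each based on n = 8 men
n_women, n_men = 10, 8


def mean(values):
    """Plain arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Descriptive mean: every observation counts once,
# so the larger group (women) pulls the result towards itself.
total_observations = len(women_cells) * n_women + len(men_cells) * n_men
descriptive_mean = (
    n_women * sum(women_cells) + n_men * sum(men_cells)
) / total_observations

# Balanced mean (assumption: equal weight per gender, regardless of group size).
balanced_mean = (mean(women_cells) + mean(men_cells)) / 2

print(round(descriptive_mean, 2))  # 82.63
print(round(balanced_mean, 2))     # 81.93
```

The gap between the two numbers is small here because the design is only mildly unbalanced (10 versus 8); with more lopsided group sizes it would grow.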

The descriptives plot illustrates why adding gender to the model increases the precision of the marginal means estimate: there seems to be a gender difference in the “High Charisma” group.

Comparing Marginal Means

Beyond estimating the marginal means per cell, there are two ways of testing for a difference between (combinations of) cells: post hoc tests and contrast analysis. The computations behind these comparisons are statistically equivalent, but which of the two to use depends on one central question: “Were these comparisons postulated by theory, before looking at the data?” If not, the resulting p-values ought to be corrected for multiplicity,¹ which is what the post hoc tests do. If the comparisons were planned, however, a contrast analysis can be conducted.

If we want to compare every level of Charisma with every other level, the corresponding post hoc analysis looks as follows:

In the contrast analysis, there are several options for specifying which levels to compare to each other. For instance, the “simple” contrast refers to comparing each level of the factor to its first level. In order to recreate the post hoc table, we need some more flexibility, and so can use the new “custom” contrast. Selecting this option brings up the following menu, where each level of the factor is listed, along with the possibility to specify rows of contrast coefficients:
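To make the idea of rows of contrast coefficients concrete, here is a minimal Python sketch. Each row holds one weight per factor level, and the contrast estimate is the weighted sum of the marginal means. The three rows below are the standard way of writing all pairwise comparisons among three levels. The marginal means are hypothetical placeholder values, not the estimates from the JASP analysis, and the function name is made up for this illustration.

```python
# Hypothetical marginal means for the three Charisma levels (illustrative only).
marginal_means = {"High": 80.0, "Some": 70.0, "None": 50.0}
levels = list(marginal_means)
means = [marginal_means[level] for level in levels]

# One row of coefficients per comparison; each row sums to zero.
custom_contrasts = [
    (1, -1, 0),   # High vs. Some
    (1, 0, -1),   # High vs. None
    (0, 1, -1),   # Some vs. None
]


def contrast_estimate(weights, level_means):
    """Weighted sum of the marginal means for one row of coefficients."""
    return sum(w * m for w, m in zip(weights, level_means))


estimates = [contrast_estimate(row, means) for row in custom_contrasts]
print(estimates)  # [10.0, 30.0, 20.0]
```

Other rows are just as valid: for example, (1, -0.5, -0.5) would compare High Charisma against the average of the other two levels.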

This results in the following contrast analysis table:²

Comparing the two tables shows where the differences lie between contrast and post hoc analysis. The estimates of the differences in marginal means are the same, as are their standard errors and t-statistics. However, both the p-values and confidence intervals vary: the corrected p-values are typically higher, and the confidence intervals wider, for the post hoc analysis. This brings us back to the scientific difference between these analyses: if you have comparisons that are governed by theory, and are planned before seeing the data, the hypothesis test is fine as is. If these comparisons are explored after seeing the data, more conservative testing is needed (e.g., the Bonferroni and Holm methods), because it is very easy to just throw all possible comparisons into your favorite software and fish for significant findings. In case you want to read more about the need for multiplicity correction, here is a great article in which researchers found neural correlates when placing a dead salmon in an fMRI scanner.
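For readers curious about what the Bonferroni and Holm methods actually do to a set of p-values, here is a minimal Python sketch. The raw p-values are hypothetical and the function names are made up for this illustration. The code follows the textbook definitions of the two corrections and is not taken from the JASP or emmeans source.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Step-down correction: the smallest p-value is multiplied by m,
    the next smallest by m - 1, and so on, while keeping the adjusted
    values non-decreasing in the order of the sorted raw p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical uncorrected p-values for three pairwise comparisons.
raw_p = [0.01, 0.04, 0.03]

print([round(p, 4) for p in bonferroni(raw_p)])  # [0.03, 0.12, 0.09]
print([round(p, 4) for p in holm(raw_p)])        # [0.03, 0.06, 0.06]
```

Holm never produces a larger adjusted p-value than Bonferroni, which is why it is often preferred when the goal is simply to control the type 1 error rate across all comparisons.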

TLDR in 3 steps:

  • Marginal means are a great metric, governed by the specified model.
  • While post hoc and contrast analysis both compare marginal means, unplanned comparisons of marginal means require correction for multiplicity. This underscores the need for careful statistical planning and considering what your theory predicts, before starting the experiment.
  • Custom contrasts are in JASP 0.12, which gives greater flexibility in how you compare marginal means.


¹ In short, this means using more conservative p-values to compensate for the practice of conducting many (unplanned) comparisons, in order to keep the type 1 error rate (i.e., the rate of falsely rejecting the null hypothesis) in check.
² I am still not sure of the best way of displaying the contrast weights in the “comparison” column. Feel free to suggest alternatives on the JASP GitHub!


About the authors

Johnny van Doorn

Johnny van Doorn is a PhD candidate at the Psychological Methods department of the University of Amsterdam. At JASP, he is responsible for Bayesian nonparametric analyses.