# Frequentist and Bayesian Equivalence Testing in JASP

This post demonstrates the Equivalence Testing Module, new in JASP 0.12. In traditional hypothesis testing, both frequentist and Bayesian, the null hypothesis is often specified as a point (i.e., there is no effect whatsoever in the population). Consequently, in very large samples, small but practically meaningless deviations from the point-null will lead to its rejection. In order to take into account the possibility that the null hypothesis may hold only approximately, “equivalence testing” involves the specification of an interval around the point-null. This interval contains all values that the researcher deems “too small to be meaningfully different from 0” (Morey & Rouder, 2011, p. 407).

## Example: Implicit Stereotyping

For this demonstration, we use the “implicit stereotyping” data set from Moon and Roeder (2014).1 The study tests the theory that implicit activation of a social identity can either facilitate or worsen performance on a task, depending upon the stereotype associated with the identity. The study by Moon and Roeder is a replication of Shih, Pittinsky, and Ambady (1999). The original authors hypothesized that Asian-American women perform better on a math test than a control group when they are primed with their Asian identity. The data from Moon and Roeder contain 53 women in the Asian-primed group and 48 in the control group (condition is coded as identity_salience). The dependent variable is accuracy (i.e., the number of items a participant answered correctly divided by the total number of items on a math test). To follow along, you can download the annotated JASP file. Below we first discuss the frequentist equivalence test and then turn to the Bayesian equivalence test.

## The Frequentist “Two One-Sided Tests” (TOST)

Let’s start by activating the equivalence testing module from the ‘+’ popup menu and then selecting the equivalence independent samples t-test. We drag our dependent variable accuracy to the Variables box and identity_salience to the Grouping Factor box. Then we have to define our equivalence region. As we are not interested in small effect sizes, we specify our equivalence region between -0.05 and 0.05 in Cohen’s d units.

The frequentist equivalence test is based on the two one-sided tests (TOST) procedure from the TOSTER R-package (Lakens, 2017). As the name implies, the procedure consists of two one-sided hypothesis tests: one test to check whether the effect size is smaller than or equal to the set lower bound, L, and one test to examine whether the effect size is larger than or equal to the upper bound, U. Only if both hypotheses are rejected can we reject the hypothesis of non-equivalence (i.e., that the conditions are practically different from each other). This procedure yields the same result as checking whether the 90% two-sided confidence interval (for an α of 0.05) falls entirely within the set equivalence region. This visual check can be performed in JASP by selecting the “equivalence bounds plot” under “Additional Statistics”.
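Why a 90% interval rather than a 95% one? As a brief sketch, writing δ for the effect size (a symbol introduced here for illustration) and assuming the confidence interval is built from the same t statistic as the two one-sided tests, each test is run at level α, so the matching two-sided interval leaves α in each tail:

$$
\text{reject } H_0^{L}\!: \delta \le L \ \text{ and } \ H_0^{U}\!: \delta \ge U \ \text{ at level } \alpha
\quad \Longleftrightarrow \quad
\text{CI}_{1 - 2\alpha} \subset (L,\, U), \qquad 1 - 2\alpha = 1 - 2(0.05) = 0.90 .
$$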

Let’s take a look at the output for the Moon and Roeder data:

From the equivalence independent samples t-test table we can see that both the lower bound test (t(99) = -0.826, p = 0.795) and the upper bound test (t(99) = -1.328, p = 0.094) yield a non-significant result. The equivalence bounds plot in Figure 1 confirms that the 90% confidence interval exceeds both the lower and the upper equivalence bound. Thus, we are not entitled to reject non-equivalence, that is, we have to remain open to the possibility that there is a meaningful difference between the control group and the Asian-primed group. However, as the classical two-sided t-test is non-significant, we also cannot reject the null hypothesis that the groups are equal. Therefore, we have to state that there is insufficient data to draw strong conclusions.
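The two reported t statistics are tied together by the equivalence bounds, which can be checked by hand. The sketch below is not the module's code; it assumes a pooled-variance (Student) t-test, which is consistent with the reported 99 degrees of freedom, and that bounds in Cohen's d units are converted to the t scale by dividing by sqrt(1/n1 + 1/n2). The variable names are ours.

```python
import math

# Group sizes and equivalence bounds as given in the post
n_asian, n_control = 53, 48
d_lower, d_upper = -0.05, 0.05           # bounds in Cohen's d units

# Assumption: pooled-variance t-test, so t = d / sqrt(1/n1 + 1/n2)
se_factor = math.sqrt(1 / n_asian + 1 / n_control)

# Lower-bound test statistic as reported in the JASP table
t_lower_reported = -0.826

# Invert t_L = (d - L) / se_factor to recover the observed standardized difference
d_obs = t_lower_reported * se_factor + d_lower

def tost_t(d, bound, se):
    """t statistic for testing the observed effect d against a bound."""
    return (d - bound) / se

t_upper = tost_t(d_obs, d_upper, se_factor)    # should match the reported upper-bound test
t_classic = tost_t(d_obs, 0.0, se_factor)      # implied classical test against zero

print(f"implied Cohen's d: {d_obs:.3f}")
print(f"upper-bound t:     {t_upper:.3f}")     # the post reports -1.328
print(f"classical t:       {t_classic:.3f}")
```

Under these assumptions, the upper-bound statistic recovered from the lower-bound one agrees with the value in the table, and the classical t statistic sits midway between the two because the bounds are symmetric around zero.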

Figure 1. The equivalence bounds plot displays the 90% confidence interval of the mean difference in accuracy between the Asian-primed group and the control group. The user-defined equivalence region is displayed as a grey area. The mean difference in accuracy is -0.037. As positive scores indicate that the Asian-primed condition performs better and negative scores mean that the control condition performs better, these results are inconsistent with the stereotyping hypothesis.

## Bayesian Interval-Null Testing

Now we perform the Bayesian version of the equivalence independent samples t-test. The left panel of Figure 2 shows the JASP input options that we specified for this analysis.

Figure 2. JASP screenshot of the Bayesian equivalence test.

In the right panel of Figure 2, the first table presents the default output and shows four Bayes factors, one per row, quantifying evidence for and against several hypotheses. These hypotheses are schematically represented in Figure 3.

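Figure 3. Overlapping and non-overlapping hypotheses. $\mathcal{H}_1$ is defined by an unrestricted prior distribution for effect size. A restricted version of $\mathcal{H}_1$ can be obtained by bounding effect size so that it either falls inside an equivalence interval I (top model, $\mathcal{H}_{\delta \in I}$) or outside that interval (right model, $\mathcal{H}_{\delta \notin I}$). Comparisons of $\mathcal{H}_{\delta \in I}$ and $\mathcal{H}_{\delta \notin I}$ with $\mathcal{H}_1$ concern overlapping hypotheses, whereas the comparison between $\mathcal{H}_{\delta \in I}$ and $\mathcal{H}_{\delta \notin I}$ concerns non-overlapping hypotheses.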

Thus, the overlapping-hypothesis (OH) Bayes factor compares the interval-null hypothesis against the unconstrained alternative hypothesis. The OH test addresses the question “to what degree do the data support the restriction of the parameter space to the equivalence region?” The non-overlapping-hypothesis (NOH) Bayes factor compares the interval-null hypothesis against the alternative hypothesis where the effect size is constrained to fall outside the equivalence region. The NOH test addresses the question “to what degree do the data support the proposition that the parameter lies inside versus outside the equivalence region?”

Returning to the JASP output table from Figure 2, the top two rows of the table display the Bayes factors for the OH test, and the bottom two rows show the Bayes factors for the NOH test (for this example we use the default Cauchy prior with a scale of 0.707). This table shows that the OH Bayes factor in favor of the interval-null is 2.847, and the NOH Bayes factor in favor of the interval-null is 3.118. Therefore, both Bayes factors show weak evidence that the effect size falls in the equivalence region, and that the groups are not meaningfully different from each other. When the interval-null has a prior probability of 0.50, a Bayes factor of 3 results in a posterior probability of 0.75, leaving a probability of 0.25 for the alternative hypothesis.

In general, the Bayes factor quantifies the shift from prior to posterior mass inside the equivalence region (e.g., Klugkist et al., 2005). When the mass inside the equivalence region has increased from prior to posterior, then the Bayes factor will show evidence in favor of a parameter estimate inside the equivalence region; when the mass inside the equivalence region has decreased from prior to posterior, then the Bayes factor will show evidence against a parameter estimate inside the equivalence region. By selecting “Prior and posterior mass” under “Additional Statistics”, a table containing the prior and posterior mass is displayed and with this information we can reproduce the Bayes factors. The OH Bayes factor in favor of the interval-null, $\text{BF}_{\in 1}$ (effect size δ inside the interval I versus the unrestricted $\mathcal{H}_1$), equals the posterior mass divided by the prior mass inside the equivalence interval:

$$
\text{BF}_{\in 1} = \frac{p(\delta \in I \mid \text{data})}{p(\delta \in I)} \approx 2.844
$$

Note that due to rounding errors, the calculated Bayes factor is slightly off from the true Bayes factor of 2.847.2 The OH Bayes factor quantifying evidence that the parameter lies outside the interval, $\text{BF}_{\notin 1}$, can be reproduced in the same manner, but now by dividing the posterior by the prior mass outside the interval:

$$
\text{BF}_{\notin 1} = \frac{p(\delta \notin I \mid \text{data})}{p(\delta \notin I)} \approx 0.913
$$

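We need these two OH Bayes factors to calculate the NOH Bayes factors. The NOH Bayes factor quantifying evidence for the interval-null, $\text{BF}_{\in \notin}$, is calculated by dividing the OH Bayes factor for the inside of the interval, $\text{BF}_{\in 1}$, by the OH Bayes factor for the outside of the interval, $\text{BF}_{\notin 1}$:

$$
\text{BF}_{\in \notin} = \frac{\text{BF}_{\in 1}}{\text{BF}_{\notin 1}} \approx \frac{2.844}{0.913} \approx 3.115
$$

Up to rounding, this matches the value of 3.118 reported in the table.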

The Bayes factor favoring the hypothesis that the parameter lies outside the interval, $\text{BF}_{\notin \in}$, is calculated by reversing the numerator and the denominator, namely 0.913 / 2.844 = 0.321.
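The chain of calculations above can be retraced without rounded table entries. The following sketch is not the module's implementation; it assumes the prior on effect size is a zero-centered Cauchy with scale 0.707 (as stated in the post), computes the prior mass inside the interval analytically, and takes the reported OH Bayes factor of 2.847 as given in order to back out the implied posterior mass. All function and variable names are ours.

```python
import math

def cauchy_mass(lower, upper, scale):
    """Mass that a zero-centered Cauchy(0, scale) places on [lower, upper]."""
    def cdf(x):
        return 0.5 + math.atan(x / scale) / math.pi
    return cdf(upper) - cdf(lower)

# Equivalence interval and prior scale from the post
prior_in = cauchy_mass(-0.05, 0.05, scale=0.707)

# Reported OH Bayes factor (interval-null vs. unrestricted alternative)
bf_in_vs_h1 = 2.847

# BF = posterior mass / prior mass, so the implied posterior mass inside I is:
post_in = bf_in_vs_h1 * prior_in

# Same ratio for the complement of the interval
bf_out_vs_h1 = (1 - post_in) / (1 - prior_in)

# Non-overlapping Bayes factors follow as ratios of the overlapping ones
bf_in_vs_out = bf_in_vs_h1 / bf_out_vs_h1
bf_out_vs_in = 1 / bf_in_vs_out

print(f"prior mass inside I:      {prior_in:.3f}")
print(f"implied posterior mass:   {post_in:.3f}")
print(f"BF (outside vs. H1):      {bf_out_vs_h1:.3f}")
print(f"BF (inside vs. outside):  {bf_in_vs_out:.3f}")
print(f"BF (outside vs. inside):  {bf_out_vs_in:.3f}")
```

Working from the unrounded prior mass reproduces the table value of 3.118 for the NOH Bayes factor, which illustrates that the small discrepancy in the hand calculation comes from rounding the displayed masses.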

In the left panel of Figure 2, you can see that the “Prior and posterior” option is selected, which produces Figure 4. This figure confirms that the mass inside the interval (displayed as the grey area) is larger under the posterior distribution than it is under the prior distribution. Thus, as the Bayes factors already indicated, there is some (weak) evidence that the difference between the Asian-primed group and the control group is practically negligible.

Figure 4. The prior and posterior plot, where the grey area is the specified equivalence region.

## Directed Equivalence Testing

We have briefly explained the new equivalence test in JASP 0.12. However, throughout our entire analysis, we tested an undirected hypothesis, whereas our initial hypothesis was directed: according to the theory under scrutiny, Asian-American women primed with their Asian identity were expected to perform better on a math test than a control group. So instead of testing whether there is no meaningful difference, regardless of its direction, we would like to test whether or not participants in the Asian-identity condition perform better than participants in the control condition. But before we execute this analysis, let’s visualize the data:3

Figure 5. Boxplots including jitter element split on condition. Participants in the control condition have a higher average accuracy (0.496) than participants in the Asian-primed condition (0.459).

We see that participants in the control condition have a slightly higher average accuracy (0.496) than participants in the Asian-primed condition (0.459). Thus, the effect goes somewhat in the direction opposite to the one predicted by the theory. This outcome should undercut the alternative hypothesis more than if the data had fallen in the predicted direction.

Let’s recapitulate. Previously, we ignored the hypothesized direction of the effect and used the Bayes factor to compare the predictive performance of an interval-null hypothesis, $\mathcal{H}_{\delta \in I}$, either to an undirected alternative hypothesis that overlaps with $\mathcal{H}_{\delta \in I}$, or to an undirected alternative hypothesis that does not overlap with $\mathcal{H}_{\delta \in I}$. Now we wish to take into account the fact that, if the alternative hypothesis is true, the associated effect sizes are positive only; the test is sharpened. Here we only consider the overlapping hypothesis test, and so we wish to compare the predictive performance of the interval-null hypothesis $\mathcal{H}_{\delta \in I}$ against that of the positive-only alternative hypothesis $\mathcal{H}_+$. In other words, we seek $\text{BF}_{\in +}$. Currently, this Bayes factor cannot be obtained directly in the JASP module; however, it can be obtained indirectly, by exploiting the transitive nature of the Bayes factor (see also here). For $\text{BF}_{\in +}$ we have:

$$
\text{BF}_{\in +} = \text{BF}_{\in 1} \times \text{BF}_{1 +}
$$

On the right-hand side, the first factor we have already computed above as $\text{BF}_{\in 1}$ = 2.847. Now we need to compute $\text{BF}_{1 +}$, and this is easily available in the Equivalence Testing Module; we simply conduct another equivalence test and define the “interval-null” to range from 0 to positive infinity. This analysis (not shown here) produces $\text{BF}_{1 +}$ = 3.205. So we have weak-to-modest evidence (i.e., $\text{BF}_{\in 1}$ = 2.847) for the interval-null hypothesis $\mathcal{H}_{\delta \in I}$ over the unrestricted alternative hypothesis $\mathcal{H}_1$; in turn, we also have weak-to-modest evidence (i.e., $\text{BF}_{1 +}$ = 3.205) for the unrestricted alternative hypothesis $\mathcal{H}_1$ over the restricted alternative hypothesis $\mathcal{H}_+$. Thus, $\mathcal{H}_{\delta \in I}$ beats $\mathcal{H}_1$, which in turn beats $\mathcal{H}_+$. Hence, by transitivity the Bayes factor for the interval-null against the directed alternative hypothesis equals $\text{BF}_{\in +}$ = 2.847 × 3.205 = 9.124. In other words, the data are about 9 times more likely under the interval-null hypothesis than under the directed alternative hypothesis. If the interval-null hypothesis were given a prior probability of 0.50, these data would increase that to a posterior probability of about 0.90, leaving 0.10 for the directed alternative hypothesis. The data clearly support the interval-null but the directed alternative hypothesis is not out of the running.
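The transitivity step and the conversion from Bayes factor to posterior probability are simple enough to verify directly. The sketch below uses only the two Bayes factors reported in this post; the helper `posterior_probability` is a name we introduce for illustration and is not part of JASP.

```python
def posterior_probability(bayes_factor, prior_prob=0.5):
    """Posterior probability of the favored hypothesis, given a Bayes factor
    in its favor and its prior probability (two hypotheses only)."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

# Bayes factors reported in the post
bf_in_vs_h1 = 2.847      # interval-null vs. unrestricted alternative
bf_h1_vs_plus = 3.205    # unrestricted vs. positive-only alternative

# Transitivity: the unrestricted alternative cancels out of the product
bf_in_vs_plus = bf_in_vs_h1 * bf_h1_vs_plus

print(f"BF (interval-null vs. directed alternative): {bf_in_vs_plus:.3f}")
print(f"posterior probability of interval-null:      {posterior_probability(bf_in_vs_plus):.2f}")
```

The same helper also covers the earlier example in this post: with equal prior probabilities, a Bayes factor of 3 gives `posterior_probability(3)` equal to 0.75.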

On a final note, the Bayesian equivalence test has one other elegant property: as the region of equivalence gradually shrinks to a point, the corresponding Bayes factor gracefully reduces to the standard Bayes factor featuring a point-null hypothesis test. In contrast, tight equivalence bounds are problematic for the classical procedure, as the sample size needed to reject non-equivalence quickly becomes unrealistically large.
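The limiting behavior of the Bayesian test can be sketched as follows, assuming a symmetric interval (−ε, ε) around zero and prior and posterior densities for the effect size δ that are continuous at zero (the symbols ε and δ are introduced here for illustration). As the interval shrinks, the ratio of posterior to prior mass becomes a ratio of densities at the point-null:

$$
\lim_{\epsilon \to 0} \frac{\int_{-\epsilon}^{\epsilon} p(\delta \mid \text{data}) \, d\delta}{\int_{-\epsilon}^{\epsilon} p(\delta) \, d\delta}
= \frac{p(\delta = 0 \mid \text{data})}{p(\delta = 0)} .
$$

The right-hand side is the Savage–Dickey density ratio, a common way of expressing the Bayes factor for a point-null hypothesis against the unrestricted alternative.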

### Notes

1 The original study included another condition, but we will only use the Asian-primed and the control condition.

2 We could minimize the impact of rounding errors by increasing the number of displayed decimals: open “Preferences” in the left pop-up menu, go to “Results”, and adjust the setting under “table options”.

3 We visualized the data with the “Descriptive Statistics” option available in the “Descriptives” menu on the JASP ribbon. After dragging the two variables into the specification boxes, we can select the “Boxplot element” and “Jitter element” under “Boxplots” to produce Figure 5.

## References

Klugkist, I., Laudy, O., & Hoijtink, H. (2005). Inequality constrained analysis of variance: A Bayesian approach. Psychological Methods, 10, 477-493.

Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science. https://doi.org/10.1177/1948550617697177

Moon, A., & Roeder, S. S. (2014). A secondary replication attempt of stereotype susceptibility (Shih, Pittinsky, & Ambady, 1999). Social Psychology. https://doi.org/10.1027/1864-9335/a000193

Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods.

Shih, M., Pittinsky, T. L., & Ambady, N. (1999). Stereotype susceptibility: Identity salience and shifts in quantitative performance. Psychological Science. https://doi.org/10.1111/1467-9280.00111