How to Conduct a Classical Independent Sample T-Test in JASP and Interpret the Results

The independent samples t-test is used to assess whether the means of two populations are equal to each other. This blog post shows how to perform the classical independent samples t-test, i.e., the two-sample t-test, in JASP. For this demonstration we consider the following example.

Example on the Effect of Extra Reading Activities on Drp Scores

Moore et al. (2009, p. 451) describe an educator who believes that an eight-week period with specific extra reading activities in the classroom (on top of the normal curriculum) improves third-graders’ degree of reading power (henceforth, drp) scores.1

To investigate this claim, the null hypothesis that extra reading activities do not affect drp scores was put to the test.

More specifically, the hypothesis is concerned with two (imaginary) populations: A control population in which every third grader in the United States only follows the normal curriculum, and a treatment population in which every third grader in the United States receives the specific eight-week extra reading activities on top of the normal curriculum. Jointly the pupils of the control population are assumed to have a mean drp score of \mu_{1}, while the population mean drp score of the pupils in the treatment condition is denoted by \mu_{2}. The null hypothesis of no effect is operationalized as \mathcal{H}_{0}: \mu_{1} = \mu_{2}, which implies that the population mean drp score is the same regardless of whether the population of third graders receives the treatment or not.

To test the null hypothesis, the educator collected samples from each population: one classroom of 23 pupils in the control condition, and one of 21 pupils in the treatment condition. For each of these groups of pupils, a sample mean drp score can be calculated, denoted by \bar{y}_{1} for the control group and \bar{y}_{2} for the treatment group. Even if the null hypothesis is true, we cannot expect these sample means to be exactly equal to each other, because we only collected small samples from two larger populations. The independent samples t-test takes into account the uncertainty due to having only a limited number of participants by supposing normally distributed errors. Furthermore, to account for the imbalance due to the differing sample sizes, it is assumed that the two populations share a common variance (see the section below for a more elaborate discussion of the assumptions and how they can be checked).

Note: to follow along with the explanation, you can either download the dataset and follow the steps on your own or download the annotated JASP file to see exactly what was done in JASP.

Performing the Independent Samples t-Test in JASP

The input

To perform the two-sample t-test in JASP, load the data file, go to the Common analysis tab, select T-Tests, and then select Independent Samples T-Test. The dependent variable is the (continuous) variable we want to test, in this case the drp scores. In the Grouping Variables field we put the variable “groups”. This grouping variable indicates whether a drp score was measured in the control or the treatment condition. The standard test is Student’s t-test. The other two tests (Welch and Mann-Whitney) are discussed in the section on the assumptions below.

As the treatment is expected to yield higher drp scores, we test the null hypothesis of no effect with only one tail, which can be done by ticking “Group 1 < Group 2”. To figure out which group corresponds to which group number, click on Descriptives under Additional Statistics. In this case, the control condition is assigned to 1, while treatment is assigned the number 2. (You can edit the labels by clicking the header of the data set in data view, see this GIF for more details.)

Additionally, we have the option to include the Location parameter, the Effect size, Descriptives, a Descriptives plot and the (mysterious) Vovk-Sellke maximum p-ratio.

The output

Let’s take a look at the output.

The descriptives table provides information about the number of pupils per group, n_{1} = 23 and n_{2}=21, the sample means, \bar{y}_{1} = 41.52 and \bar{y}_{2}=51.48, the observed standard deviations s_{1}=17.15 and s_{2}=11.01, and the standard errors. The standard error is the observed standard deviation divided by the square root of the sample size. For instance, the standard error of the control group is given by s_{1}/\sqrt{ n_{1}} = 17.149 / \sqrt{23} = 3.576.
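For readers who want to verify this outside JASP, a minimal Python sketch of the standard error calculation (using the descriptives reported above):

```python
import math

# Standard error = observed standard deviation / sqrt(sample size),
# using the control-group descriptives reported in the text.
n1, s1 = 23, 17.149   # control group: sample size and standard deviation
se1 = s1 / math.sqrt(n1)
print(round(se1, 3))  # 3.576
```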

The test table provides the test statistics. Here you can find the t-value, the degrees of freedom, the p-value, the mean difference, the standard error of the difference, and Cohen’s d, which provides an estimate of the population effect size. The footnote verifies that we specified the one-tailed test correctly: we expect the mean drp score of the control population to be less than that of the treatment population.

From the t-test table we see that the t-statistic is -2.267. The formula below shows that the t-statistic can be interpreted as the difference between the observed means weighted by its precision (inverse of the pooled standard error):

(1)   \begin{align*} t = \frac{ \bar{y}_{1} - \bar{y}_{2} }{ s_{p} / \sqrt{n_{\delta}}} , \end{align*}

where n_{\delta} = 1/(1/n_{1} + 1/n_{2}) denotes the effective sample size, s_{p} the pooled standard deviation, and s_{p}^{2} the pooled variance defined as

(2)   \begin{align*} s_{p}^{2} = \frac{(n_{1} - 1) s_{1}^{2} + (n_{2} -1) s_{2}^{2}}{n_{1} + n_{2} - 2}. \end{align*}

The denominator of s_{p}^{2} is also known as the degrees of freedom. In this example we have 23+21-2=42, which corresponds to the output in the test table. With the formulas and the information provided by the descriptives table, we can also (up to rounding error) compute the observed t-statistic and Cohen’s d by hand. In this case, the observed mean difference is 41.52-51.48 = -9.96 (note the rounding error compared to the number reported in the test table). For the denominator of the t-statistic, we note that the observed pooled variance is given by

(3)   \begin{align*} s_{p}^{2} \approx  \frac{ 22 \times 17.15^2 + 20 \times 11.01^2}{23+21-2} = 211.79 \end{align*}

and therefore s_{p} \approx \sqrt{ 211.79} = 14.55. The effective sample size is n_{\delta} = 1/(1/23 + 1/21) =10.98, thus, \sqrt{n_{\delta}} = \sqrt{10.98} = 3.31 and the denominator of the t-statistic is, therefore, 14.55/3.31 = 4.40. Combining these results yields a rounded t-value of

(4)   \begin{align*} t \approx \frac{ - 9.96 }{ 4.40} = -2.26 . \end{align*}
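The hand calculation in Equations (1)–(4) can be reproduced in a few lines of Python (a sketch using the rounded descriptives from the table; carrying full precision through yields -2.27, matching the unrounded t-value of -2.267 in the test table):

```python
import math

# Descriptives from the JASP output (rounded to two decimals)
n1, n2 = 23, 21            # sample sizes
m1, m2 = 41.52, 51.48      # sample means
s1, s2 = 17.15, 11.01      # sample standard deviations

# Pooled variance, Equation (2)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = math.sqrt(sp2)                  # pooled standard deviation
n_delta = 1 / (1/n1 + 1/n2)          # effective sample size
t = (m1 - m2) / (sp / math.sqrt(n_delta))  # t-statistic, Equation (1)
print(round(sp2, 2), round(t, 2))    # 211.79 -2.27
```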

In this example, the experiment led to a one-tailed p-value of p=0.014. This means that if (1) the implicit normality assumption holds, and (2) it were true that extra reading activities do not affect the mean drp scores of third-graders, then the chance of randomly drawing a sample with a t-value of t=-2.267 or lower is 1.4%.
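This p-value is the area under the t-distribution with 42 degrees of freedom to the left of the observed t-value. Assuming SciPy is available, it can be recomputed outside JASP as:

```python
from scipy import stats

# One-tailed p-value: probability of a t-value of -2.267 or lower
# under the null hypothesis, with 42 degrees of freedom.
p = stats.t.cdf(-2.267, df=42)
print(round(p, 3))  # 0.014
```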

Note that this statement reasons from the population to the sample: it presupposes that the null hypothesis holds and that the population means are equal to each other. From this statement about the population, we derive implications for our observations, but also for more extreme (but not observed) data sets; for instance, data sets that lead to a t-value of, say, t=-3 or even t=-14.

Within psychology, p-values lower than 0.05 are often considered to be “statistically significant”, which is typically used to decide that there is an effect. Based on this convention, the null hypothesis of no effect is rejected, and it is decided that extra reading activities do lead to higher drp scores in the population of third-graders. Note that it is always a good idea to broaden one’s inferential perspective and consider information other than the p-value alone by complementing the analysis with, for instance, confidence intervals, effect size estimates, or a Bayesian reanalysis; see a previous blog post for further details.

If extra reading activities do indeed lead to higher drp scores, one might wonder how strong the effect is. The effect size in the population is defined as

(5)   \begin{align*} \delta = \frac{\mu_{1} - \mu_{2}}{\sigma}, \end{align*}

where \sigma is the unknown common standard deviation of both the control and treatment population. None of these quantities is observed, since we do not have full access to the whole populations. Instead, we replace the population quantities by their observed counterparts, resulting in Cohen’s d, which is used as an estimate of the unknown \delta, that is,

(6)   \begin{align*} d= \frac{\bar{y}_{1} - \bar{y}_{2}}{s_{p}}. \end{align*}

Reusing the calculations for the t-value shows that d = -0.68, which aligns with what is reported in the test table. There is no strict rule for interpreting Cohen’s d, but a rough guideline accompanied by some explanation can be found here. In our example, the effect size (d = -0.68) is indicative of a medium to large effect.
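A quick Python check of Equation (6), using the descriptives from the table and the pooled standard deviation from Equation (2):

```python
import math

# Cohen's d: observed mean difference divided by the pooled
# standard deviation, Equation (6).
n1, n2 = 23, 21
m1, m2 = 41.52, 51.48
s1, s2 = 17.15, 11.01
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m1 - m2) / sp
print(round(d, 2))  # -0.68
```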

Checking the Assumptions


The one-tailed Student’s p-value and the observed t-statistic of t=-2.267 depend on two assumptions: (1) the two data sets are normally distributed, and (2) the two populations share a common variance \sigma^{2}. To check these assumptions, tick the boxes Normality and Equality of variances under Assumption Checks. Normality is tested with the Shapiro-Wilk test and equality of variances is tested with Levene’s test.

For our example, both tests yield non-significant p-values. The p-values of the Shapiro-Wilk tests are computed under the assumption that the drp scores (in general, the dependent variable) grouped according to their condition are normally distributed. As both p-values are larger than 0.05, the hypothesis of normality cannot be rejected. Similarly, the p-value of Levene’s test assumes that the two population variances are the same, and since p > 0.05, the hypothesis of equal variances cannot be rejected. Note that the non-significant p-values do not provide evidence that the populations are actually normally distributed with equal variances; they only indicate that these assumptions cannot be rejected.
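The same two assumption checks can be run in Python with SciPy. In the sketch below, `control` and `treatment` are synthetic stand-ins for the per-group drp scores, not the actual dataset; note also that SciPy's Levene test defaults to median centering, so `center='mean'` is passed to mirror the classical mean-based version:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups of drp scores
# (loosely matched to the reported descriptives; NOT the real data).
rng = np.random.default_rng(0)
control = rng.normal(41.5, 17.1, size=23)
treatment = rng.normal(51.5, 11.0, size=21)

# Shapiro-Wilk test for normality, per group
sw_control = stats.shapiro(control)
sw_treatment = stats.shapiro(treatment)

# Levene's test for equality of variances (mean-centered variant)
lev = stats.levene(control, treatment, center='mean')
print(sw_control.pvalue, sw_treatment.pvalue, lev.pvalue)
```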

If Levene’s test yields statistical significance while the Shapiro-Wilk test does not, or if it is known in advance that the data sets are normally distributed with unequal variances, one can test the null hypothesis of equal means, that is, \mathcal{H}_{0}: \mu_{1} = \mu_{2}, using the Welch test.

On the other hand, if the Shapiro-Wilk test yields statistical significance, or if it is known in advance that the drp scores (in general, the dependent variable) grouped according to their conditions are not normally distributed, one can conduct a Mann-Whitney U test instead. The null hypothesis of the Mann-Whitney U test states that the location parameters of the two populations are the same, regardless of the underlying distribution. These tests can be selected in the input window, and it can be prudent to execute them regardless, as a way to check the robustness of the results. In addition to these tests, it is always recommended to visualize the data, as can be done using the options available in the Descriptives menu on the JASP ribbon.
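Both robustness alternatives are also available in SciPy. As before, `control` and `treatment` below are synthetic stand-ins for the per-group drp scores, and `alternative='less'` encodes the one-tailed hypothesis that the control mean (or location) is lower:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (NOT the real data).
rng = np.random.default_rng(1)
control = rng.normal(41.5, 17.1, size=23)
treatment = rng.normal(51.5, 11.0, size=21)

# Welch's t-test: drops the equal-variance assumption
welch = stats.ttest_ind(control, treatment,
                        equal_var=False, alternative='less')

# Mann-Whitney U test: drops the normality assumption
mw = stats.mannwhitneyu(control, treatment, alternative='less')
print(welch.pvalue, mw.pvalue)
```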


 


Footnotes

1 Adapted from Maribeth C. Schmitt, “The effects of an elaborated directed reading activity on the metacomprehension skills of third graders,” PhD thesis, Purdue University, 1987, according to Moore et al. (2009, p. N-10).

References

Moore, D. S., McCabe, G. P., & Craig, B. A. (2012). Introduction to the practice of statistics (7th international ed.). New York: Freeman.

About the authors

Alexander Ly

Alexander Ly is the CTO of JASP and responsible for guiding JASP’s scientific and technological strategy as well as the development of some Bayesian tests.

Lotte Kehrer

Lotte Kehrer is a student at the Department of Psychological Methods at the University of Amsterdam. She is contributing to the blog, YouTube channel and manual of JASP.