Generalized Linear Models (GLM) in JASP

It took a while, but finally, the frequentist Generalized Linear Model (GLM) has become available in JASP, as part of the Regression module! In this blog post, we give you a quick introduction to the idea behind GLM and the full functionality of this new JASP sub-module. We also show you how you can conduct a binomial regression analysis using this GLM sub-module. Finally, we discuss some potential future features.

Generalized Linear Models

A generalized linear model (GLM) is a flexible extension of ordinary linear regression. A widely used GLM is binary logistic regression, which had long been available as a stand-alone module in JASP.

Generally speaking, a GLM consists of a random component and a systematic component:

  • The random component specifies an appropriate probability distribution for the response variable. For example, a binomial distribution is appropriate for modelling proportions; in ordinary linear regression, the Gaussian distribution is used.
  • The systematic component specifies how the explanatory variables relate to the mean of the response. For example, in binary logistic regression, the logit link function is used to map the responses (i.e. probabilities) to the linear combination of predictors (i.e. linear predictor); in ordinary linear regression, the link function is the identity function.
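The logit link and its inverse can be sketched in a few lines of Python (a minimal illustration of the idea, not JASP code):

```python
import math

def logit(p):
    """Logit link: maps a probability p in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit (logistic function): maps a linear predictor back to (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# The link connects the mean of the response to the linear predictor:
# logit(mu) = b0 + b1 * x  <=>  mu = inv_logit(b0 + b1 * x)
```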

The table below presents the types of response data that the GLM sub-module can handle, together with the corresponding appropriate distributions (also called families) and links. The asterisk (*) indicates the canonical/default link function for each family.

These distributions/families and links are also available in the current GLM JASP module.

The new JASP GLM Module

You can find the GLM module under the Regression tab. Note that only the classical (a.k.a. frequentist) version is available for now.

Make sure to check out the help file by clicking on the “i” button. There, you can find more information about what each function does, what assumptions to look out for, and so on.

Input Variable Panel

As with most other JASP analyses, you will find the input panel where you can specify the variables of interest for the model, including the dependent (i.e. response or outcome) variable, covariates (i.e. quantitative variables) and factors (i.e. qualitative variables). In addition, you can specify weights for the model. You also need to select the family (i.e. the distribution for the response variable) and the link function.


Model Panel

In the Model panel, you can decide what model terms to include in the null model, to which the model of interest can be compared (e.g. using a likelihood-ratio test). You can also choose whether the model should include an intercept term.


Statistics Panel

In the Statistics panel, three tables are available.

  1. The default table (always printed), which compares the model of interest against the null model across different criteria (e.g. AIC, BIC).
  2. The model fit table, which provides goodness-of-fit information about the model of interest based on deviance or Pearson goodness-of-fit tests.
  3. The coefficients table, which provides parameter estimates, standard errors, test statistics, p-values and confidence intervals.


Diagnostics Panel

With the Diagnostics panel, different tools are available for you to examine model assumptions. For instance, there are various options to conduct visual analyses of residuals. You can also inspect potential outliers, influential cases and multicollinearity issues in the model.

Estimated Marginal Means and Contrast Analysis

In this panel, you can compute estimated marginal means for any predictor of interest. When the predictor is a factor, the estimated marginal means are computed for all the levels of the predictor. When the predictor is continuous, you can customise the levels at which the marginal means are estimated, by specifying the number of standard deviations from the mean. You can also conduct contrast analyses to, for example, compare two marginal means within a predictor. For more about this, check out the help file.
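As a rough illustration of what "marginal means at mean ± 1 SD of a continuous predictor" amounts to, the sketch below computes them by hand for a logistic model (the coefficients and predictor values are made up for illustration, not JASP output):

```python
import math
from statistics import mean, stdev

def inv_logit(eta):
    """Logistic function: maps a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical logistic model: logit(mu) = b0 + b1 * hours
b0, b1 = -4.0, 0.001          # illustrative coefficients, not real estimates
hours = [400, 1000, 2000, 3000, 4000]

m, s = mean(hours), stdev(hours)
levels = [m - s, m, m + s]    # 1 SD below the mean, the mean, 1 SD above
emms = [inv_logit(b0 + b1 * x) for x in levels]
```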

Advanced Options

This panel gives you the option to set a random seed. The default is 1. It is only necessary to set a random seed if you would like to obtain reproducible plots and tables that are based on quantile residuals in the Diagnostics Section. The reason is that the calculation of quantile residuals involves drawing random values from a distribution, which leads to different values of quantile residuals from one draw to another.

Tutorial: Binomial Regression


In this tutorial, we consider the turbines data set, which concerns the proportion of turbine wheels developing fissures. This data set can be downloaded via this link. It is also available via the GLMsData library in R (Dunn & Smyth, 2018). The original source is Nelson, W. (1982). Applied Life Data Analysis. Wiley, pp. 407-409.

There are three variables of interest:

  1. Proportion: the proportion of turbine wheels developing fissures.
  2. Turbines: the total number of turbine wheels.
  3. Hours: the number of hours the turbine wheel was run.

Model Specification

The idea is to model the relationship between the proportion of turbine wheels developing fissures and the number of run hours. To do so, we first select the Binomial family, which is an appropriate distribution for modelling proportions. We choose the (default) logit link here because it provides easy interpretation of model parameter estimates.

Next, we move the “Proportion” variable to the “Dependent variable” field, the “Hours” variable to the “Covariates” field, and “Turbines” to the “Total Number of Trials” field (which is named “Weights” when a different family is selected).

For the Model panel, we keep everything as it is by default. That is, we include an intercept term in the model and do not add any additional term to the null model.

For the Statistics panel, tick the following boxes.

Model Interpretation

Figure 1. Model summary, fit and coefficients.

Figure 1 shows three relevant tables that summarize the model.

The first table, “Model Summary”, compares the model of interest (H1), which includes a covariate (i.e. Hours), to the null model (H0), which contains only an intercept term. We can see that in terms of deviance, AIC and BIC, the H1 model is superior to the null model. The likelihood-ratio test also suggests that H1 is statistically significantly better than H0 (χ² = 102.34, p < 0.001).
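For a single added covariate (1 degree of freedom), the likelihood-ratio p-value follows directly from the drop in deviance; a minimal standard-library sketch (the two deviance values below are illustrative, chosen only so that their difference matches the reported 102.34):

```python
import math

def lr_test_1df(deviance_null, deviance_alt):
    """Likelihood-ratio test with 1 degree of freedom.

    The test statistic is the drop in deviance; its p-value under a
    chi-square(1) distribution equals erfc(sqrt(x2 / 2)), because a
    chi-square(1) variable is a squared standard normal.
    """
    x2 = deviance_null - deviance_alt
    p = math.erfc(math.sqrt(x2 / 2))
    return x2, p

x2, p = lr_test_1df(112.67, 10.33)  # deviance drop of 102.34
```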

The second table, “Model Fit”, compares the model of interest (H1) to the saturated model using Deviance and Pearson goodness-of-fit tests. Both test results (i.e. large p-values) suggest that our H1 model is not statistically significantly worse than the saturated model, which indicates good fit.

The third table, “Coefficients”, presents the values of our model parameter estimates, standard errors, test statistics (in this case, z), p-values, and 95% confidence intervals. We can see that the covariate, “Hours”, is statistically significantly predictive of the proportion of turbine wheels developing fissures. Specifically, for every additional hour a turbine wheel is run, the odds of the turbine wheel developing fissures increase by (exp(0.001) – 1)*100% = 0.1%. If we consider increasing Hours by 1000 h, the odds of a turbine developing fissures increase by (exp(0.001*1000) – 1)*100% = 171.83%. Note that we round the estimate for “Hours” from 9.992e-4 to 0.001 here for convenience.
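The odds-ratio arithmetic above is easy to verify directly (using the rounded coefficient 0.001, as in the text):

```python
import math

b = 0.001  # coefficient for Hours, rounded from 9.992e-4

# Percent change in the odds per additional hour:
pct_per_hour = (math.exp(b) - 1) * 100          # ~0.1%

# Percent change in the odds per additional 1000 hours:
pct_per_1000h = (math.exp(b * 1000) - 1) * 100  # ~171.8%
```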


Visual Analysis of Residuals

To identify structural problems in our model, we can plot the residuals against fitted values and residuals against each explanatory variable. Specifically, there are two important aspects to consider:

  • Trends: If any trends appear in these plots, it is an indication that the systematic component of the model can be improved. This could mean changing the link function, adding extra explanatory variables, or transforming the explanatory variables.
  • Constant variation: If the random component is correct (that is, the correct distribution is used), the variance of the points is approximately constant.

Therefore, the ideal plots should contain no patterns or trends.

Note that for discrete distributions like the binomial distribution, quantile residuals (as opposed to deviance and Pearson residuals) are encouraged, because they help to avoid distracting patterns in the residuals. However, as mentioned earlier, the calculation of quantile residuals involves drawing random values from a distribution, which leads to different values of quantile residuals from one draw to another. Therefore, to obtain the exact same results shown below, you need to use the same random seed (1) that is used here. To do so, go to the Advanced Options panel and tick the “Set seed” option:
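To make the role of the seed concrete, here is a sketch of how a randomized quantile residual for a binomial observation can be computed (standard library only; JASP's internal implementation may differ in detail):

```python
import random
from math import comb
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Binomial CDF: P(Y <= k) for Y ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def quantile_residual(y, n, p, rng):
    """Randomized quantile residual for a binomial count y.

    A uniform draw between F(y - 1) and F(y) is mapped through the
    standard normal inverse CDF, so the residual changes from draw
    to draw unless the random seed is fixed.
    """
    lo = binom_cdf(y - 1, n, p) if y > 0 else 0.0
    hi = binom_cdf(y, n, p)
    u = rng.uniform(lo, hi)
    return NormalDist().inv_cdf(u)

# Fixing the seed makes the residuals reproducible:
r1 = quantile_residual(3, 10, 0.3, random.Random(1))
r2 = quantile_residual(3, 10, 0.3, random.Random(1))
```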

Next, tick the following boxes under Quantile Residuals to ask for the relevant residual plots.

We can see that in both plots (Figure 2 and 3), the data points are spread out more or less randomly, without an apparent trend. This suggests a lack of evidence against our specification of the model.

Figure 2. Standardised quantile residuals vs. fitted values.

Figure 3. Standardised quantile residuals vs. hours (the predictor).

Note that with the residuals vs. fitted values plot (Figure 2), the x-axis is not on the original scale of the fitted values but transformed. The reason for using a transformation is to make the fitted values spread out more evenly horizontally so that trends in the plots are easier to detect. Such a transformation is called variance-stabilizing transformation (or the constant-information scale) and it is automatically applied in JASP.
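For the binomial family, this constant-information scale is the arcsine-square-root transformation (following Dunn & Smyth, 2018); a minimal sketch:

```python
import math

def ci_scale_binomial(mu):
    """Constant-information (variance-stabilizing) scale for binomial
    fitted proportions: the arcsine-square-root transformation."""
    return math.asin(math.sqrt(mu))

# Spreads fitted proportions near 0 and 1 further apart than the raw scale:
ticks = [ci_scale_binomial(mu) for mu in (0.01, 0.25, 0.5, 0.75, 0.99)]
```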

Furthermore, we can use a Q-Q plot of residuals to detect large residuals (see Figure 4). We see that there are no large residual outliers (or maybe one, in the upper right corner).

Figure 4. Q-Q plot of standardised quantile residuals.

To further examine the link function, we can plot the working responses against the linear predictor. If the link function is appropriate, then the plot should be roughly linear. If a noticeable curvature is apparent in the plot, then a different link function should be considered.

Figure 5. Working responses vs. linear predictor.

In Figure 5, we see a roughly linear trend, suggesting that the choice of the logit link function is appropriate. However, just to demonstrate what happens when an inappropriate link function is selected, we show in Figure 6 several diagnostic plots based on the cauchit link function, which is less appropriate for this data set.

Figure 6. Diagnostic plots based on the Cauchit link function.

We can see that in Figure 6, the left plot shows a curvilinear trend, indicating an inappropriate link function. The residuals vs. fitted values plot (in the middle) and the residuals vs. predictor plot (on the right) are also less random compared to when the logit link function was used.

Back to the original situation where the logit link function is used. To determine whether a covariate is included on the correct scale, we can use a partial residual plot, which should show a linear trend if the covariate is included on the correct scale. We can see in Figure 7 that this is more or less the case (only slightly curvilinear).

Figure 7. Partial residual plot.


Outliers

To detect potential outliers, we can ask for a table that shows the cases with the top n (default: 3) largest residuals. In this case, if we choose the threshold to be 3, there are no alarmingly large outliers that would require our attention.

Figure 8. Table of top n outliers based on quantile residuals.

Influential Cases

Influential cases are outliers with high leverages. We can ask for those cases based on different measures. If we select “Covariance Ratio”, we see that cases 4 and 10 (the numbers correspond to the row numbers in the original data set) are potentially influential cases that might require further investigation.

Figure 9. Influential cases table.


Multicollinearity

As our model contains only one explanatory variable, there is no risk of multicollinearity. However, in models with multiple predictors, we can request the tolerance and variance inflation factor (VIF) statistics to assess potential issues with multicollinearity.


Future Features

Currently, only the frequentist GLM is available. We aim to add its Bayesian counterpart in the near future. We also plan to include other GLMs, such as ordered logistic regression and multinomial logistic regression, in this frequentist GLM sub-module.

If you run into trouble (e.g. bugs, questions) using the GLM module, or if you think of more useful features to include, feel free to open an issue so that we can follow it up.



References

Dunn, P. K., & Smyth, G. K. (2018). Generalized linear models with examples in R. New York: Springer.

Dunn, P. K., & Smyth, G. K. (2018). GLMsData: Generalized Linear Model Data Sets. R package version 1.0.0.

About the authors

Qixiang Fang

Qixiang Fang is a PhD candidate at Utrecht University.

Šimon Kucharský

Šimon is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.