This is a guest post by Dustin Fife, responsible for the Visual Modeling module in JASP.
Years ago when I worked as a biostatistician, I was assigned to analyze the data for a local luminary in the field of Muscular Sclerosis. This analysis would lead to a conference submission, at least, and likely a publication. The man provided me a longitudinal dataset in “wide” format, where each measurement occasion was listed in a separate column. To analyze the data, I had to convert it to long format (where each measure was in a single column and a new column was created that indicated on which occasion that measurement occured).
Unfortunately, I messed up. Without going into details, let’s just say I accidentally combined two different variables into a single variable. These variables had very different scales (e.g., one variable ranged from zero to 100, while another ranged from zero to one). When I analyzed the data, there happened to be more measures of the zero to 100 variable in the treatment group than in the control group, which meant that the mean of the treatment group was much higher than the control. When I computed p-values and effect sizes, they were massive, yet their values were completely meaningless.
But I didn’t know it. I submitted my results to the luminary. Fortunately one of his research assistants caught my error only hours before we submitted the manuscript. When reanalyzed correctly, there were no detectible differences between the two groups.
That was a humiliating experience for me, one that I swore I would never repeat. But how? How could I avoid publishing nonsense?
Around that time, I discovered the following quote:
“As soon as you have collected your data, before you compute any statistics, look at your data…if you assess hypotheses without examining your data, you risk publishing nonsense” (Wilkinson and the APA Task Force on Statistical Inference, 1999, p. 597).
Had I visualized my data, I would have saved myself the trouble and the embarrassment of messing up so horribly.
But, you live and you learn. And, fortunately for me, that single experience shaped my approach to data analysis. And, now that I occupy the role that I do (Assistant Professor teaching statistics classes), others can benefit from my mistake. And, perhaps you can benefit too.
In this post, I am going to introduce the Visual Modeling module, which is powered by Flexplot, an R package I have spent the last two years developing. This module makes it easy for users to visualize statistical models (and save themselves from publishing nonsense).
General Philosophy of Visual Modeling
Suppose that you had two options for lunch. One venue is a block away, but serves subpar food. The second venue, on the other hand, serves delicious food at a reasonable price. It, however, requires you to scale a steep hill, wade through a chest-deep water canal, and climb a sheer cliff just to arrive at the location.
Which would you choose?
You might choose the second venue on special occasions. But the effort required is too much.
It’s human nature. The more effort an activity requires, the less likely we will be to engage in that activity.
So it goes with plotting. Very few, I think, find plotting useless. Quite the opposite. Visuals allow rapid encoding of information and provide an aesthetic representation of our data. Plots are like the second venue: we know we want them, yet the effort required to produce sound, aesthetically appealing graphics in traditional software can be massive.
The Visual Modeling module, on the other hand, creates aesthetically-appealing visuals that follow empirically-supported heuristics Additionally, these visuals require less effort than performing a t-test.
The General Linear Model Approach
Most students who take a graduate course in statistics very quickly learn that ANOVAs, t-tests, and regressions are really just different expressions of the general linear model. If we dummy-code the groups in a t-test, the intercept is simply the mean of one group and the slope is the difference between the two. In other words, the seemingly critical distinction between categorical versus numeric predictors is actually not all that important. We only need to specify which variable(s) are the predictor(s) and which is the outcome. The computer then handles the computation in the background. This means we can simplify analyses immensely by removing the unnecessary distinctions between various types of tests.
Likewise, when it comes to plotting, shouldn’t we be able to have such flexibility? Shouldn’t the software be able to decide how to plot categorical predictors versus numeric predictors?
That is the responsibility of the Visual Modeling module. With that introduction, let me now introduce you to the four options in the Visual Modeling module: Flexplot, Linear Modeling, Mixed Modeling, and Generalized Linear Modeling.
The purpose of the Flexplot sub-module is to provide a flexible interface dedicated to plotting. This submodule allows for univariate plots (histograms and barcharts), bivariate plots (scatterplots and beeswarm plots), and various multivariate graphics.
To display a histogram or barchart, the user simply needs to specify the variable they wish to plot. If the variable is numeric, it will plot a histogram. If it is categorical, it will plot a barchart.
Under the option, the user has some control over how the results are displayed by choosing different themes. It defaults to the JASP theme, but the user can also choose “Black and white,” “Minimal,” “Classic,” and “Dark”.
For categorical variables, Flexplot defaults to sorting the x-axis by sample size. This wouldn’t make sense, however, if the variable is ordinal, as in the figure below.
In the previous section, we saw that Flexplot is smart enough to display a barchart for categorical variables and a histogram for numeric variables. Likewise, Flexplot is smart enough to figure out what type of bivariate graphic to display, depending on whether the user chooses numeric or categorical predictors (and/or outcomes).
Numeric Predictor, Numeric Outcome
If the user specifies a numeric predictor and outcome, Flexplot will display a scatterplot. It also defaults to showing a nonparametric loess line and displaying a 95% confidence band. These can be changed, of course, as in the gif below. The user can choose a regression, quadratic, or cubic line. The user can also make the datapoints more or less transparent, which will be more important later with multivariate plotting.
Categorical Predictor, Numeric Outcome
There are many different ways to plot these sorts of relationships, some good (e.g., violin plot, gradient plot, raincloud plot), some bad (e.g., bar plots of means), and some mediocre (e.g., standard error plots). When plotting categorical predictors against numeric outcomes, Flexplot utilizes beeswarm plots. It does so because it conveys a lot of information concisely. Beeswarm plots show sample sizes (via the dots), density (via the width of the dots), central tendency (via the solid red dot), and spread (via the “whiskers”), all in one graphic. To produce one of these graphics, the user only needs to include a categorical predictor in the “Independent Variable(s)” box.
As before, the user can reduce the opacity of datapoints and change the theme. There are also additional options. They can choose the amount of “jittering,” or width of the datapoints in the beeswarm plot, in either the X direction, or the Y direction, though the Y jittering is rarely used. (I tend to jitter Y when the outcome variable has a discrete scale, like a likert scale from 1-7).
Also, the user can specify what summary statistics are displayed. It defaults to medians (center red dots) and 25/75 percentiles (lower and upper whiskers, respectively). The user can instead specify standard errors or standard deviations (with means as the center). In both cases, the width of the whiskers is 1 × the standard deviation/standard error. (In other words, they are not 95% confidence intervals).
Categorical Predictor, Categorical Outcome
When modeling a categorical predictor/outcome, most analysts utilize a test, which tests whether the observed cell frequencies differ from that which is expected. When graphing this, Flexplot displays deviations from expected proportions as barcharts; bars that are above the horizontal line indicate situations where the observed frequency was higher than what is expected. Bars lower indicate situations where the observed frequency was lower than expected. These, I think, are far more intuitive to interpret than simple bar charts with frequencies on the Y axis.
Numeric Predictor, Categorical Outcome
The Flexplot submodule actually does not display these very well. The most appropriate graphic for this situation would represent the relationship using an ogive function, much like a logistic regression does. To display these data, I recommend the user instead use the Generalized Linear Modeling submodule, which I will illustrate momentarily.
When the user specifies multiple independent variables, Flexplot follows a few simple rules. First, whatever variable comes first is displayed on the X axis. If a numeric variable comes first, Flexplot will display a scatterplot (or a collection of scatterplots). If a categorical variable comes first, Flexplot will display a beeswarm plots (or a collection of them). The second rule is that the second variable in the Independent Variable(s) box will be displayed as separate lines/colors/symbols. This means that numeric variables will first be categorized (binned), then displayed as different lines/colors/symbols.
The third box (labeled “Paneled Variables(s)”) controls, not surprisingly, which variables are paneled (and how). The first variable entered will be binned (if numeric) and paneled in columns, while the second variable will be paneled in rows. As before, the user can specify the type of line drawn, the opacity of datapoints, whether a confidence band is displayed, jittering, etc.
Flexplot is fantastic in its ability to display multiple variables at once, through the use of colors/symbols/lines, as well as paneling. However, the more variables displayed, the more cognitive load increases. To combat this, Flexplot utilizes “Ghost lines,” which make it easy to compare across panels. Ghost lines simply repeat the fit from one panel to another. For example, in the image below, the ghost line repeats the panel from the Social Worker panel, making it much easier to see that Social Workers make slightly less money than those in the “Other” category.
Linear Modeling Submodule
The Flexplot submodule is dedicated to producing graphics. As such, it allows a great deal of control in how one creates graphics. However, it does not produce any statistical estimates (such as means, mean differences, Bayes Factors, etc.). The Linear Modeling submodule, on the other hand, does produce statistics, but also makes it seamless to generate graphics alongside statistical estimates.
One of my guiding rules for data analysis is that every statistical model must have a graphic that represents the model. In the Linear Modeling submodule, these graphics are produced automatically when one engages in statistical modeling.
As with the Flexplot module, the user only needs to specify the predictor(s) and the outcome and the software will handle the rest. By default, the module will produce a visual of the model (chosen automatically) and report slopes and intercepts (for numeric predictors) and/or means and mean differences. However, the user has a great deal of control over what is displayed.
Model Terms/Visual Modeling
These two sections work hand-in-hand, at least when modeling nonlinear terms, like quadratic or cubic terms. If the user wishes to model a quadratic term, they would select “Quadratic” from the Visual Fitting section. However, simply stating the model is quadratic is not enough; the user must also specify which terms are quadratic. This is specified in the Model Terms section. Likewise, if a user specifies that a term has a polynomial, the software will not actually fit any polynomial terms until the user has specified either cubic or quadratic in the Visual Fitting section.
In short, you need both. To illustrate, see the GIF below.
The Model Terms section also allows the user to specify interaction effects in the same way as other modules in JASP.
Whether specifying a polynomial or an interaction term, both the visuals and the estimates will reflect that.
I am no fan of significance testing and I designed my software to make it impossible to compute a p-value. 🙂
Instead, my module focuses on graphical interpretation and the interpretation of effect sizes. As such, the options in here related to visuals and estimation.
There are two subsections within the Results Displays section. The first is plot, which has the following options:
- checking “Univariate” will produce histograms and/or barcharts for each variable in the model.
- checking “Diagnostics” will produce a histogram of the residuals, a residual dependence plot, and a scale-location plot. These make it easier to assess the assumptions of normality, linearity, and homoscedasticity, respectively.
- checking “Added variable plot” will plot the relationship between the last variable entered (in the GIF below, the “muscle.gain” variable), and the residuals of a model that includes all the other predictors.
Estimates controls which estimates are displayed:
- Show model comparisons: Checking this box will show nested model comparison metrics for each of the predictor variables. This will create a new table that shows semi-partial , semi-partial Bayes Factors, and the inteverted Bayes Factors. Recall that the semi-partial shows how the increases when that particular variable is added to the model. The Bayes Factors compare the full model (i.e., the model with all the predictors) with a model that removes that one variable (or term). Values near one indicate removing that variable has little effect. Values far from one indicate that particular term should not be dropped from the model.
- Report means: Checking this box will display the means of the outcome variable for each level of the grouping variables.
- Show mean differences: Checking this box will display the mean differences between levels the grouping variables. Cohen’s d will also be reported.
- Show slopes/intercepts: Checking this box will report the slope for each numeric variable as well as the intercept.
- Show 95% intervals: Checking this box will report the 95% confidence interval for each estimate.
While the Flexplot submodule is more flexible than the Linear Modeling submodule, the user still has some control over the graphics. As with the Flexplot submodule, the user can specify themes, ghost lines, jittering, and point transparency.
Mixed Modeling Submodule
The Mixed Modeling submodule behaves very similarly to the Linear Modeling Module; the user specifies variables then Flexplot will automatically generate a graphic of the model. However, mixed models allow for the estimation of both random and fixed effects. For example, if we have clustered data (e.g., students nested within different schools), we need to inform the computer of this clustering. That’s where the “Random” box comes in. The user would state that School ID (for example) is the Random variable, then modeling is performed similar to before.
The Model Builder section, however, is slightly different. Rather than having checkboxes for polynomial terms, we have checkboxes for random effects. Those familiar with mixed modeling may recall that every term in a model can be modeled as a fixed effect or as a random effect. Specifying a random effect means that each cluster (e.g., School) may have its own parameter. For example, if we specify that SES is a random effect, that means that each school has a unique slope (from SES to Math). On the other hand, specifying that as a fixed effect means that all schools have the exact same slope. In this module, all intercepts automatically have random effects.
Similar to the Linear Modeling submodule, we can specify whether univariate distributions/diagnostics are displayed. Also, we can specify whether the results display fixed and/or random effects. Finally, the user has some control in how the model graphic is displayed.
Generalized Linear Modeling Submodule
Most statistical models make the assumption that the residuals are normally distributed. That assumption can be problematic in certain situations. For example, if we are modeling a dichotomous outcome and/or a count variable, the standard assumptions will almost surely be violated. In these situations, we can instead tell the computer that the residuals are not normally distributed, but instead that they follow a Poisson distribution, for example.
The “Distribution family” option allows the user to specify how the residuals are distributed. “Logistic” is to perform a logistic regression (for a binary outcome). The link function used is a logit link. “Poisson” and “Negative binomial” are for count data and both utilize a log. (In future iterations of Visual Modeling, I hope to allow the user to choose different link functions). “Gamma” is for continuous, positive skewed distributions, like reaction time and has a inverse link function.
As with the other submodules, we can specify interactions in the “Interaction Terms” section, we can control how plots are displayed in the “Plot Controls” section, and we can ask for univariate plots in the “Results Displays” section. We cannot display residuals, unfortunately, because interpreting residuals for generalized linear models is much less straightforward, though future implementations of the Visual Modeling module should include some basic residual analyses.
The human brain has a very advanced visual pattern recognition system. It is, perhaps, our greatest strength and it would be a shame to persist in not using that to our advantages. The Visual Modeling module is a step in that direction. The four submodules should provide researchers with visualization tools for most statistical modeling situations. It is designed to pair visualization with statistical modeling, providing seamless integration between the two. By so doing, I hope it will prevent embarrassing mistakes and deepen insights into our data that we otherwise might have missed.