*This is a guest post by Tom Faulkenberry (Tarleton State University). Click **here** to access the supplementary materials.*

Amid the COVID-19 pandemic, universities have needed to quickly adjust their traditional methods of instruction to allow for maximum flexibility. This means that professors have also had to think critically about how they can best deliver instruction in new formats. Ever the optimist, I decided to make the best of the situation and repackage my experience with teaching statistics during a pandemic into a lesson on how to do Bayesian linear regression in JASP.

Before jumping in, I’ll need to provide some background. At my university, we opted to follow the “HyFlex” model of instruction, where instructors teach their courses in a face-to-face format, but the lectures are simultaneously streamed online and recorded. This gives students three options for attendance — they can choose to attend (1) face-to-face; (2) remote synchronous; or (3) remote asynchronous. With the last two options, students are “attending” the course from a remote location, but they still must choose whether to log in and participate during the scheduled time of lecture (synchronous) or watch the pre-recorded lectures at a different time (asynchronous).

As one might imagine, this new freedom of choice afforded to thousands of our university students meant that our lecture halls quickly became quite empty. While a few intrepid souls regularly attended their face-to-face classes (proudly wearing their masks), many opted for remote attendance. Even more opted for *asynchronous* remote attendance…after all, if I can watch the lecture whenever I want, why watch at 8:00 in the morning? Our faculty and administration quickly picked up on this pattern and noticed that students weren’t performing as well as they should, especially among these asynchronous attenders. Clearly, we should act now to remove the asynchronous option for next semester…right?

Well, maybe, but I think we should collect some data first. So that’s exactly what I did. My aim in this blog post is to walk the reader through how I used Bayesian linear regression to answer the following question: **Do my students’ course grades depend on whether they attend lectures synchronously or asynchronously?**

**By the way, if you’re impatient, the answer is “no”. But that’s not the whole story, and if you just stop here, you’ll miss a rich discussion of lots of Bayesian concepts.**

Before moving forward, I need to provide an important disclosure. First, the data I’m about to share and report were not systematically collected with the purpose of confirming any specific hypotheses about the effects of attendance mode on course grade. Instead, my data are a convenience sample and my analyses are purely exploratory. Second, these data in no way reflect any patterns that might be observed across my university; these data come from a single statistics course that I taught. Nonetheless, the data might still teach us something.

OK, let’s talk about the data. I collected some course performance data from 33 students in my first-year statistics course. These anonymous data can be downloaded here. I included the final course grade (on a scale of 100 points) for each student. I also categorized each student as a *synchronous* student or an *asynchronous* student. I did this by counting the number of lectures attended by each student throughout the semester. Any student who attended at least 75% of lectures (either in person or remotely) was categorized as synchronous (sync = 1), and everyone else was categorized as asynchronous (sync = 0).
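The coding rule above is easy to express in code. Here is a minimal sketch; the function name and the total number of lectures (28) are hypothetical placeholders, since the post does not state how many lectures the course had — only the 75% cutoff comes from the text.

```python
# Sketch of the attendance-coding rule described above. The total number of
# lectures (28) is a made-up placeholder; only the 75% cutoff is from the post.
def code_sync(lectures_attended, total_lectures=28):
    """Return 1 (synchronous) if the student attended at least 75% of
    lectures, in person or remotely; otherwise 0 (asynchronous)."""
    return 1 if lectures_attended / total_lectures >= 0.75 else 0

print(code_sync(21))   # 21/28 = 0.75 -> 1 (synchronous)
print(code_sync(10))   # 10/28 < 0.75 -> 0 (asynchronous)
```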

Before going further, let’s consider that there’s probably quite a bit of difference in the course grades between synchronous and asynchronous attenders (that’s certainly the impression that our faculty and administration have gotten so far). Indeed, that seems to be the case with these data. If you click the “Descriptives” button, move grade to the “Variables” list, and split by sync (note that you’ll need to change sync to a nominal variable to do this), we get the table below:

As we can see, there is a 15-point advantage for the synchronous attenders (sync = 1) compared to the asynchronous attenders (sync = 0). But look at those standard deviations! There is much more variation among the asynchronous attenders, so clearly something else is going on.

Fortunately, I had some additional data that might explain some of this variability. In conversations with my students this semester, it became clear that some of my asynchronous students were not actually watching the recorded lecture videos. Since video viewing times for each student were available from our learning management system, I was able to figure out how many minutes of each recorded lecture video each student watched. From these data, I computed the average length of time that each student watched the lectures during the semester. This mean (standardized to a maximum of 75 minutes) is recorded in the variable avgView.

Thus, I now have two important variables that might contribute to some of the variability in course grades. One way to better understand this relationship is to perform a Bayesian linear regression, which we can easily do in JASP. The JASP file containing these analyses can be downloaded here.

## Performing a Bayesian linear regression

I would like to know the extent to which sync and avgView predict course grade. Bayesian linear regression lets us answer this question by integrating __hypothesis testing__ and __estimation__ into a single analysis. First, these two predictors give us *four* models that we can test against our observed data. Once we’ve chosen the best model (i.e., the one that best predicts the observed data), we can then use the models to __estimate__ the impact of each predictor.

Let’s now describe the four models:

Model 1: grade ~ sync + avgView

- This model hypothesizes that a student’s course grade is impacted both by their attendance (synchronous versus asynchronous) AND the average amount of time that the student spent watching the lectures.

Model 2: grade ~ sync

- Compared to Model 1, this model drops average viewing time as a predictor, and thus hypothesizes that course grade is impacted by attendance mode, but NOT the average lecture viewing time.

Model 3: grade ~ avgView

- Compared to Model 1, this model drops attendance mode as a predictor, and thus hypothesizes that course grade is impacted by average lecture viewing time, but NOT attendance mode.

Model 4: Null model

- This model hypothesizes that neither attendance mode nor average lecture viewing time predicts course grade.

Our first task is to determine which of these models is best supported by the observed data. In JASP, we click on the “Regression” button and select “Bayesian Linear Regression”. We’ll move grade into the “Dependent Variable” box, and we’ll move our two predictor variables sync and avgView into the “Covariates” box. Additionally, we’ll select “Posterior Summary” under “Output” and “Marginal posterior distributions” in the “Plots” menu (see the figure below). Finally, for ease of explanation in the next section, I selected “Uniform” under “Model Prior” in the “Advanced” menu.

With these options, JASP produces three main outputs — (1) a model comparison table; (2) a posterior summary table; and (3) plots of the marginal posterior distributions for each model coefficient. Let’s now discuss each of these:

## Output 1 – the model comparison table

The model comparison table tells us which of the four models displays the best *predictive adequacy* — that is, which model does the best job of predicting the observed data. By default, the models are listed in order from most predictive to least predictive. From the table below, we can see immediately that **our data are most likely under the model containing only average viewing time as a predictor**. Let’s take a closer look at why this is the case.

The first two columns are P(M) and P(M|data). P(M) denotes the *prior probability* of each model. Since we chose “Uniform” under “Model Prior” in the advanced options, each of these models is assumed to be equally likely before observing data. The column labeled P(M|data) contains the *posterior probability* of each model — that is, *after* observing data. The best fitting model, containing only avgView as a predictor, has a posterior probability of 0.746, whereas the next best fitting model (sync + avgView) has a posterior probability of 0.220. The remaining two models account for a combined posterior probability of 0.023 + 0.011 = 0.034 — these two models are not very likely at all.

The next two columns are BF_{M} and BF_{10}. As one might guess, these are both Bayes factors, but they are slightly different types. BF_{M} is a Bayes factor on the __model odds__ — that is, it is the factor by which the odds in favor of a specific model increase after observing data. Let’s work through an example to make this a bit clearer. Consider the best fitting model, containing only avgView. Before observing data, the odds in favor of this model are 1-to-3. We can see this by dividing the probability of the model (0.250) by the probability of *all other* models (0.250 + 0.250 + 0.250 = 0.750) — that is, 0.250 / 0.750 = 0.333 (after all, “odds” is just a ratio of two probabilities). How do these odds shift after observing data? Let’s compute the posterior odds for the avgView model: 0.746 / (0.220 + 0.023 + 0.011) = 2.937. If we divide these posterior odds (2.937) by the prior odds (0.333), we get the updating factor BF_{M} = 8.822. We interpret this number in the following way: **“After observing data, my odds in favor of the model containing only average viewing time as a predictor have increased by a factor of 8.822.”**
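This arithmetic is easy to reproduce. Here is a minimal sketch in Python, using the rounded model probabilities from the table; the small discrepancy from JASP’s reported 8.822 comes from that rounding.

```python
# Reproduce the BF_M calculation for the avgView-only model, using the
# (rounded) prior and posterior model probabilities from the table.
def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)

prior_p = 0.250                        # uniform prior over the four models
posterior_p = 0.746                    # P(M | data) for the avgView model

prior_odds = odds(prior_p)             # 0.250 / 0.750 ≈ 0.333
posterior_odds = odds(posterior_p)     # 0.746 / 0.254 ≈ 2.937
bf_m = posterior_odds / prior_odds     # ≈ 8.8 (JASP reports 8.822 unrounded)

print(round(bf_m, 2))
```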

On the other hand, BF_{10} gives the __relative predictive adequacy__ of the given model compared to the best fitting model. Notice that the second best fitting model (sync + avgView) has BF_{10} = 0.295. This means that the observed data are 0.295 times as likely to occur under this two-predictor model as they are under the avgView model. Taking reciprocals (1 / 0.295 = 3.389), we can interpret this more easily as: **“The observed data are 3.389 times more likely under the model containing only average viewing time as a predictor compared to the model that also specifies whether the student is a synchronous or asynchronous attender.”**

So what does this all mean? One conclusion we may draw is that average lecture viewing time is *clearly* a predictor of course grade, because the posterior probability of including it in the model is 0.746 + 0.220 = 0.966. Now our question shifts to the following: **Does it matter whether a student attends synchronously or asynchronously?** To answer this question, we need to compare the model containing both predictors to the model containing *only* average viewing time — we can use both of our obtained Bayes factors to make this comparison. First, only avgView received *increased* support after observing data (the model odds were increased by a factor of 8.822) — all other models received *decreased* support. Additionally, compared to sync + avgView (where attendance mode matters), the data are 3.389 times more likely under the single-predictor model avgView. So, does attendance mode matter? Given these data, **I would argue that the answer is no.** Instead, it is the average lecture viewing time that best predicts course grades.

Now that we’ve established that average lecture viewing time matters, the next step is to __estimate__ its impact. How much gain in course grade can I expect for each additional minute of average viewing time? To answer this, we need to look at the next outputs in JASP.

## Outputs 2 and 3 — the posterior summary table and marginal posterior distributions

The posterior summary table provides information about each possible predictor in the linear regression model. Here is the one from our analysis:

Roughly, the posterior summary table consists of two parts. The first part (all columns up to and including BF_{inclusion}) helps us determine whether to include each possible predictor in the model. The second part (the remaining columns to the right) tells us about the coefficients of each predictor. But there is so much more going on here — and it all deals with *uncertainty*. Let’s look deeper.

Recall from our earlier discussion of the model comparison table that we have uncertainty about which model best predicts our observed data. Certainly, we believe that the model with the single predictor avgView is best, but there is also a small probability that the two-predictor model is the right one. **Since my goal is to inform my own future policy about permitting asynchronous attendance, I would like to know which predictors I should include in the model.** JASP helps answer this using *Bayesian model averaging*, which combines the evidence for including a particular predictor by averaging across the models that contain that predictor. Here’s how it works. The prior probability of including the variable sync in our model is 0.5 — this is because 2 of the 4 models include sync. Similarly, the prior probability of including avgView is also 0.5. After observing data, these prior probabilities are updated to posterior probabilities. The posterior probability of including sync falls to 0.243 — this number comes from adding the posterior probabilities of the two models containing sync (i.e., 0.220 + 0.023 = 0.243). Another way to say this is that the posterior probability of *excluding* sync is 1 – 0.243 = 0.757.

On the other hand, the posterior probability of including avgView *increases* to 0.966. Converting these inclusion probabilities to inclusion odds (as above), we can divide the posterior inclusion odds by the prior inclusion odds to get the *inclusion Bayes factor*. Including avgView in the model produces BF_{inclusion} = 28.817. This means that the data have increased our prior odds for including avgView as a predictor by a factor of 28.817 — strong evidence for including avgView in the model. Conversely, the posterior inclusion odds for sync are 0.243 / 0.757 = 0.321, so the data have *decreased* our prior odds for including sync by a factor of 1 / 0.321 = 3.11. **Based on this evidence, I will choose to include only average viewing time as a predictor of course grade (and leave out attendance mode).**
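As a check on the model-averaging arithmetic, here is a small sketch that computes both inclusion Bayes factors directly from the posterior model probabilities in the comparison table. Because it uses the rounded table values, the result for avgView differs slightly from JASP’s unrounded 28.817.

```python
# Inclusion Bayes factors via Bayesian model averaging, computed from the
# (rounded) posterior model probabilities in the comparison table.
posterior = {
    ("avgView",):        0.746,
    ("sync", "avgView"): 0.220,
    ("sync",):           0.023,
    ():                  0.011,   # null model
}

def inclusion_bf(predictor, prior_inclusion=0.5):
    # Posterior inclusion probability: sum over models containing the predictor.
    p_inc = sum(p for preds, p in posterior.items() if predictor in preds)
    posterior_odds = p_inc / (1 - p_inc)
    prior_odds = prior_inclusion / (1 - prior_inclusion)  # = 1 here
    return posterior_odds / prior_odds

print(round(inclusion_bf("avgView"), 2))   # ≈ 28.4 with rounded inputs
print(round(inclusion_bf("sync"), 3))      # ≈ 0.321
```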

So what does this have to do with __estimating__ the impact of average viewing time? A Bayesian analysis provides not only a point estimate for each predictor’s coefficient (the column labeled “Mean”), but also a 95% credible interval that captures our uncertainty about it. **But it is important to note that any estimate we make is conditional on the underlying model.** For example, the estimate of the effect of avgView will be different under the single-predictor model than under the two-predictor model that also includes sync. So, we have uncertainty in two places: uncertainty in the estimate itself AND uncertainty in the model choice. Bayesian model averaging provides an elegant solution to this problem. The 95% credible intervals that we see for each coefficient in the table reflect a *weighted average*, where each estimate is weighted by the posterior probability of including that specific predictor in the model. Thus, the resulting credible intervals account not only for uncertainty *within* the model, but also uncertainty *across* the models. This averaging becomes apparent when we look at the marginal posterior distribution plots (below).

From the table we can see that the coefficient of avgView has a posterior mean of 0.394. This means each additional minute of watching the recorded lecture videos improves course grade by an average of 0.394 points. **Said differently, every additional 25 minutes of average viewing time improves course grade by roughly 10 points (a “letter grade” in the US grading system).** The model-averaged credible interval tells us that there is a 95% probability that this coefficient lies between 0.000 and 0.616. Notice the small “spike” at 0 on the left tail of the marginal posterior distribution plot — this spike reflects the (albeit small) probability of 0.034 of excluding avgView as a predictor.
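The letter-grade arithmetic is a one-liner:

```python
# Expected grade gain implied by the model-averaged posterior mean.
coef_avg_view = 0.394        # grade points per minute of average viewing time
extra_minutes = 25
gain = coef_avg_view * extra_minutes
print(round(gain, 2))        # 9.85, i.e. roughly a 10-point 'letter grade'
```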


On the other hand, consider the marginal posterior distribution for the coefficient of sync. Even though the table gives us an estimate, there is a large spike at 0 for sync. This reflects the large probability (0.757) of excluding sync as a predictor in the model. If sync is included in the model (the probability of including it is 0.243), it is 95% probable that the effect of synchronous attendance is between -8.54 points and +12.16 points. Clearly, we do not see a consistent effect of synchronous attendance. In fact, I would argue that we have positive evidence for the *absence* of any such effect.
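One way to build intuition for these spike-and-slab plots is to simulate draws from a model-averaged posterior. The sketch below is a toy illustration, not JASP’s actual computation: the normal “slab” parameters are rough stand-ins chosen only to loosely match the reported 95% interval (-8.54, +12.16).

```python
import random

# Toy simulation of a model-averaged posterior for the sync coefficient.
# With probability 0.757 the coefficient is exactly 0 (the "spike");
# otherwise it is drawn from the conditional posterior (the "slab").
P_INCLUDE_SYNC = 0.243            # posterior inclusion probability from the table
SLAB_MEAN, SLAB_SD = 1.8, 5.3     # rough stand-ins, not JASP's actual values

def draw_sync_coefficient(rng=random):
    if rng.random() < P_INCLUDE_SYNC:
        return rng.gauss(SLAB_MEAN, SLAB_SD)
    return 0.0

samples = [draw_sync_coefficient() for _ in range(100_000)]
share_at_zero = sum(s == 0.0 for s in samples) / len(samples)
print(round(share_at_zero, 2))    # ≈ 0.76: the spike at zero
```

A histogram of `samples` would reproduce the qualitative shape of JASP’s marginal posterior plot for sync: a tall spike at zero plus a wide, shallow bump straddling it.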

## Summary

In this blog post, I have given you a tour of Bayesian linear regression in JASP. There is much more to discuss — for more details, I recommend you read this excellent preprint by Don van den Bergh and colleagues. Additionally, you can see a tutorial example of using Bayesian linear regression with JASP in our recently published paper in the Journal of Numerical Cognition.

Oh, and what about attendance mode for my first year statistics students? Based on this analysis, I will continue allowing them to attend the course asynchronously — but I’ll certainly push them to watch the recorded lectures!

## References

van den Bergh, D., Clyde, M. A., Raj, A., de Jong, T., Gronau, Q. F., Marsman, M., Ly, A., & Wagenmakers, E.-J. (2020). A tutorial on Bayesian multi-model linear regression with BAS and JASP. Preprint available on PsyArXiv: https://psyarxiv.com/pqju6/

Faulkenberry, T. J., Ly, A., & Wagenmakers, E.-J. (2020). Bayesian inference in numerical cognition: A tutorial using JASP. *Journal of Numerical Cognition, 6*(2), 231-259. https://doi.org/10.5964/jnc.v6i2.288
