Released last week, JASP 0.9 contains a feature that many users have eagerly anticipated: filtering. This post demonstrates how to use the new filtering functionality by going over several examples.
The filtering functionality in JASP has three interfaces: the Click Filter, the Drag and Drop Filter and the R Filter. To illustrate their use, we will analyze a couple of data sets from the JASP Data Library. To access the Data Library, see here.
The Click Filter
The Click Filter is a simple filtering interface for categorical variables. It can be accessed by clicking on the column header of the variable that is to be filtered.
For our first example, we will use the data set “Kitchen Rolls” (Data Library -> Kitchen Rolls) to show how to filter according to constraints on one or more categorical variables. Suppose we want to include only female students in our analysis. “Kitchen Rolls” contains two variables that are relevant in this respect: the variable Sex contains the sex of each participant (denoted by ‘M’ and ‘F’), and the variable Student indicates whether a participant is a student or not (denoted by ‘Y’ and ‘N’).
First click on the header of the Sex column. To remove all male participants from the analysis, turn off the ‘M’ value by clicking on the check mark next to it. The cross indicates that the data from the male participants will not be used in the analysis. We can immediately see that all rows that have an ‘M’ under Sex are greyed out and inactive. After closing this window and opening the variable window of Student, we can turn off ‘N’ in order to exclude non-students. Now only the rows that contain ‘F’ under Sex and ‘Y’ under Student are active. To erase all filters created in the Click Filter, click the eraser icon on the right-hand side.
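For readers who prefer to think in code, the Click Filter can be pictured as keeping the rows whose value is among the levels that are still ticked. The base R sketch below only illustrates that logic and is not JASP's implementation; the data frame kitchen and its six rows are made up, and only the column names and codes follow the description above.

```r
# Hypothetical toy data; only the column names and codes mirror "Kitchen Rolls".
kitchen <- data.frame(
  Sex     = c("F", "M", "F", "F", "M", "F"),
  Student = c("Y", "Y", "N", "Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# Levels left ticked in each column's click-filter window.
kept_sex     <- c("F")  # 'M' switched off
kept_student <- c("Y")  # 'N' switched off

# A row stays active only if it passes the filter of every column.
active <- kitchen$Sex %in% kept_sex & kitchen$Student %in% kept_student

kitchen[active, ]  # the female students
```

Switching a level back on corresponds to adding it to the relevant kept_ vector again.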
The Drag and Drop Filter
The Drag and Drop Filter lets you create filters via simple dragging-and-dropping. Once a data set or .jasp-file has been loaded into JASP, the Drag and Drop Filter can be accessed by clicking the filter icon in the top left corner of the data view.
Filtering One or More Categorical Variables
Let’s first create the same filter as in the previous example, now using the Drag and Drop Filter.
After loading the data file from the Data Library, we access the Drag and Drop Filter as shown above. We then drag the variable Sex from the left menu into the box, followed by =. Note that you can also add variables or operators by simply clicking on them. Next, we click on the empty right-hand side of the equation, type in the text ‘F’, and press enter. We can now click Apply pass-through filter and we see that only the rows that contain an ‘F’ in the Sex column remain active. In order to get all female students, we add another formula below by dragging Student into the box, followed by =. We then type in ‘Y’ and press enter. After applying the filter, we see that the only active rows are those that satisfy both constraints. In the same way, we may also insert ≠ into the filter, which keeps active only the rows that differ from a given value on a given variable. To erase all filters created in the Drag and Drop Filter, double-click the trash can icon and apply.
Filtering a Continuous Column
Next, we will examine how to filter data according to an equality or inequality constraint on a continuous column, that is, using operators such as equal to or greater / less than on a variable of the ‘scale’ type. For this example, we will use the data set ‘Presidents’ Height’ (Data Library -> Miscellaneous), which contains, for each US president, the height ratio (i.e., their height relative to that of their closest competitor) and their proportion of the popular vote.
Suppose we want to filter out all outliers from the continuous variable ‘Heights Ratio’. One way to do that is to go to Descriptives, click Descriptive Statistics, and request a Boxplot under Plots. Clicking Label Outliers under Boxplots, we see that there is one outlier in the data, namely the one located in row number 10. Your JASP window should look like this:
Having found out where the outlier is located, we can now go ahead and filter it out. To do that, first click OK to leave the descriptives menu. Scrolling down to row number 10, we see that this president has a heights ratio of 1.18405. We also see that the variable V1 reflects the number of the row. This is convenient, because it means that we can filter on this variable without having to memorize the precise heights ratio of this president.
After accessing the Drag and Drop Filter, we drag the variable V1 into the box. We now want to tell the filter to retain all values of this variable aside from 10. Thus, we drag a ≠ next to V1. We click on the empty space to the right of the ≠, fill in ‘10’, and press enter. After clicking Apply pass-through filter we see that row 10 has been filtered out. Of course, we may also filter directly through the variable Heights Ratio by applying the formula Heights Ratio ≠ 1.18405.
Another way would be to create a less-than or greater-than formula. The heights ratio of the outlier is 1.18405, and the box plot tells us that no other observation lies above 1.15. That means that the formula Heights Ratio < 1.15 will also work, telling the filter to retain all values that are lower than 1.15.
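The three ways of removing the outlier can be compared in a small base R sketch. The data below are invented (twelve rows, with the value 1.18405 placed in row 10 as in the text), and the column name HeightsRatio is written without a space purely for convenience, so this is an illustration of the logic rather than a reproduction of the data set.

```r
# Hypothetical toy data: V1 is the row number, row 10 holds the outlier.
presidents <- data.frame(
  V1 = 1:12,
  HeightsRatio = c(1.02, 0.97, 1.05, 0.99, 1.08, 0.94,
                   1.01, 1.10, 0.96, 1.18405, 1.03, 1.00)
)

by_row       <- presidents$V1 != 10                 # V1 is not equal to 10
by_value     <- presidents$HeightsRatio != 1.18405  # exact match on the value
by_threshold <- presidents$HeightsRatio < 1.15      # everything below the cut-off

# In this toy data all three filters deactivate the same single row.
which(!by_row)
```

One caveat worth checking in practice: an exact comparison such as by_value only matches if the stored value is exactly 1.18405. If the data view shows a rounded number, filtering on the row number or on a threshold is the more robust choice.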
Filtering Using More Complex Formulas
When trying to filter out outliers, the methods discussed above will not always prove practical. For instance, consider a situation where we have several outliers instead of just one. It would be quite some work to create separate formulas for all these outliers. Instead, let us create two formulas that get rid of all outliers at once, without even having to look into descriptives beforehand.
The variable fuseTime in the data set ‘Stereograms’ (Data Library -> T-Tests) contains several outliers. If we define an outlier as an observation that is at least 2 standard deviations away from the mean, we can create two formulas that systematically filter these observations out.
We want to tell the filter to only let values of fuseTime pass if they are lower than the mean of that variable plus two standard deviations. To do that, we enter fuseTime, followed by the operator <. We then add mean from the menu on the right. We drag another instance of fuseTime into the ‘values’ field of mean, and add a +. Now, we click on *, and add a ‘2’ on the left and a σ (the symbol for standard deviation) on the right side. Lastly, we again drag an instance of fuseTime into the ‘values’ field of σ. Written out, the formula reads fuseTime < mean(fuseTime) + 2 * σ(fuseTime). In order to also get rid of the outliers on the other end of the distribution, we add the same formula again, this time with the operator > and a - instead of the +, which gives fuseTime > mean(fuseTime) - 2 * σ(fuseTime). After application, 3% of observations are filtered out.
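The two formulas can be mirrored in a few lines of base R. This is a sketch on ten made-up values rather than the Stereograms data, and it assumes that σ corresponds to R's sd(), that is, the sample standard deviation.

```r
fuse_time <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 30)  # made-up values, not the real data

# Bounds at two standard deviations above and below the mean.
upper <- mean(fuse_time) + 2 * sd(fuse_time)
lower <- mean(fuse_time) - 2 * sd(fuse_time)

# Both formulas must hold for a row to stay active.
keep <- fuse_time < upper & fuse_time > lower

which(!keep)  # positions of the filtered-out values
mean(!keep)   # share of values filtered out
```

Because the bounds are computed from the data, the same two lines keep working when the number of outliers changes, which is the point of the formula-based approach.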
Split-by filtering can be used to apply a filter separately for all groups in a categorical variable. For instance, instead of removing outliers with respect to the overall mean and standard deviation, we might be interested in removing the outliers within each group that a data set contains. Continuing the previous example, we now want to filter out observations that are outliers with respect to the mean and standard deviation of their respective group.
To filter out the outliers per group, we simply add the split-by operator | after each of the two formulas that we created in the previous example, and drop the variable ‘condition’ after it. After applying the filter, we see that 6% of observations are filtered out.
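To see why the per-group filter can remove observations that the overall filter keeps, the split-by idea can be emulated in base R with ave(), which applies a function within each level of a grouping variable. The sketch below uses invented values and the hypothetical helper within_2sd; the group labels are placeholders, and this is an emulation of the idea, not the code JASP generates.

```r
# Hypothetical toy data with two groups of ten observations each.
stereo <- data.frame(
  fuseTime  = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 8,
                18, 19, 20, 21, 22, 18, 19, 20, 21, 22),
  condition = rep(c("NV", "VV"), each = 10)
)

# Helper: TRUE for values strictly within two standard deviations of the mean.
within_2sd <- function(x) {
  x < mean(x) + 2 * sd(x) & x > mean(x) - 2 * sd(x)
}

# Filter relative to the overall mean and standard deviation.
overall <- within_2sd(stereo$fuseTime)

# Filter relative to each group's own mean and standard deviation.
# ave() returns a numeric 0/1 vector here, so it is converted back to logical.
by_group <- as.logical(ave(stereo$fuseTime, stereo$condition, FUN = within_2sd))

which(!overall)   # rows removed by the overall filter
which(!by_group)  # rows removed by the per-group filter
```

In this toy data the value 8 is unremarkable next to the second group, so the overall filter keeps it, but it is far from the other values in its own group and is removed once the filter is split by condition.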
For another example of split-by filtering, let us look at the data set ‘Sleep’ (Data Library -> Descriptives). Consider a situation where we only want to analyze the participants who scored lower than the median of their group on the variable extra. To do this, we enter extra, followed by <. We then add median and drop another instance of extra into the ‘values’ field of median. As of now, our filter formula passes all participants who scored below the overall median. In order to pass only the participants who scored lower than the median of their respective group, we add the split-by operator |, and drop group after it. After applying the filter, we see that 10 observations remain active. Those are the bottom 50% of group 1 and the bottom 50% of group 2.
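The difference between the overall and the per-group median can be checked outside JASP as well. The sketch below uses R's built-in sleep data, which has the same columns extra and group; that it is identical to the ‘Sleep’ data set in the Data Library is an assumption. As before, ave() stands in for the split-by operator.

```r
# R's built-in `sleep` data; assumed to match the JASP 'Sleep' data set.
below_overall <- sleep$extra < median(sleep$extra)

# Compare each value with the median of its own group.
# ave() returns numeric 0/1 here, so it is converted back to logical.
below_group <- as.logical(
  ave(sleep$extra, sleep$group, FUN = function(x) x < median(x))
)

table(sleep$group[below_overall])  # active rows per group, overall median
table(sleep$group[below_group])    # active rows per group, per-group median
sum(below_group)                   # number of active rows after split-by
```

With the overall median the active rows need not be spread evenly over the two groups, whereas the per-group version keeps the lower half of each group.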
The R Filter
The R Filter can be accessed by clicking the ‘R’ symbol in the bottom left corner of the Drag and Drop Filter. Using the R Filter gives you the opportunity to write custom filters for your data that are not available through the Drag and Drop Filter. Furthermore, you may also combine filters that you create in R mode with those selected in the variable window or made in the Drag and Drop Filter. In fact, you can easily inspect the code that is generated by those other filters in the top read-only textbox that starts with generatedFilter <-. For instance, leaving the filter created in the previous Drag and Drop example activated, switching into R mode yields the following screen (displaying the created filter in the form of R code at the top):
To refer to your data, simply enter the name of the targeted column in the code window. JASP will make sure that it refers to the correct data. If your column name contains spaces, you must enter these as well. Below you can find a couple of examples, which only work if your data contains those columns. To erase all filters created in the R Filter, click the eraser icon at the bottom of the interface and apply.
Applying the opposite of what has previously been constructed via the Drag and Drop Filter or the variable window:
!generatedFilter
Selecting rows on Gender and TestScore:
Gender == "Female" & TestScore > 5
Selecting rows on Gender and TestScore while taking into account filters that have previously been constructed via the Drag and Drop Filter or the variable window:
generatedFilter & Gender == "Female" & TestScore > 5
Selecting rows with a lower-than-mean Age, split by Sex:
(mean(Age) > Age) %|% Sex
%|% is a JASP-specific R operator that makes sure the code to the left of it is run separately for each group from the variable to its right. In the above case it means that the filter passes all rows on which Age is lower than the mean Age for that specific Sex. Be sure to add the parentheses around the exact expression that you wish to be conditioned. For instance, the following would not work as expected: mean(Age) > Age %|% Sex, as this would just condition Age on Sex, while leaving mean(Age) to be calculated for the whole column.
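The reason the parentheses matter is R's operator precedence: user-defined operators of the form %name% bind more tightly than comparison operators such as >. This can be inspected in plain R without JASP, because quote() captures an expression without evaluating it, so %|% does not need to be defined. The variable names below are only labels for the two parsed expressions.

```r
# quote() parses the expression but does not run it.
unparenthesised <- quote(mean(Age) > Age %|% Sex)
parenthesised   <- quote((mean(Age) > Age) %|% Sex)

# Without parentheses the comparison is the outermost call,
# and only `Age` ends up on the left of %|%.
as.character(unparenthesised[[1]])
deparse(unparenthesised[[3]])

# With parentheses %|% is the outermost call and receives the whole comparison.
as.character(parenthesised[[1]])
```

The first two calls return ">" and "Age %|% Sex", and the last returns "%|%", which matches the behaviour described above.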
While the Drag and Drop Filter is limited by the operators and functions that it provides, the possibilities of the R Filter are virtually endless. Any filter expressed as R code will work in JASP’s R Filter.
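When writing your own expressions, it may help to picture a filter as a logical vector with one TRUE or FALSE per row, which is how generatedFilter is used in the examples above. The sketch below runs in plain R with made-up stand-ins: in JASP, generatedFilter and the column names are supplied by the program, whereas here they are ordinary vectors defined by hand.

```r
# Hypothetical stand-ins for what JASP provides inside the R Filter.
generatedFilter <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
Gender          <- c("Female", "Male", "Female", "Female", "Male")
TestScore       <- c(7, 8, 9, 3, 6)

# Rows must pass the earlier filters AND the new constraints.
combined <- generatedFilter & Gender == "Female" & TestScore > 5

# The opposite of the earlier filters: every TRUE becomes FALSE and vice versa.
inverted <- !generatedFilter

which(combined)
which(inverted)
```

Here only the first row passes the combined filter, since it is the only one that was active before, is "Female", and has a TestScore above 5.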