How to Filter Your Data in JASP

Released last week, JASP 0.9 contains a feature that many users have eagerly anticipated: filtering. This post demonstrates how to use the new filtering functionality by going over several examples.

The filtering functionality in JASP has three interfaces: the Click Filter, the Drag and Drop Filter and the R Filter. To illustrate their use, we will analyze a couple of data sets from the JASP Data Library. To access the Data Library, see here.

The Click Filter

The Click Filter is a simple filtering interface for categorical variables. It can be accessed by clicking on the column header of the variable that is to be filtered.

For our first example, we will use the data set “Kitchen Rolls” (Data Library -> Kitchen Rolls) to show how to filter according to constraints on one or more categorical variables. Suppose we want to include only female students in our analysis. “Kitchen Rolls” contains two variables that are relevant in this respect: the variable Sex contains the sex of each participant (denoted by ‘M’ and ‘F’), and the variable Student contains the information whether a participant is a student or not (denoted by ‘Y’ and ‘N’).

First, click on the header of the Sex column. To remove all male participants from the analysis, turn off the ‘M’ value by clicking on the check mark next to it. The check mark turns into a cross, indicating that the data from the male participants will not be used in the analysis. We can immediately see that all rows that have an ‘M’ under Sex are greyed out and inactive. After closing this window and opening the variable window of Student, we can turn off ‘N’ in order to exclude non-students. Now only rows that contain both ‘F’ and ‘Y’ in the respective columns are active. To erase all filters created in the Click Filter, click the eraser icon on the right-hand side.
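For readers who prefer to think in code, the two selections above amount to a logical AND of two conditions. The following is only a minimal base R sketch on a small made-up data frame (the values are invented and are not the actual Kitchen Rolls data):

```r
# Toy data standing in for the Sex and Student columns (made-up values).
toy <- data.frame(
  Sex     = c("F", "M", "F", "F", "M"),
  Student = c("Y", "Y", "N", "Y", "N")
)

# A row stays active only if it passes both selections.
active <- toy$Sex == "F" & toy$Student == "Y"

# Show the rows that would remain active.
toy[active, ]
```

Rows for which `active` is FALSE correspond to the greyed-out rows in the data view.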

The Drag and Drop Filter

The Drag and Drop Filter lets you create filters via simple dragging-and-dropping. Once a data set or .jasp-file has been loaded into JASP, the Drag and Drop Filter can be accessed by clicking the filter icon in the top left corner of the data view.

Filtering One or More Categorical Variables

Let’s first create the same filter as in the previous example, now using the Drag and Drop Filter.

After loading the data file from the Data Library, we access the Drag and Drop Filter as shown above. We then drag the variable Sex from the left menu into the box, followed by =. Note that you can also add variables or operators by simply clicking on them. Next, we click on the empty right-hand side of the equation, type in the text ‘F’, and press enter. We can now click Apply pass-through filter and we see that only the rows that contain an ‘F’ in the Sex column remain active. In order to get all female students, we add another formula below by dragging Student into the box, followed by =. We then type in ‘Y’ and press enter. After applying the filter, we see that the only active rows are those that satisfy both constraints. In the same way, we may also insert ≠ into the filter, which activates only rows that are not equal to a given value on a given variable. To erase all filters created in the Drag and Drop Filter, double-click the trash can icon and apply.

Filtering a Continuous Column

Next, we will examine how to filter data according to an equality or inequality constraint on a continuous column, that is, using operators such as equal to or greater / less than on a variable of the ‘scale’ type. For this example, we will use the data set ‘Presidents’ Height’ (Data Library -> Miscellaneous), which contains, for each US president, the height ratio (i.e., their height versus that of their closest competitor) and their proportion of the popular vote.

Suppose we want to filter out all outliers from the continuous variable ‘Heights Ratio’. One way to do that is to go to Descriptives, click Descriptive Statistics, and request a Boxplot under Plots. Clicking Label Outliers under Boxplots, we see that there is one outlier in the data, namely the one located in row number 10. Your JASP window should look like this:

Having found out where the outlier is located, we can now go ahead and filter it out. To do that, first click OK to leave the descriptives menu. Scrolling down to row number 10, we see that this president has a heights ratio of 1.18405. We also see that the variable V1 reflects the number of the row. This is convenient, because it means that we can filter on this variable, not having to memorize the precise heights ratio of this president.

After accessing the Drag and Drop Filter, we drag the variable V1 into the box. We now want to tell the filter to retain all values of this variable aside from 10. Thus, we drag a ≠ next to V1. We click on the empty space to the right of the ≠, fill in ‘10’, and press enter. After clicking Apply pass-through filter we see that row 10 has been filtered out. Of course, we may also filter directly through the variable Heights Ratio by applying the formula Heights Ratio ≠ 1.18405.

Another way would be to create a less-than or greater-than formula. The heights ratio of the outlier is 1.18405, and the box plot tells us that there are no other observations above 1.15. That means that the formula Heights Ratio < 1.15 will also work, telling the filter to retain all values that are lower than 1.15.
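The three alternatives discussed above can be compared side by side in plain R. This is a sketch on invented numbers: only the outlier value 1.18405 in row 10 is taken from the text, and the name `heights_ratio` is a stand-in for the Heights Ratio column.

```r
# Made-up height ratios; only row 10 carries the outlier value from the text.
V1 <- 1:12
heights_ratio <- c(1.02, 0.98, 1.05, 0.95, 1.00, 1.08, 0.93, 1.03, 0.99,
                   1.18405, 1.01, 0.97)

by_row    <- V1 != 10                  # V1 is not equal to 10
by_value  <- heights_ratio != 1.18405  # Heights Ratio is not equal to 1.18405
by_cutoff <- heights_ratio < 1.15      # Heights Ratio is lower than 1.15

# In this toy data all three filters switch off the same single row.
which(!by_row)
which(!by_value)
which(!by_cutoff)
```

One caveat on the second variant: an exact comparison only matches if the stored value is exactly the typed one. If the data view shows a rounded number, the row-number or cutoff variants are likely the safer choice.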

Filtering Using More Complex Formulas

When trying to filter out outliers, the methods discussed above will not always be practical. For instance, consider a situation where we have several outliers instead of just one. It would be quite some work to create separate formulas for all these outliers. Instead, let us create two formulas that get rid of all outliers at once, without even having to look into descriptives beforehand.

The variable fuseTime in the data set ‘Stereograms’ (Data Library -> T-Tests) contains several outliers. Given the definition of an outlier as an observation that is at least 2 standard deviations away from the mean, we can create two formulas that systematically filter these kinds of observations out.

We want to tell the filter to only let values of fuseTime pass if they are lower than the mean of that variable plus two standard deviations. To do that, we enter fuseTime, followed by the operator <. We then add mean from the menu on the right. We drag another instance of fuseTime into the ‘values’ field of mean, and add a +. Now, we click on *, and add a ‘2’ on the left and a σ (the symbol for standard deviation) on the right side. Lastly, we again drag an instance of fuseTime into the ‘values’ field of σ. In order to also get rid of the outliers on the other end of the distribution, we add the same formula again, this time with the operator > and a - instead of the +. After application, 3% of observations are filtered out.
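Written out in R, the two formulas form a single band around the mean. The sketch below uses invented fuse times rather than the Stereograms data, and it assumes that σ corresponds to R's `sd()` (the sample standard deviation), which the post does not state explicitly.

```r
# Made-up fuse times with one extreme value (not the Stereograms data).
fuse_time <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 60)

# Bounds of the band: mean plus / minus two standard deviations.
upper <- mean(fuse_time) + 2 * sd(fuse_time)
lower <- mean(fuse_time) - 2 * sd(fuse_time)

# Both formulas must hold for a row to stay active.
keep <- fuse_time < upper & fuse_time > lower

fuse_time[keep]
```

In this toy example only the value 60 falls outside the band, so it is the only observation that is filtered out.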

Split-By Filtering

Split-by filtering can be used to apply a filter separately for all groups in a categorical variable. For instance, instead of removing outliers with respect to the overall mean and standard deviation, we might be interested in removing the outliers within each group that a data set contains. Continuing the previous example, we now want to filter out observations that are outliers with respect to the mean and standard deviation of their respective group.

To filter out the outliers per group, we simply add the split-by operator | after each of the two formulas that we created in the previous example, and drop the variable ‘condition’ after it. After applying the filter, we see that 6% of observations are filtered out.
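To see why the overall and the split-by version can disagree, consider the following base R sketch. The data are invented, and `ave()` is used here as a stand-in for the split-by operator: it evaluates a function within each group and returns one value per row.

```r
# Made-up fuse times for two groups; group "b" has generally larger values.
fuse_time <- c(1, 2, 2, 3, 3, 3, 4, 4, 20,
               30, 32, 34, 36, 38, 40, 42, 44, 46)
condition <- rep(c("a", "b"), each = 9)

# Overall filter: one mean and one standard deviation for the whole column.
keep_overall <- fuse_time < mean(fuse_time) + 2 * sd(fuse_time) &
  fuse_time > mean(fuse_time) - 2 * sd(fuse_time)

# Split-by filter: mean and standard deviation computed within each group.
grp_mean <- ave(fuse_time, condition, FUN = mean)
grp_sd   <- ave(fuse_time, condition, FUN = sd)
keep_split <- fuse_time < grp_mean + 2 * grp_sd &
  fuse_time > grp_mean - 2 * grp_sd

# Rows that pass the overall filter but fail within their own group.
which(keep_overall & !keep_split)
```

Here the value 20 is unremarkable relative to the whole column, but it lies far from the other values in group "a", so only the split-by version filters it out.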

For another example of split-by filtering, let us look at the data set ‘Sleep’ (Data Library -> Descriptives). Consider a situation where we only want to analyze the participants who scored lower than the median of their group on the variable extra.

To do this, we enter extra, followed by <. We then add median and drop another instance of extra into the ‘values’ field of median. At this point, our filter formula retains all participants who scored below the overall median. To retain instead only those participants who scored lower than the median of their respective group, we add the split-by operator | and drop group behind it. After applying the filter, we see that 10 observations remain active: the bottom 50% of group 1 and the bottom 50% of group 2.
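The same comparison can be reproduced in plain R with the `sleep` data set that ships with R. That the Data Library file is identical to R's built-in `sleep` data is an assumption on our part; `ave()` again stands in for the split-by operator.

```r
# `sleep` is part of base R (package datasets): columns extra, group, ID.

# Below the overall median.
overall <- sleep$extra < median(sleep$extra)

# Below the median of the participant's own group.
by_group <- sleep$extra < ave(sleep$extra, sleep$group, FUN = median)

sum(by_group)                 # number of active rows under the split-by filter
table(sleep$group[overall])   # active rows per group, overall median
table(sleep$group[by_group])  # active rows per group, group-wise median
```

Comparing the two tables shows how the split-by version balances the selection across the two groups, whereas the overall median does not.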

The R Filter

The R Filter can be accessed by clicking the ‘R’ symbol in the bottom left corner of the Drag and Drop Filter. Using the R Filter gives you the opportunity to write custom filters for your data that are not available through the Drag and Drop Filter. Furthermore, you may also combine filters that you create in R mode with those selected in the variable window or made in the Drag and Drop Filter. In fact, you can easily observe the code that is generated by those other filters in the top read-only textbox that starts with generatedFilter <-. For instance, leaving the filter created in the previous Drag and Drop example activated, switching into R mode yields the following screen (displaying the created filter in the form of R code at the top):

To refer to your data, simply enter the name of the targeted column in the code window. JASP will make sure that it refers to the correct data. If your column name contains spaces, you must enter these as well. Below you can find a couple of examples, which only work if your data contains those columns. To erase all filters created in the R Filter, click the eraser icon at the bottom of the interface and apply.

Applying the opposite of what has previously been constructed via the Drag and Drop Filter or the variable window

!generatedFilter

Filtering on Gender and TestScore

Gender == "Female" & TestScore > 5

Filtering on Gender and TestScore while taking into account filters that have previously been constructed via the drag & drop filter or the variable window

generatedFilter & Gender == "Female" & TestScore > 5
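Outside JASP, the way these pieces combine can be tried out with ordinary logical vectors. In the sketch below, `generatedFilter`, `Gender` and `TestScore` are hand-made stand-ins with invented values; inside JASP they are supplied by the program and your data.

```r
# Hypothetical stand-ins: in JASP these come from the other filters and the data.
generatedFilter <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
Gender          <- c("Female", "Male", "Female", "Female", "Female")
TestScore       <- c(7, 9, 8, 3, 6)

# A filter is a logical vector with one TRUE / FALSE per row.
own_filter <- Gender == "Female" & TestScore > 5

# Rows must pass both the previously generated filter and the new one.
combined <- generatedFilter & own_filter

# Negation flips a filter: every inactive row becomes active and vice versa.
inverted <- !generatedFilter
```

The third row illustrates the difference: it passes `own_filter`, but because the generated filter had already excluded it, it stays inactive in `combined`.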

Filtering on Age split by Sex

(mean(Age) > Age) %|% Sex

The operator %|% is a JASP-specific R-operator that makes sure the code to the left of it is run separately for each group from the variable to its right. In the above case it means that the filter passes all rows on which Age is lower than the mean Age for that specific Sex. Be sure to add the parentheses around the exact expression that you wish to be conditioned. For instance, the following would not work as expected: mean(Age) > Age %|% Sex, as this would just condition Age on Sex, while leaving mean(Age) to be calculated for the whole column.

While the Drag and Drop Filter is limited by the operators and functions that it provides, the possibilities using the R mode filter are virtually endless. Any filter expressed as R code will work in JASP’s R Filter.


About the author

Tim Draws

Tim Draws is a PhD candidate in the Web Information Systems group at Delft University of Technology. At JASP, he is contributing to the Machine Learning Module.