JASP 0.11 has been released and is now available on our download page. This version adds the Machine Learning module with 13 brand new analyses that can be used for supervised and unsupervised learning. With supervised learning, the goal is to predict a target variable by learning from existing labeled data. The goal of unsupervised learning, on the other hand, is to look for underlying patterns/structures in unlabeled data.
For supervised learning, the Machine Learning module distinguishes between regression and classification. A supervised learning problem involves finding the underlying relationship between a target variable and one or more predictor variables. When the target is continuous (e.g., housing prices) we use a regression method to model this relationship, and when the target is nominal (e.g., “deceased” or “survived”) we use a classification method. The Machine Learning module contains the following four analyses for regression:
- Boosting Regression
- K-Nearest Neighbors Regression
- Random Forest Regression
- Regularized Linear Regression
The module also provides the following four analyses for classification:
- Boosting Classification
- K-Nearest Neighbors Classification
- Linear Discriminant Classification
- Random Forest Classification
We also included methods for unsupervised learning (clustering) in the module. The following five analyses for clustering are provided:
- Density-Based Clustering
- Fuzzy C-Means Clustering
- Hierarchical Clustering
- K-Means Clustering
- Random Forest Clustering
All analyses contain options to manually set the relevant training parameters and automatically optimize the most important ones. Supervised analyses have an additional tab (Data Split Preferences) allowing users to change the way the data are used for learning and prediction.
Data Split Preferences for Supervised Learning
Roughly speaking, to learn the relationship between a target variable and its predictor variables using a machine learning algorithm, two questions need to be addressed: (1) What is the form of the relationship between the target variable and its predictor variables? (2) What are the optimal parameters for the algorithm? Once a relationship between the target and the predictors has been learned and fine-tuned, a third question arises: (3) How well does the learned relationship predict data that it has never seen before, and how (un)certain are we about these predictions?
Figure 1. Data split output in the Machine Learning module of JASP 0.11.
Questions (1) and (2) are addressed by training the algorithm on part of the data, the training set, and optimizing its performance on a different part of the data, the validation set. Question (3) is addressed by using the trained algorithm to make predictions on the remaining part of the data, the test set. In the Data Split Preferences tab, it is possible to specify the percentage of data used for the hold-out test set, the training set, and the validation set. In the background, R then randomly partitions the data according to these preferences. The resulting partition is also illustrated in the output, as shown in Figure 1.
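To make the idea of a random three-way partition concrete, here is a minimal sketch in Python. It is not the code JASP runs (JASP does this in R); the function name `split_indices`, its parameters, and the choice to take the validation fraction from the rows left over after the test set is held out are all assumptions made for illustration.

```python
import random


def split_indices(n_rows, test_fraction=0.2, validation_fraction=0.2, seed=None):
    """Randomly partition row indices into training, validation, and test sets.

    Assumption for this sketch: validation_fraction applies to the rows
    that remain after the test set has been held out.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)  # random order, so each partition is a random sample

    n_test = round(n_rows * test_fraction)
    n_validation = round((n_rows - n_test) * validation_fraction)

    test = indices[:n_test]
    validation = indices[n_test:n_test + n_validation]
    training = indices[n_test + n_validation:]
    return training, validation, test


training, validation, test = split_indices(100, seed=1)
print(len(training), len(validation), len(test))  # 64 16 20
```

Every row ends up in exactly one of the three sets, which is what allows the test set to stand in for data the algorithm has never seen.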
The percentage of the hold-out test set only tells you the size of the partition, but not which rows from the data set it includes. The cases belonging to the randomly held-out test set can be identified by adding the generated test indicator to the data. A “1” indicates that the row is included in the test set, whereas a “0” implies that it is excluded. The analysis can then be re-run using the same test set. Alternatively, a pre-specified test set containing representative data could be used to evaluate how well the model's predictions generalize.
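The test indicator described above is simply a column of 0s and 1s with one entry per row. The sketch below shows, in Python and with hypothetical helper names (`make_test_indicator`, `split_by_indicator`), how such a column could be built from a set of test rows and then used to recover the same test set later. It illustrates the idea only and does not reflect how JASP stores the indicator internally.

```python
def make_test_indicator(n_rows, test_rows):
    """Return a 0/1 list: 1 if the row is in the test set, 0 otherwise."""
    test_rows = set(test_rows)
    return [1 if i in test_rows else 0 for i in range(n_rows)]


def split_by_indicator(rows, indicator):
    """Split rows into (non-test rows, test rows) using a 0/1 indicator."""
    test = [row for row, flag in zip(rows, indicator) if flag == 1]
    rest = [row for row, flag in zip(rows, indicator) if flag == 0]
    return rest, test


# Hypothetical data: six cases, of which rows 1 and 4 were held out.
rows = ["a", "b", "c", "d", "e", "f"]
indicator = make_test_indicator(len(rows), test_rows=[1, 4])
print(indicator)  # [0, 1, 0, 0, 1, 0]

rest, test = split_by_indicator(rows, indicator)
print(test)  # ['b', 'e']
```

Because the indicator is stored with the data, the same rows can be held out again in a later run without relying on the random partition.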
To further ease reproducibility and comparison of analyses, we included the possibility to fix a random seed. This option is turned off by default because machine learning techniques are inherently random. If it is turned on, the results of an analysis on a particular data set can be recreated by using the same seed.
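The effect of fixing a seed can be illustrated with a few lines of Python. This is a generic sketch using Python's standard `random` module, not JASP's own seeding mechanism; the function name `shuffled_rows` is hypothetical.

```python
import random


def shuffled_rows(n_rows, seed=None):
    """Return row indices in random order; a fixed seed fixes the order."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    return rows


# Same seed: the random order, and hence any split based on it, is identical.
first = shuffled_rows(10, seed=42)
second = shuffled_rows(10, seed=42)
print(first == second)  # True

# No seed: each call draws a fresh random order, so results usually differ.
print(shuffled_rows(10), shuffled_rows(10))
```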
In this version, there is not yet a possibility to save a trained model and use it to make predictions for an entirely new data set. However, this is high on the priority list and will be featured in the next version of the module.
Be sure to keep an eye on this website, as the Machine Learning team will publish several blog posts elaborating on the theory and practice of regression, classification, and clustering using the Machine Learning module in JASP.
For a complete list of features, or to get started with JASP 0.11 right away, head over to our download page.