This is a continuation of our series on machine learning methods that have been implemented in JASP (version 0.11 onwards).
In this blog post we train a machine learning model to find clusters within our data set. The goal of a clustering task is to detect structures in the data. To do so, the algorithm needs to (1) identify the number of structures/groups in the data, and (2) figure out how the features are distributed within each group. For instance, clustering can be used to detect subgenres in electronic music, to find subgroups in a customer database, or to identify areas with greater incidences of particular types of crime. Similarly, clustering can be used in chemistry to find groups of structurally similar compounds and the distribution of the compounds within each group. In this blog post, we illustrate clustering with the publicly available and well-known iris data set from the UC Irvine Machine Learning Repository.
From Supervised to Unsupervised Learning
The iris data set was published by Ronald Fisher in 1936 to introduce linear discriminant analysis, which is available as “Linear Discriminant Classification” in JASP. The data set contains 150 observations of iris flowers. Each row represents measurements of sepal width, sepal length, petal width, petal length, and under the column “species” you can find the type of iris: Setosa, Versicolor, or Virginica.
To highlight the difference between classification and clustering, recall that classification is a supervised learning task with the goal of finding the relationship between features and labels in order to make predictions. To do so, the supervised algorithm takes as input a training set consisting of both features (i.e., sepal width, sepal length, petal width, and petal length) and correct labels (i.e., the types “Setosa”, “Versicolor”, or “Virginica”); for further details, see, for instance, our previous blog post.
In contrast, an unsupervised learning algorithm only takes the features as input. Hence, no correct labels are needed for unsupervised learning, and without these labels, there is no relationship to find. Instead, the goal of unsupervised learning is to detect clusters that represent commonalities within the features. These clusters serve as a convenient summary of the data, which can be used for further inference. There are multiple methods for this task, and we have now implemented five of them in JASP, namely: “Density-Based Clustering”, “Fuzzy C-Means Clustering”, “Hierarchical Clustering”, “K-Means Clustering”, and “Random Forest Clustering”. We illustrate the underlying ideas of clustering further with the “K-Means Clustering” algorithm.
Follow along using the data file from the Data Library (Open – Data Library – 11. Machine Learning – Iris Flowers – Open data file on the right-hand side).
A First Run of the K-Means Algorithm
To simplify matters, we first run the K-means algorithm with the number of clusters fixed at three, thus, K = 3. We elaborate on the algorithm based on the output.
To train the algorithm in JASP, select Machine Learning in the ribbon at the top, followed by K-Means Clustering. In the input screen on the left-hand side, open the “Training Parameters” section. Under “Cluster Determination” select “Fixed” and enter 3 next to “Clusters”. To reproduce the results reported in this blog post, check the box “Set seed” and enter 1.
Now go to the top and select all feature variables (Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width), and press the arrow button next to the “Variables” box. JASP immediately starts computing, and the results can be found in Tables 1 and 2.
Under the hood K-means fits a model, and Table 1 shows the fit scores for the model with K = 3 clusters using the data set consisting of 150 cases. The reported R² is the ratio of the between sums of squares and the total sums of squares, which is also typically reported for ANOVA/regression models. A model with an R² close to the upper bound of one is perceived as a good-fitting model, whereas an R² close to the lower bound of zero indicates that the model fits poorly. The R², however, does not distinguish between a well-fitted model and a model that overfits.
The Akaike Information Criterion (AIC, Akaike, 1987) and the Bayesian Information Criterion (BIC, Schwarz, 1978) also measure the goodness-of-fit of a model, but on top of that penalize for the number of free parameters of the model. By penalizing for the number of parameters these information criteria try to safeguard against overfitting. Models that have a lower information criterion are perceived as models that generalize better.
Lastly, the Silhouette score displays the average internal consistency of the clustering by assessing how similar each case is with respect to its own cluster compared to other clusters. For Silhouette scores the general rule is that the closer it is to the upper bound of 1, the more consistent the clustering is, whereas Silhouette scores close to the lower bound of -1 indicate a bad match.
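For readers who want to see how these two fit scores come about, here is a minimal NumPy sketch on made-up two-dimensional data (our own illustration of the standard definitions, not JASP’s implementation):

```python
import numpy as np

def r_squared(X, labels):
    """R-squared: between sums of squares divided by total sums of squares."""
    grand_mean = X.mean(axis=0)
    total_ss = ((X - grand_mean) ** 2).sum()
    within_ss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                    for k in np.unique(labels))
    return (total_ss - within_ss) / total_ss

def silhouette(X, labels):
    """Mean silhouette score: (b - a) / max(a, b) per case, averaged."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    scores = []
    for i in range(n):
        own = labels == labels[i]
        # a: mean distance to the other cases in the same cluster
        a = dist[i, own & (np.arange(n) != i)].mean()
        # b: mean distance to the nearest other cluster
        b = min(dist[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated blobs of three cases each
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(round(r_squared(X, labels), 3), round(silhouette(X, labels), 3))
```

With two well-separated blobs, both scores land close to their upper bound of one, matching the interpretation given above.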
Table 2 shows the sizes of the clusters, the variability within each cluster in terms of the within sum of squares, and the centroids of each cluster in feature space. Note that the cluster sizes add up to 150, which equals the number of cases reported in Table 1.
Explaining the Algorithm
With K fixed at three, the K-means algorithm solves an allocation problem; that is, it decides for each of the 150 cases whether it belongs to cluster “1”, “2”, or “3”. The underlying modelling assumption is that data points that are close to each other (in feature space) belong to the same cluster. The center of such a cluster is referred to as a centroid. For instance, the centroid of the second cluster is at -1.011 in the sepal length dimension, 0.850 in the sepal width dimension, -1.301 in the petal length dimension, and -1.251 in the petal width dimension. Indeed, the feature space is four-dimensional.
To find these clusters, the K-means algorithm first randomly selects K centroids in feature space. It then iterates over two steps: In the first step, each data point is assigned the same cluster label as its nearest centroid. In the second step, the algorithm recomputes the centroids. For the new centroid of cluster “1” (respectively, “2” or “3”), the algorithm takes the average of all points that were assigned the cluster label “1” (respectively, “2” or “3”). With these new centroids, the algorithm then repeats the assignment and recomputation steps until no data point switches cluster label anymore.
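The two-step iteration can be sketched in a few lines of NumPy (a simplified Lloyd-style loop on made-up data; JASP’s default Hartigan-Wong variant differs in its details):

```python
import numpy as np

def k_means(X, centroids, max_iter=100):
    """Simplified Lloyd-style K-means for illustration.

    Assumes no cluster ever ends up empty, which holds for this toy data.
    """
    for _ in range(max_iter):
        # Step 1: assign each data point the label of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break  # centroids are stable, so no label can change anymore
        centroids = new_centroids
    return labels, centroids

# Toy data: two clumps of three points each, in two dimensions
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 4.0], [4.2, 4.1], [4.1, 4.3]])
start = X[[0, 3]]            # two cases picked as starting centroids
labels, centroids = k_means(X, start)
print(labels)                # [0 0 0 1 1 1]
```

After convergence, each returned centroid is the average of the cases that carry its label, exactly as in the verbal description above.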
For illustration purposes, we explain the algorithm with a feature space consisting of only two dimensions, say, “Sepal Width” and “Petal Length” with a data set consisting of only 21 cases, which are depicted as circles in Figure 1.
To further ease the discussion, we run the algorithm with K = 2 clusters, and call the first cluster “Blue squares” and the other “Orange triangles”. As the starting point of the algorithm, we take two random cases as centroids, as shown in Figure 2.
With a first set of centroids at hand, we can now iterate over the two steps. Recall that in the first step we assign each data point the same cluster label as its nearest centroid. This results in the assignment shown in Figure 3.
For the second step, we now compute new centroids. By taking the average of all blue squares (and, likewise, of all orange triangles), we get the new centroids shown in Figure 4. Figure 4 also shows the old centroids as black crosses. Note that the new centroids are not cases of our data set.
Figure 5 is the same as Figure 4, but with the crosses removed.
The algorithm is now ready to iterate over the two steps again. For the assignment step, we now check whether any data point gets a new label. Figure 6 shows that the second data point from the left changes cluster label from orange triangle to blue square.
Based on these new cluster labels the centroids are recalculated again, and Figure 7 shows that the “Blue squares” centroid moves to the south west, whereas the “Orange triangles” centroid moves east.
Figure 8 is the same as Figure 7, but with the old centroids removed.
With the centroids in Figure 8, we can once again assign the cases new cluster labels, but because all the cluster labels remain the same, the algorithm terminates. Figure 9 brings all the steps together.
The process with all four features is similar, but hard to visualize. To nonetheless get an idea of how the data clump together in a high-dimensional feature space, van der Maaten and Hinton (2008) developed the t-distributed stochastic neighbor embedding algorithm. This algorithm is used to produce a “t-SNE cluster plot”, which can be found under “Plots”, and the result is shown in Figure 10.
The t-SNE algorithm non-linearly projects (with high probability) similar data points in the higher dimensional feature space to points that are close to each other in the plane, and dissimilar data points in the higher dimensional feature space to points that are far apart from each other in the plane.
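Readers who want to experiment with such a projection outside JASP can use the t-SNE implementation in the scikit-learn library; here is a minimal sketch on artificial data (the data and parameter choices are ours, not JASP’s):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# Two artificial clumps in a four-dimensional feature space
X = np.vstack([rng.normal(0, 0.3, size=(25, 4)),
               rng.normal(3, 0.3, size=(25, 4))])

# Project to the plane; perplexity roughly controls the neighbourhood size
embedding = TSNE(n_components=2, perplexity=10, random_state=1).fit_transform(X)
print(embedding.shape)
```

Plotting the two columns of `embedding` against each other then shows the two clumps as well-separated groups of points in the plane.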
The Training Parameters Section
The algorithm explained above is known as the Lloyd-Forgy method, but the results shown for the iris data set were produced with the Hartigan-Wong algorithm, which is selected by default. The algorithms differ in their details: they handle ties differently, and they optimize in different ways. Furthermore, some of the algorithms might be more sensitive to the starting centroids. All these settings can be adjusted in the “Training Parameters” section under “Algorithmic Settings”.
By default, we advise scaling the variables, because otherwise the clustering algorithm picks up differences that are merely due to variables spanning different ranges. For instance, if some variables are measured in kilometers and others in meters (or even seconds), the clustering output would be adversely influenced without scaling.
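Scaling here means standardizing each variable to z-scores, which is easy to sketch (our own minimal example with made-up values):

```python
import numpy as np

# One feature in kilometers, one in meters: without scaling, the meters
# column dominates every distance computation
X = np.array([[1.2, 3400.0],
              [1.5, 2900.0],
              [0.9, 3100.0]])

# Standardize each column to mean 0 and standard deviation 1 (z-scores)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
```

After scaling, both columns contribute on equal footing to the distances that K-means minimizes.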
The Number of Clusters
In general, we do not know the number of clusters within a data set. To let the algorithm find the number of clusters, we open the “Training Parameters” section again and now choose “Optimized according to”, which by default uses the BIC as the selection measure. Further insight can be gained by checking “Elbow method” under “Plots”, which results in Figure 11.
As noted above, the information criteria AIC and BIC are measures of model fit, and internally JASP fits a model with 2-means, 3-means, up to 10-means. For each model the AIC, BIC, and Silhouette scores are computed, and the model with K = 5 has the lowest BIC score, which is why this model appears in the main table; see Table 3 below.
Figure 11 also shows the problem of R²: the within sums of squares decrease monotonically as K increases, thus, R² goes to one as K grows. Hence, selecting K based on R² is futile, as we would then end up with each data point in its own cluster, which defeats the purpose of finding a summary of the data.
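The degenerate end point of this trade-off is easy to verify with a few lines of NumPy (our own sketch): when every case forms its own cluster, each cluster mean is the case itself, so the within sums of squares vanish and R², the ratio of between to total sums of squares, is exactly one.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))  # ten made-up cases, four features

# Every case in its own cluster
labels = np.arange(10)
total_ss = ((X - X.mean(axis=0)) ** 2).sum()
within_ss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                for k in labels)
r2 = (total_ss - within_ss) / total_ss
print(within_ss, r2)  # 0.0 1.0
```

A perfect R² of one here summarizes nothing: the “clustering” is just the data set itself.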
Figure 11 also shows that we would choose a larger K if we were to optimize with respect to the AIC instead. More precisely, this leads to a model with K = 8.
Lastly, when selecting Silhouette under “Optimized according to” in the “Training Parameters” section, we get K = 2. In this case, this also corresponds to the model with the highest within sums of squares, but that is a mere coincidence.
The methods show three candidates for the number of clusters, namely, K = 5 when optimized with respect to BIC, K = 8 for AIC, and K = 2 for Silhouette, and one might wonder what the correct number of clusters K is. Based on the fact that the data set contains measurements of three types of irises, one might claim that K = 3 would be correct. This argument, however, presupposes that a classification task is the same as clustering, which is not the case. The goal of clustering is to find structures that adequately summarize the data. Based on the features alone, one cannot infer that there should be three clusters. Furthermore, for this data set it is generally agreed upon that there are two clusters; see also the t-SNE plot.
The truth of the matter is that for an unsupervised learning method, there are no correct clusters: only useful clusters. To decide whether a clustering is useful, one should use the clusters in a follow-up analysis. If the cluster information helps to predict better in a follow-up task, then it was useful. To extract the cluster information in JASP, open the “Training Parameters” section, select “Add predicted clusters to data”, and enter a column name. This will add an extra column to the data set, which can be used in other analyses, for instance, as a split factor in Descriptives.
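Outside JASP, the same idea amounts to splitting the data by the exported cluster column. A minimal sketch with hypothetical toy values (standard library only):

```python
from statistics import mean

# Toy data: one measured feature plus the exported cluster label per case
sepal_length = [5.1, 4.9, 6.3, 6.5, 5.0, 6.1]
cluster      = [1,   1,   2,   2,   1,   2]

# Per-cluster descriptives, analogous to using the column as a split factor
for k in sorted(set(cluster)):
    values = [x for x, c in zip(sepal_length, cluster) if c == k]
    print(k, round(mean(values), 2))
```

If the per-cluster summaries differ in a way that improves a downstream prediction, the clustering has earned its keep.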
This post is part of a three-part series on machine learning, be sure to also check out our previous post on classification, and the upcoming post on regression!
Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52(3), 317–332.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
van der Maaten, L. J. P., & Hinton, G. E. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.