Binary Classification

Imagine that you recently started to show COVID-19 symptoms. Naturally, you are worried and decide to have a diagnostic test. Unfortunately, the test result is positive. You learn that the test used gives a true positive result in 99 out of 100 cases. Thus, you conclude that you contracted the virus with a probability of 99%.

Such a conclusion would be based on a famous fallacy known as the “confusion of the inverse”. It is generally incorrect to equate the sensitivity of a test (i.e., the probability of a positive test result when the disease is present) with its positive predictive value (PPV; i.e., the probability of having the disease when the test result is positive). To appreciate the difference, note that the probability that you are bleeding, given that you have just been attacked by a shark, is close to one, whereas the probability of just having been attacked by a shark, given that you are bleeding, is almost zero.

The correct answer requires Bayes’ theorem, and examples similar to the one above are usually how students are first exposed to it. Bayes’ theorem teaches us that in order to obtain the PPV, we need two additional pieces of information: the test’s specificity and the prior (baseline) probability that one has the virus (i.e., the prevalence).
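To make the role of these two extra ingredients concrete, here is a minimal Python sketch of the calculation. The sensitivity of 0.99 comes from the example above; the specificity of 0.99 and the prevalence of 1% are purely illustrative assumptions, as is the function name.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), obtained with Bayes' theorem."""
    true_positive = sensitivity * prevalence               # P(test+ and disease)
    false_positive = (1 - specificity) * (1 - prevalence)  # P(test+ and no disease)
    return true_positive / (true_positive + false_positive)

# Sensitivity is taken from the example; specificity and prevalence are assumed values.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.99, specificity=0.99)
print(f"PPV = {ppv:.2f}")  # prints "PPV = 0.50"
```

Under these assumed values, a positive result from a test with 99% sensitivity would leave the probability of infection at only 50%, not 99%.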

The new version of JASP contains binary classification under the Learn Bayes module and makes it easy to demonstrate such situations interactively. However, while binary classification is a clear example to demonstrate Bayes’ theorem, it is in fact not a Bayesian example per se. Specifically, the typical binary classification example is presented with fixed values of the key quantities (i.e., prevalence, sensitivity, and specificity); in real-life applications, however, these quantities are unknown. The uncertainty surrounding these values needs to be quantified and propagated through our calculations. Because at JASP we value doing things in a proper Bayesian way but appreciate the simplicity of the typical example, binary classification is implemented in two ways: the classical one, where all parameters are fixed to a point value, and the Bayesian one, where all parameters are unknown and therefore associated with uncertainty. This blog post shows both implementations.

Classical binary classification

The classical binary classification example is activated when users choose the Input type → Point estimates option in the analysis menu. This setting is ideal for presenting the use and logic of Bayes’ theorem. Here, users define point estimates for the three key quantities: prevalence, sensitivity, and specificity. In this example, we leave the inputs at their default values, shown in Figure 1.

Figure 1. Binary classification input panel.

Once these three key quantities are set, and provided that the Statistics box is checked, JASP outputs a table with various statistics derived from these quantities. The table also shows the interpretation of the statistics. The first three rows repeat the input values. The remaining rows present statistics such as PPV and Negative Predictive Value (NPV), which convey the probability of a positive (or negative) condition given that the test result is positive (or negative), respectively.

Figure 2. Statistics table.

When the Confusion Matrix checkbox is selected, JASP outputs the same information in a confusion matrix format (shown in Figure 3), where test result outcomes are on the vertical axis and possible conditions are on the horizontal axis. By default, the Additional info button is selected, and the labels of several statistics are displayed in the corresponding cells. However, users can deselect the Additional info button to display a classical 2×2 confusion matrix showing True Positives, False Positives, False Negatives, and True Negatives. Selecting the Number radio button displays the corresponding numbers in the cells. The Both radio button displays both the numbers and the accompanying labels at the top to ease interpretation:

Figure 3. Confusion matrix.

The section Plots offers visualisations of the information. Let’s go over the plot options one-by-one:

Probability Positive. This plot (shown in Figure 4) is one of the basic plots that shows the critical feature of the binary classification example. Essentially, this bar-chart conveys to the users the probability of a positive condition (i.e., COVID-19 infection) given the test result outcome (i.e., Positive or Negative), the baseline rate, and the test characteristics (i.e., Sensitivity and Specificity).

The salmon-pink columns labeled as “Not-tested” correspond to the prior probability of an individual being infected with the disease before being tested. The prior probability is equivalent to the specified prevalence.

The turquoise columns labeled as “Tested” show the posterior probability that the condition is positive after a negative and a positive test result, respectively.

For example, Figure 4 shows that the probability of a positive condition given a positive test result (see the turquoise column on the right-hand side) would be 30.8%, as also indicated in the tables. Despite a positive test result from a test with relatively high sensitivity and specificity (80% for both quantities), the posterior probability of infection is only 30.8%, which shows the influence of the low prevalence (10%). On the other hand, after a negative test result, the probability of a positive condition fell far below the prevalence of the disease.

Figure 4. Probability positive plot.

Icon plot. The icon plot (Figure 5) aids understanding by showing particular numbers of people being healthy, infected, diagnosed positively, and diagnosed negatively. If 100 people were living in a context where the prevalence of a disease is 10%, then 90 people would be healthy and 10 people would be infected. A perfect test (with sensitivity and specificity equal to one) would correctly classify these 90 people as healthy and the remaining 10 people as infected. In reality, imperfect tests produce false negatives and false positives. Let’s see what happens with a test that has both sensitivity and specificity equal to 0.8. Out of the 10 people who are infected, the test correctly identifies 8 of them (True Positives – Green Icons) but misses 2 infected people (False Negatives – Red Icons). Similarly, out of 90 healthy individuals, the test correctly diagnoses 72 of them as healthy (True Negatives – Blue Icons). However, the remaining 18 healthy people are incorrectly diagnosed with the disease (False Positives – Yellow Icons). If we condition on people having positive tests, we count 26 people who have tested positive, of which only 8 actually contracted the disease; thus, after testing positive for the disease, the probability of having the disease is 8/26 = 30.8%.

Figure 5. Icon plot
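The head count behind the icon plot can be reproduced in a few lines of Python. This is only a sketch of the arithmetic described above, using the default inputs from this post; the variable names are our own.

```python
population = 100
prevalence, sensitivity, specificity = 0.1, 0.8, 0.8

infected = round(population * prevalence)        # 10 people
healthy = population - infected                  # 90 people
true_positives = round(infected * sensitivity)   # 8 infected, test positive
false_negatives = infected - true_positives      # 2 infected, test negative
true_negatives = round(healthy * specificity)    # 72 healthy, test negative
false_positives = healthy - true_negatives       # 18 healthy, test positive

# Condition on a positive test: only the positive tests count.
positive_tests = true_positives + false_positives
ppv = true_positives / positive_tests
print(f"{true_positives}/{positive_tests} = {ppv:.1%}")  # prints "8/26 = 30.8%"
```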

ROC. As the previous section briefly mentioned, we usually do not have perfect tests. In practice, sensitivity and specificity cannot both be maximized at the same time, because they are inversely related: increasing sensitivity comes at the cost of decreasing specificity, and vice versa. We can imagine this as moving a threshold that defines what is considered a positive and a negative case. If we set the threshold relatively leniently, we detect more true positive cases, but at the cost of polluting the pool of positive test results with false positives.

A Receiver Operating Characteristic Curve (ROC, Figure 6) illustrates the diagnostic ability of a binary classifier when this threshold is varied (i.e., it shows the trade-off between sensitivity and specificity). In this example, we imagine the two sub-populations (positive and negative cases) to be distributed as two normal distributions. The difference between the locations of the two distributions determines how well these sub-populations are separated and thus how easy they are to classify. We then imagine moving the test threshold from left to right and recording 1 − specificity on the x-axis and sensitivity on the y-axis. A perfect test would be so good at discriminating between positive and negative cases that there would be no trade-off, and the plot would show a single dot in the upper-left corner. If the test were essentially useless at discriminating between positive and negative cases, we would see a perfect trade-off displayed in the plot as a diagonal line. Figure 6 shows that our ROC lies somewhere in between these two extremes. The actual location of the test at hand (0.8 sensitivity and 0.8 specificity) is indicated by the grey dot.

Figure 6. Receiver Operating Characteristic Curve.
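The two-normal-distributions picture can be sketched in Python to trace such a curve. The parametrization below is an assumption made for illustration (negatives follow a standard normal, positives a unit-variance normal shifted by `separation`) and is not necessarily how JASP draws the plot; the function name is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def binormal_operating_point(threshold, separation):
    """Return (1 - specificity, sensitivity) for a given threshold.

    Assumes negatives ~ N(0, 1) and positives ~ N(separation, 1);
    scores above the threshold count as positive tests.
    """
    specificity = std_normal.cdf(threshold)
    sensitivity = 1 - std_normal.cdf(threshold - separation)
    return 1 - specificity, sensitivity

# Pick the threshold and separation that give sensitivity = specificity = 0.8.
threshold = std_normal.inv_cdf(0.8)
separation = threshold - std_normal.inv_cdf(1 - 0.8)

# Sweep the threshold from lenient (left) to strict (right) to trace the curve.
roc_points = [binormal_operating_point(t / 10, separation) for t in range(-30, 51)]

fpr, tpr = binormal_operating_point(threshold, separation)
print(f"test at hand: 1 - specificity = {fpr:.2f}, sensitivity = {tpr:.2f}")
```

Each pair in `roc_points` is one point on the curve; a larger `separation` pushes the curve toward the upper-left corner, while a separation of zero gives the diagonal.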

Test characteristics by threshold. A similar demonstration of the trade-off between sensitivity and specificity is to plot them as a function of the test threshold, instead of plotting them against each other. Figure 7 shows how the test characteristics change as the threshold varies: when the test is made more conservative (higher threshold), specificity increases but sensitivity decreases, and vice versa.

Figure 7. Test characteristics by threshold.

PPV and NPV by prevalence. This plot (Figure 8) shows how positive predictive value (PPV) and negative predictive value (NPV) change as a function of prevalence when the test characteristics are kept constant. The dashed vertical line corresponds to user-defined prevalence, which is 0.1 by default. With the default prevalence and the specified test characteristics, a positive test result indicates a PPV of 0.308 (where the dashed line crosses the red curve). A negative test result reflects an NPV of 0.973 (where the dashed line crosses the blue curve). This underscores that we should not interpret a test outcome at face value: when a disease is rare, a positive test result does not mean a high PPV. However, if prevalence increased to 0.50, a positive test result would lead to a PPV of 0.8!

Figure 8. Positive predictive value and negative predictive value by prevalence.
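The curves in this plot follow directly from Bayes’ theorem. The following Python sketch evaluates PPV and NPV at a handful of prevalence values while holding sensitivity and specificity at 0.8; the function name and the chosen grid of prevalences are illustrative.

```python
def predictive_values(prevalence, sensitivity=0.8, specificity=0.8):
    """Return (PPV, NPV) for a given prevalence and fixed test characteristics."""
    tp = sensitivity * prevalence              # true positives
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    return tp / (tp + fp), tn / (tn + fn)

for prevalence in (0.01, 0.1, 0.5, 0.9):
    ppv, npv = predictive_values(prevalence)
    print(f"prevalence {prevalence:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

The row for a prevalence of 0.10 reproduces the PPV of 0.308 and the NPV of 0.973 quoted above, and the row for 0.50 gives a PPV of 0.800.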

Alluvial plot. Another visualization that offers insight into the example is the alluvial plot (shown in Figure 9). This plot shows two bars connected with colored lines. The bar on the left shows the proportion of the population whose condition is positive or negative. The right bar shows the proportion of the population whose test is positive or negative. In the current setting, because the positive condition is rare, the pool of positive tests is composed primarily of false positives, despite the test having relatively good characteristics. It is also easy to see that there is not always a one-to-one correspondence between the proportion of positive cases and the proportion of positive tests. In the current setting, if one estimated prevalence by taking the proportion of positive tests at face value, this would yield a prevalence estimate of 26%, despite it actually being 10%.

Figure 9. Alluvial plot.
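The gap between the share of positive tests and the true prevalence is easy to verify numerically. The sketch below also inverts the relationship to recover the prevalence from the positive-test rate; this correction is known as the Rogan-Gladen estimator and is added here only as an illustration, not as something the JASP module reports.

```python
prevalence, sensitivity, specificity = 0.1, 0.8, 0.8

# Share of the population that tests positive (true positives + false positives).
positive_rate = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Rogan-Gladen correction: solve the line above for prevalence.
corrected = (positive_rate + specificity - 1) / (sensitivity + specificity - 1)

print(f"positive tests: {positive_rate:.2f}, corrected prevalence: {corrected:.2f}")
# prints "positive tests: 0.26, corrected prevalence: 0.10"
```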

Signal Detection. This plot (Figure 10) shows the densities of the sub-population that is negative (i.e., the left normal distribution) and the sub-population that is positive (i.e., the right normal distribution). As the prevalence of the disease is set to a relatively low rate, the area under the curve of the negative sub-population is larger.

Importantly, these two distributions overlap. If they did not overlap, it would be straightforward to set a threshold value (dashed line) that correctly discriminates the condition of anyone without error. However, the fact that there is overlap means that no error-free criterion is available, and errors are unavoidable.

Any diagnostic test that yields a value above the defined criterion classifies a person as infected. Otherwise, they are labeled as healthy. If the primary interest were to correctly identify every healthy person, we would need to set the criterion value to the end of the right tail of the healthy subpopulation normal distribution. In this case, there would be no false positives (i.e., all healthy individuals would be correctly labeled as healthy, leaving no yellow zone on the graph). The price that is paid for such high specificity is an abundance of false negatives (i.e., many infected people would be incorrectly classified as healthy). Choosing a very lenient criterion (i.e., setting the criterion at the end of the left tail of the infected subpopulation normal distribution), on the other hand, avoids any false negatives, but at the expense of a severely inflated false positive rate.

Figure 10. Signal detection plot.

The Bayesian binary classification

Nothing presented so far was Bayesian, in the sense that the key quantities (prevalence, sensitivity, and specificity) were fixed to a point value. In real life, these quantities are unknown and therefore associated with uncertainty. If we propagate this uncertainty through our calculations, we discover that the ambiguity in our inferences drawn from a diagnostic tool may be surprisingly large. To take the uncertainty into account, we only need to specify the three key quantities as probability distributions instead of fixed points, as shown in Figure 11. The prevalence, sensitivity, and specificity are now unknown parameters that are bounded between zero and one, and the probability distribution that specifies the plausibility of their specific values is the beta distribution. Additionally, if we have data on the test in the form of the numbers of true positive, false positive, false negative, and true negative cases, we can update these distributions to reflect our knowledge about the parameter values. This updating uses the conjugacy of the beta distribution, and the updated distributions are shown when the Priors and posteriors option is checked (the resulting table is shown in Figure 12).

Figure 11. To specify a Bayesian binary classification example, prevalence, sensitivity, and specificity are defined as unknown parameters with a probability distribution. This distribution may be updated if we observe additional data.
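Conjugate updating of a beta distribution amounts to adding counts to its two shape parameters. The Python sketch below shows the idea under some loudly stated assumptions: the uniform Beta(1, 1) priors and the data counts are hypothetical, and updating prevalence from the condition counts presumes the data are a representative sample of the population. Whether JASP performs the update in exactly this way is also an assumption.

```python
def update_beta(prior, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial counts gives a Beta posterior."""
    a, b = prior
    return a + successes, b + failures

# Hypothetical data and uniform Beta(1, 1) priors; not taken from JASP.
tp, fp, fn, tn = 8, 18, 2, 72
prior = (1, 1)

posteriors = {
    "prevalence": update_beta(prior, tp + fn, fp + tn),  # positive vs. negative conditions
    "sensitivity": update_beta(prior, tp, fn),           # among positive conditions
    "specificity": update_beta(prior, tn, fp),           # among negative conditions
}
for name, (a, b) in posteriors.items():
    print(f"{name}: Beta({a}, {b}), posterior mean = {a / (a + b):.3f}")
```

Note that sensitivity is informed only by the 10 positive-condition cases, so its posterior remains much wider than that of specificity.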

Now, we are all set to propagate the uncertainty through our calculations: if we decide to show the statistics and confusion matrix tables, we will see that they contain credible intervals that quantify the uncertainty in our estimates. The plots are also modified to show uncertainty where appropriate. For example, selecting Estimates plot and choosing to display Prevalence, Sensitivity, Specificity, Positive predictive value, and Negative Predictive Value will result in the plot shown in Figure 13, where the entire probability distributions and the credible intervals are displayed in addition to the traditional point estimates. The plot shows that the uncertainty about sensitivity is relatively high. Because prevalence is low, the positive condition cases needed to estimate sensitivity are relatively rare in the sample compared to the negative condition cases needed to estimate specificity. This uncertainty also propagates into large uncertainty in PPV, whereas NPV is associated with relatively less uncertainty.

Figure 12. Priors and posteriors.

Figure 13. Plots of estimates show the entire distributions of the quantities of interest
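One simple way to propagate such uncertainty is Monte Carlo sampling: draw the three key quantities from their distributions, compute the PPV for every draw, and summarize the resulting values. The Python sketch below illustrates this approach; the beta shape parameters are hypothetical, and we do not claim that JASP computes its credible intervals this way.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical beta distributions (shape parameters a, b) for the three key quantities.
posteriors = {"prevalence": (11, 91), "sensitivity": (9, 3), "specificity": (73, 19)}

ppv_draws = []
for _ in range(20_000):
    prev = random.betavariate(*posteriors["prevalence"])
    sens = random.betavariate(*posteriors["sensitivity"])
    spec = random.betavariate(*posteriors["specificity"])
    tp, fp = sens * prev, (1 - spec) * (1 - prev)
    ppv_draws.append(tp / (tp + fp))  # Bayes' theorem applied to each draw

# Summarize the draws with the median and a 95% central interval.
ppv_draws.sort()
lower = ppv_draws[int(0.025 * len(ppv_draws))]
median = ppv_draws[len(ppv_draws) // 2]
upper = ppv_draws[int(0.975 * len(ppv_draws))]
print(f"PPV median {median:.2f}, 95% central interval [{lower:.2f}, {upper:.2f}]")
```

The same loop can record NPV or any other derived statistic, which is what makes sampling a convenient way to carry uncertainty through the calculations.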

Conclusion

What is usually relevant (at least for the individual concerned) is the probability of the disease given the symptoms, not the probability of the symptoms given the disease. To obtain this probability of interest, we need both the test characteristics (i.e., sensitivity and specificity) and the base rate (i.e., the prevalence). This kind of binary classification example illustrates the use of Bayes’ theorem. Bayesian statistics “gets the conditioning right”. Apart from obtaining the probability of interest, Bayesian inference respects the uncertainty associated with the key quantities and propagates it through the appropriate calculations. We hope that the binary classification tool under the Learn Bayes module in JASP 0.15 will be helpful to those who wish to learn about Bayes’ theorem, and to those who wish not to fall into the “confusion of the inverse” trap again.

About The Authors

Emir Erhan

Emir is a Research Master’s (Psychology) student at the University of Amsterdam.

Šimon Kucharský

Šimon is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.