Why uncertainty is not a nice-to-have but a need-to-have.

Currently, the dominant approach in machine learning is based on maximum likelihood estimation and stems from **frequentist** statistics. It can be thought of as a point estimate of the model parameters. The main benefit of this approach in machine learning is that the models can have a higher throughput, but it very often results in models that are overconfident and that do not generalize to data outside of what they were trained on.

The opposite of this is **Bayesian** statistics, which estimates the model's parameters using Bayes' theorem. What is unique about Bayesian statistics is that all observed and unobserved parameters in a statistical model are given a joint probability distribution, resulting in a distribution of potential values for each parameter in the model. This results in a model that accurately quantifies the uncertainty in its predictions and tends to generalize better to new data, but at the cost of lower model throughput.
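Concretely, Bayes' theorem combines a prior belief about the model parameters (here written θ, with D the observed data; our notation, not specific to any model) with the likelihood of the data to give the posterior distribution over the parameters:

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
```

The posterior p(θ | D) is exactly the distribution of potential parameter values mentioned above; the denominator p(D) simply normalizes it.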

In order to illustrate what this actually means, we trained a very basic model named **LeNet5** on a dataset called **MNIST**, which consists of handwritten digits, 0 to 9. The task is simply, for a given picture of a handwritten digit, to predict which digit it is. We set up a **Bayesian** and a **frequentist** LeNet5 model and train both on the MNIST data. In this setup both models predict a probability distribution over the different digits that sums to one, but normally in today's AI/ML one just chooses the most probable class as the classification. In Fig. 1 we send in a handwritten *8* and see the distribution of the probabilities. In this case, both the frequentist and the Bayesian model classify it as an *8*, but the Bayesian model also gives some probability to it being a *3*, *4* or *6*, while by far the most likely digit remains *8*.
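The difference between the two predictions can be sketched numerically. The snippet below is a toy illustration, not our actual LeNet5 pipeline: the logits are made up, and the Bayesian predictive distribution is approximated in a common way, by averaging the softmax over many stochastic forward passes (as Monte Carlo dropout does), here simulated by perturbing the logits with noise that stands in for weight uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits a LeNet5 might produce for a handwritten 8;
# classes 3, 4 and 6 share strokes with 8, so their logits are closer.
logits = np.array([0., 0., 0., 2.0, 1.5, 0., 1.8, 0., 4.0, 0.])

# Frequentist-style prediction: one deterministic forward pass.
p_freq = softmax(logits)

# Bayesian-style prediction: average the softmax over T stochastic
# passes; the noise on the logits stands in for parameter uncertainty.
T = 1000
p_bayes = np.mean([softmax(logits + rng.normal(0, 1.0, size=10))
                   for _ in range(T)], axis=0)

print(int(p_freq.argmax()), int(p_bayes.argmax()))  # both still pick class 8
# The averaged (Bayesian-style) distribution puts less mass on the top
# class and spreads more onto the look-alike digits 3, 4 and 6.
```

Both distributions sum to one; the averaging is what spreads the Bayesian model's probability mass across the plausible alternatives.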

Fig 1) To the left, the input image of a hand-drawn 8 fed to the MNIST model, and to the right the probabilities of the frequentist and Bayesian models.

In order to understand the dangers of today's AI/ML, one needs to see what happens when the models are used on images that are slightly different from what they were trained on. Since the model has only been trained on single digits, we send in the letter *X* to see how the model reacts to different data. This can be seen in Fig. 2. In this case the frequentist model is again very confident, giving a probability of over 80 % to it being an *8*. The Bayesian model also predicts *8*, but with a probability of just north of 30 %, and it also gives a high probability to it being a *4*, *5* or *7*.

Fig 2) To the left, the input image of a hand-drawn X fed to the MNIST model, and to the right the probabilities of the frequentist and Bayesian models.

The reason this happens lies in how the model works, in combination with whether one uses a frequentist or a Bayesian approach. A model that classifies images does so by learning to identify specific features; in the case of an *8* these could be features such as the small x that forms where the loops stack together, a line from upper right to lower right, a line from upper right to lower left, and arch segments at the top and bottom.

Since the *X* also has some of those features, it is understandable why the digit *8* gets the highest probability from both models. But because the frequentist model relies on a maximum likelihood approach, it becomes very confident based on the presence of just a subset of the needed features. The Bayesian model, on the other hand, becomes uncertain and signals that the image contains features that are common to *4*, *5*, *6* and *8*, but that it is not confident about which class it should be.

It is this overconfidence based on the presence of just a subset of features that makes it dangerous to use today's machine learning models in high-stakes situations. One needs a model that properly quantifies uncertainty, and this is why Bayesian models are so important in computer vision, especially for high-stakes applications: they enable the model to communicate what it is uncertain about and what it does not know.

There are two main types of uncertainty a **Bayesian** model can quantify. The first is **aleatoric** uncertainty, which refers to the variability in outcome when the same experiment is run multiple times. This is what was illustrated above when we fed an X into the model for handwritten digits: the probability was distributed fairly evenly across several potential digits for the Bayesian model.

The other type of uncertainty is **epistemic** uncertainty, which can be thought of as systematic uncertainty. It quantifies things one could in principle know but does not in practice. This may be because a measurement is not accurate, because the model neglects certain effects, or because some particular data is missing from the model.
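Given Monte Carlo samples from a model's predictive distribution, the two kinds of uncertainty can be separated numerically. The sketch below uses a common decomposition (assuming we already have T sampled probability vectors): the total predictive entropy splits into the aleatoric part, the mean entropy of the individual samples, and the epistemic part, their difference (the mutual information between prediction and parameters).

```python
import numpy as np

def entropy(p, axis=-1):
    # Shannon entropy in nats; the epsilon guards against log(0)
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

def decompose_uncertainty(samples):
    """samples: array of shape (T, n_classes), each row a probability
    vector drawn from the (approximate) posterior predictive."""
    mean_p = samples.mean(axis=0)
    total = entropy(mean_p)               # total predictive uncertainty
    aleatoric = entropy(samples).mean()   # expected entropy of each draw
    epistemic = total - aleatoric         # mutual information (>= 0)
    return total, aleatoric, epistemic

# Hypothetical posterior samples: the model flip-flops between two
# confident but contradictory predictions -> mostly epistemic.
disagreeing = np.array([[0.98, 0.01, 0.01],
                        [0.01, 0.98, 0.01]] * 50)
# Every draw is itself maximally uncertain -> purely aleatoric.
noisy = np.tile([1/3, 1/3, 1/3], (100, 1))

t1, a1, e1 = decompose_uncertainty(disagreeing)
t2, a2, e2 = decompose_uncertainty(noisy)
print(e1 > a1)   # prints True: disagreement shows up as epistemic
print(e2 < 1e-6) # prints True: unanimous draws have no epistemic part
```

Rolling more data into the model shrinks the epistemic term; the aleatoric term stays, because it lives in the data itself.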

A simple example of this: someone rolls a die 4 times and gets 2, 4, 4, 6. What should the model for the possible numbers the die can give be? Most dice have 6 sides with values 1 to 6, but maybe this one has 8 sides, or the values are not 1 to 6 but rather 2, 4, 6, 8, 10, 12. One can improve the model simply by rolling the die more times; this is what epistemic uncertainty quantifies.
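The die example can be worked through with Bayes' theorem directly. The sketch below is our own illustration: three hypothetical die models, a uniform prior over them, and the posterior after observing the rolls 2, 4, 4, 6.

```python
from fractions import Fraction

rolls = [2, 4, 4, 6]

# Three hypothetical models for the die, each a set of equally likely faces.
models = {
    "six-sided, 1-6":       {1, 2, 3, 4, 5, 6},
    "eight-sided, 1-8":     {1, 2, 3, 4, 5, 6, 7, 8},
    "six-sided, even 2-12": {2, 4, 6, 8, 10, 12},
}

def likelihood(faces, rolls):
    # Probability of the observed rolls if every face is equally likely;
    # zero as soon as one roll is impossible under the model.
    p = Fraction(1)
    for r in rolls:
        p *= Fraction(1, len(faces)) if r in faces else 0
    return p

# Uniform prior over the three models, then Bayes' theorem.
prior = Fraction(1, len(models))
unnorm = {name: prior * likelihood(faces, rolls)
          for name, faces in models.items()}
evidence = sum(unnorm.values())
posterior = {name: p / evidence for name, p in unnorm.items()}

for name, p in posterior.items():
    print(f"{name}: {float(p):.3f}")
```

The rolls are equally likely under both six-sided models, so they tie at about 43 % each, while the eight-sided die drops to roughly 14 % because it spreads its probability over more faces. More rolls would sharpen the posterior further; that shrinking spread over models is the epistemic uncertainty.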

To better illustrate this we use Fig. 3, which shows two patients (left and right) asking our AI physician whether they have cancer. The upper section shows what happens when the physician is built on traditional (frequentist) AI/ML methods, i.e. without uncertainty. Since there is no uncertainty in the frequentist case, he will give the same answer every time he is asked about the same patient, whereas the Bayesian can give different answers. This is visualized in Fig. 3 by showing only one patient to the frequentist doctor and two patients to the Bayesian doctor.

The patient to the left shows the effect of epistemic uncertainty, i.e. how confident the AI physician is in his diagnosis. The frequentist doctor has no way of knowing how confident he is in his prediction, so most of the time he will be right, but sometimes he will misdiagnose the patient. The Bayesian doctor knows when he is confident in his predictions and when his predictions will vary, and thus knows when further testing is needed.

The patient to the right shows the effect of aleatoric uncertainty. For this patient the epistemic uncertainty (unlike for the patient to the left) is low, but the aleatoric uncertainty is high. This corresponds to our AI physician being very sure that he does not know the correct diagnosis and that the patient should be referred to an expert.

Fig 3) Comparison of a frequentist and Bayesian AI physician and the effect of aleatoric and epistemic uncertainty.

So if one now has a model that properly quantifies uncertainty, the key is to use that uncertainty properly in the given problem domain. If one just takes the most likely class, one is back in the frequentist case with all its poor decisions.

Based on our AI physician case, say our model looks at chest x-rays and determines whether the patient has cancer, a simple yes-no prediction. As we saw in the MNIST example, a frequentist model does not properly estimate the uncertainty but tends to become overconfident when a dominant feature is present, disregarding the fact that other features are missing. This means that a frequentist model can usually achieve high predictive accuracy, but in some cases it will be way off without signaling that it might be.

This might be acceptable when classifying items as squares or circles on a conveyor belt, where a wrong prediction costs very little, but it is unacceptable when diagnosing cancer: a false negative can literally cost someone's life, and a false positive will create enormous emotional distress. This is where one needs a Bayesian model that quantifies the uncertainty correctly and provides both the aleatoric and the epistemic uncertainty, so that the model can communicate when it is uncertain and the x-ray can be sent on to an expert for further analysis.

In these critical situations, the aim is not to build an entirely autonomous system that gives yes-no answers directly to the patient, but rather a system that can say yes or no when it is very certain, and that knows when an expert opinion or a new scan is needed. One can also help build even better tooling for the experts by moving from a simple image classification model that classifies cancer yes-no to a model that segments the image, locating the regions that are cancerous.

Now the model provides a prediction of the class for every pixel in the image, not just a class for the whole image. Today's AI/ML models can give this per-pixel prediction too, but they struggle with the same overconfidence as explained previously. With a Bayesian model, we also get the uncertainty in each pixel, allowing us to see where in the image the model is uncertain.
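A per-pixel uncertainty map can be computed the same way as for whole-image classification, just independently at every pixel. The sketch below is a toy version (assuming we already have T stochastic segmentation outputs, e.g. from MC dropout, with made-up logits): it averages the T probability maps and takes the per-pixel entropy as the uncertainty image.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stack of T stochastic forward passes from a Bayesian
# segmentation model: shape (T, H, W, n_classes), softmax per pixel.
T, H, W, C = 20, 4, 4, 2
logits = rng.normal(size=(T, H, W, C))
logits[:, :2, :, 1] += 5.0  # model is confident in the top half only
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Predictive map: average over the T passes, per pixel.
mean_probs = probs.mean(axis=0)            # (H, W, C)
segmentation = mean_probs.argmax(axis=-1)  # (H, W) class map

# Per-pixel uncertainty: entropy of the averaged distribution.
uncertainty = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=-1)

# The confident top half shows much lower entropy than the noisy bottom.
print(uncertainty[:2].mean() < uncertainty[2:].mean())  # prints True
```

Rendered as an image, `uncertainty` is exactly the kind of heat map described around Fig. 4: bright where the model's passes disagree, dark where they agree.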

Fig 4) Uncertainty for semantic segmentation.

In Fig. 4 one can see an example of semantic segmentation of a medical scan, where the prediction of cancer yes-no is based on it being predicted anywhere in the image. On the left is the prediction one would see from standard AI/ML today; on the right is the pixel-level uncertainty a Bayesian model would provide in addition. There one can clearly see how confident the model is in its prediction, and the uncertainty also gives valuable feedback to an expert, pointing out what is unusual in the image and warrants further investigation.

In high-stakes problems such as medical imaging, using today's AI/ML approach based on maximum likelihood estimation is outright dangerous. One needs to properly estimate the uncertainty and include it in the decision-making process. That is why we at Desupervised have put a lot of thought and effort into taking current state-of-the-art frequentist models and adding proper uncertainty estimation to them via Bayesian inference.

This is AI when it really matters.