Dr. Robert Kübler


Get Uncertainty Estimates in Regression Neural Networks for Free

Given the right loss function, a standard neural network can output uncertainty as well

Photo by Christina Deravedisian on Unsplash

Whenever we build a machine learning model, we usually design it in such a way that it outputs a single number as the prediction. Most models from scikit-learn work like this: tree-based models, linear models, nearest neighbor algorithms, and more. The same goes for XGBoost and the other boosting algorithms, as well as for deep learning frameworks such as Tensorflow or PyTorch.

While this is often fine, it would be better to have a measure of uncertainty around this point estimate as well. This is because the difference between “We will sell 1000±50 cars” and “We will sell 1000±5000 cars” is tremendous: From the first statement, you can conclude that the company will sell around 1000 cars, give or take, while the second statement tells you that the model has no clue at all.

Note: A lower uncertainty does not mean that the model is right. Like people, a model can be very opinionated about something completely wrong. So, as usual, assessing the model’s quality is essential in this case, too.

Bayesian Inference

If you have read my articles about Bayesian inference (thanks!) you already know how to create models that output not only a single point, but a complete target distribution instead.

Image by the author.

Because of this, we can see when the model is uncertain just by looking at the predicted distribution, or some derived number such as the standard deviation. The narrower the distribution (the smaller the standard deviation), the more certain the model is.

However, we will not go Bayesian this time. While Bayesian inference is a great field that you should study at some point, it has several shortcomings:

  • it is computationally even more involved than neural networks
  • it is harder to understand mathematically, and
  • you have to learn about new libraries.

So, this article is for the people that know their deep learning frameworks and want to include some uncertainty estimates without much hassle. However, if you feel like going into the realm of fully Bayesian neural networks at some point, try out libraries like Tensorflow Probability or Pyro for PyTorch.

Make Neural Networks Reveal Their Uncertainties

Disclaimer: Again, I do not know if the following method was presented in any paper or book. It just came to my mind and I wanted to write about it. If you know any source, please give me a hint in the comments and I will add it to the article. Thanks! Update: Matias Valdenegro Toro pointed out that the loss that I will introduce soon is called variance attenuation in their paper.

From now on, let us stick with our favorite bread-and-butter neural networks. I will use Tensorflow code in this article, but you can easily adapt everything to PyTorch or other frameworks as well. We will consider a regression problem here, but similar arguments can be made for classification tasks, too.

Deriving the Mean Squared Error

In order to understand how to get uncertainty estimates, we have to understand how to get point estimates first. We will then generalize this idea in a simple fashion. So, for a small recap, the following is the mean squared error (MSE) loss function:

Image by the author.
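The formula image is not reproduced in this text version. Written in the notation used in this section (n observations, true values yᵢ, predictions ŷᵢ), the standard definition of the MSE is:

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```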

It makes sense intuitively: the larger the gap between some true value yᵢ and the model’s prediction ŷᵢ, the higher the loss. But the same argument works if we replace the 2 in the exponent with a 4, or drop the 2 and use the absolute value |yᵢ − ŷᵢ| instead (mean absolute error, MAE).
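To make the comparison concrete, here is a small sketch that evaluates the three candidate losses on made-up numbers. The function name `power_loss` and the toy values are assumptions for illustration and are not part of the original article.

```python
def power_loss(y_true, y_pred, power):
    """Mean of |y - y_hat|**power: power=2 is the MSE, power=1 the MAE."""
    n = len(y_true)
    return sum(abs(y - y_hat) ** power for y, y_hat in zip(y_true, y_pred)) / n

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 6.0]  # residuals: -0.5, 0, 1, -2

mse = power_loss(y_true, y_pred, 2)
mae = power_loss(y_true, y_pred, 1)
quartic = power_loss(y_true, y_pred, 4)
print(mse, mae, quartic)
```

All three losses grow with the gap; they differ in how strongly large gaps are punished. In this toy example the fourth-power loss is dominated almost entirely by the single residual of size 2.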

So, what is special about the MSE? Which assumptions go into it? Let us find out. ⚠️Danger: Math ahead. If this is too much, just skip to the implementation section. The results are easy to apply, even if you cannot follow the theory yet.⚠️

The assumption is the following:

Given input features x, the true label y is distributed according to a normal distribution with mean μ(x) and standard deviation σ, i.e. y~N(μ(x), σ²). This means that the observed labels come from some true value μ(x), but got corrupted by some error with a standard deviation of σ. This error is also called noise. Note that very often we write ŷ instead of μ(x).

The task of a neural network (and most other models) is then to predict this μ(x) given x. This makes predictions right on average, and this is the best thing we can do because we are not able to predict the noise. Now, the expression y~N(μ(x), σ²) just means the following:

Image by the author.
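The formula image is missing here as well. The density of a normal distribution with mean μ(x) and standard deviation σ, which is what the sentence above refers to, is:

```latex
f(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\left( y - \mu(x) \right)^2}{2\sigma^2} \right)
```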

This is just the density function of the normal distribution with mean ŷ=μ(x) and standard deviation σ that describes the distribution of a single label y. Now, we don’t have a single observation y and its corresponding prediction ŷ, but several, let’s say n. Assuming that all observations are stochastically independent, we get

Image by the author.
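In place of the missing image, here is one way to write it: under independence the joint density is the product of the individual densities, and the product of exponentials collapses into a single exponential (the rightmost term).

```latex
f(y_1, \ldots, y_n \mid x_1, \ldots, x_n)
  = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\left( y_i - \mu(x_i) \right)^2}{2\sigma^2} \right)
  = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{n} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \mu(x_i) \right)^2 \right)
```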

Training a neural network now basically means something that statisticians call maximum-likelihood estimation. This is a fancy way of saying that we want to maximize the above density function, also called the likelihood function.

Now, we can connect the maximum-likelihood estimation to the MSE minimization like this:

  1. Maximizing the likelihood function
  2. means maximizing the rightmost term
  3. means maximizing the exponent of e
  4. means minimizing the sum in the exponent,
  5. means minimizing the MSE (dividing by n does not change the optimal parameters).

Or as a picture:

Image by the author.
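The chain of equivalences can also be checked numerically. The following sketch uses toy labels and a constant model μ (both assumptions for illustration). It scans candidate values of μ and compares the one with the highest Gaussian log-likelihood to the one with the lowest MSE.

```python
import math

y = [0.8, 1.1, 1.3, 0.9, 1.4]   # toy labels
sigma = 0.3                      # fixed, known noise level

def log_likelihood(mu):
    # Sum of log N(y_i | mu, sigma^2) over all observations
    return sum(
        -math.log(math.sqrt(2 * math.pi) * sigma) - (yi - mu) ** 2 / (2 * sigma ** 2)
        for yi in y
    )

def mse(mu):
    return sum((yi - mu) ** 2 for yi in y) / len(y)

candidates = [i / 100 for i in range(0, 201)]   # mu in 0.00, 0.01, ..., 2.00
best_by_likelihood = max(candidates, key=log_likelihood)
best_by_mse = min(candidates, key=mse)
print(best_by_likelihood, best_by_mse)
```

Both searches land on the same value, the sample mean of the labels. Changing `sigma` rescales the log-likelihood but does not move its maximum, which is why σ could be ignored in the derivation above.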

Generalizing the MSE Loss

Congratulations if you survived the last section: you have made it far and have nearly reached your goal! We just have to make a simple observation:

We treated σ as a constant and basically ignored it when doing the maximum-likelihood approach.

But σ is exactly what we want to estimate as well! This is because it captures the uncertainty in the predictions by definition. So, how about we let our model output a value σ(x) in addition to μ(x)? This means that even for a simple regression, the model will have two outputs: one estimate for the true value μ(x), and the uncertainty estimate σ(x) given x.

Image by the author.

Now, we can just replace all the σ by σ(xᵢ) in the above equations and we end up with the following statement: Maximizing the likelihood function means maximizing the term

Image by the author.

This in turn means minimizing the huge sum in the exponent, which is our newly derived loss function (without a catchy name, post suggestions in the comments 😉):

Image by the author.
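The images for the likelihood and the loss are not available in this text version. A reconstruction that is consistent with the steps described above, and with the loss code in the implementation section of this article, is the following. The first line rewrites the likelihood so that everything sits in the exponent (the constant factor involving 2π is dropped), and the second line is the resulting loss:

```latex
\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma(x_i)} \exp\!\left( -\frac{\left( y_i - \mu(x_i) \right)^2}{2\sigma(x_i)^2} \right)
  \;\propto\; \exp\!\left( -\frac{1}{2} \sum_{i=1}^{n} \left[ 2\ln \sigma(x_i) + \left( \frac{y_i - \mu(x_i)}{\sigma(x_i)} \right)^2 \right] \right)

\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \left[ 2\ln \sigma(x_i) + \left( \frac{y_i - \mu(x_i)}{\sigma(x_i)} \right)^2 \right]
```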

Note that I smuggled a 1/n in, but this does not change the optimal solution, as in the case of the MSE.

Note: This loss has some interesting properties. First, it still contains the MSE bit (yᵢ − μ(xᵢ))². Additionally, there are two terms involving σ: ln(σ(x)) as well as 1/σ(x).

In order to keep the loss low, the model cannot output very large values for σ(x) because as σ(x) grows, ln(σ(x)) increases as well. The model cannot output very small values close to zero either because then the term 1/σ(x) becomes large. Thus, the model is forced to output a reasonable guess for σ(x) to balance the penalty of both terms.

Only if (yᵢ − μ(xᵢ))² is small, i.e. the predicted value is quite close to the truth, can the model afford to output a small standard deviation σ(x). In this case, the model is quite sure about its prediction.
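The balancing act can be made concrete for a single data point. Assuming the per-point loss term has the form 2·ln(σ) + (r/σ)² with residual r = y − μ(x), which is the form used in the code later in this article, the sketch below scans σ on a grid. The helper name `point_loss` is made up for illustration.

```python
import math

def point_loss(residual, sigma):
    # Contribution of one observation: log penalty plus scaled squared error
    return 2 * math.log(sigma) + (residual / sigma) ** 2

sigmas = [i / 1000 for i in range(1, 3001)]   # 0.001, 0.002, ..., 3.000

for residual in (0.1, 0.5, 2.0):
    best_sigma = min(sigmas, key=lambda s: point_loss(residual, s))
    print(f"residual={residual}: loss is smallest at sigma={best_sigma}")
```

Setting the derivative with respect to σ to zero gives σ = |r|, so the term is smallest when the predicted standard deviation matches the size of the error. The grid search shows the same thing numerically.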

Alright, enough of the theory. We have earned some coding now!

Implementation in Tensorflow

Alright, so we have learned that we need two things to make a standard neural network output uncertainty:

  1. A second output node that contains the predicted standard deviation (=uncertainty) and
  2. the custom loss function as stated above.

It should be easy to implement both things in any deep learning framework of your choice. We will do it in Tensorflow, simply because I already used PyTorch last time to explain interpretable neural networks (https://towardsdatascience.com/interpretable-neural-networks-with-pytorch-76f1c31260fe). 😎

Let’s start with a simple example.

Constant Noise

First, we will create a toy dataset consisting of 1000 points with constant noise via

import tensorflow as tf

tf.random.set_seed(0)

X = tf.random.uniform(minval=-1, maxval=7, shape=(1000,))
y = tf.sin(X) + tf.random.normal(mean=0, stddev=0.3, shape=(1000,))

We can visualize this dataset:

Image by the author.

Alrighty, so it is merely a sine wave with N(0, 0.3²) distributed noise added to it. In the best case, the actual prediction of the model follows the sine wave, while each uncertainty estimate is around 0.3. We build a simple feed-forward network via

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(2) # Output = (μ, ln(σ))
])

Ok, so we dealt with the first ingredient already by defining a neural network with two outputs.

To simplify the computations, let us assume that the second output is not σ(xᵢ) directly, but ln(σ(xᵢ)) instead. We do this because the two neurons from the last layer can output arbitrary real values, especially values that are less than zero, which does not make sense for the standard deviation. But the logarithm of the standard deviation can be any real number, so the domains match then. And we need ln(σ(xᵢ)) in the loss function anyway, so let’s go for it. Speaking of the loss function, we can define it via

def loss(y_true, y_pred):
    mu = y_pred[:, :1] # first output neuron
    log_sig = y_pred[:, 1:] # second output neuron
    sig = tf.exp(log_sig) # undo the log
    
    return tf.reduce_mean(2*log_sig + ((y_true-mu)/sig)**2)
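For readers who want to check the loss by hand or port it to another framework, here is a framework-free sketch of the same expression in plain Python. The name `gaussian_nll_loss` and the list-based interface are assumptions for illustration; it is meant to mirror the TensorFlow function above, not replace it.

```python
import math

def gaussian_nll_loss(y_true, mu, log_sig):
    """Mean over all points of 2*ln(sigma) + ((y - mu) / sigma)**2."""
    total = 0.0
    for y, m, ls in zip(y_true, mu, log_sig):
        sig = math.exp(ls)            # undo the log, as in the TensorFlow version
        total += 2 * ls + ((y - m) / sig) ** 2
    return total / len(y_true)

# Perfect predictions with sigma = 1 (log_sig = 0) give a loss of exactly 0.
perfect = gaussian_nll_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0])
# An error of 1 with sigma = 1 adds (1/1)**2 = 1 for that point, so the mean is 0.5.
one_miss = gaussian_nll_loss([1.0, 2.0], [1.0, 3.0], [0.0, 0.0])
print(perfect, one_miss)
```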

The rest is business as usual. You compile the model with this loss function and fit.

model.compile(loss=loss)

model.fit(
    tf.reshape(X, shape=(1000, 1)),
    tf.reshape(y, shape=(1000, 1)),
    batch_size=32,
    epochs=100
)

Let us check the uncertainty estimates that the model has learned:

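# Pass the inputs as a column of shape (1000, 1), matching the shape used in fit
print(tf.exp(model(tf.reshape(X, shape=(1000, 1)))[:20, 1]))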

# Output:
# tf.Tensor(
# [0.29860803 0.27371496 0.32216415 0.32288837 0.31084406 0.30166912
# 0.32059005 0.3331769  0.31244662 0.31863096 0.30940703 0.32042852
# 0.3231969  0.29584357 0.31141806 0.32493973 0.3169802  0.32060665
# 0.30542135 0.31733593], shape=(20,), dtype=float32)

Looks good to me! The model learned that the noise has a standard deviation of around 0.3. And here is a visualization of what the model has learned:

Image by the author.

That’s how we like it. The actual prediction μ follows the data while the uncertainty σ is just high enough to capture the noise in the labels y.

Varying Noise

We now spice things up a little bit by introducing non-constant noise, something that statisticians call heteroscedasticity. Take a look at this:

tf.random.set_seed(0)

X = tf.random.uniform(minval=-1, maxval=7, shape=(1000,))
sig = 0.1*(X+1)
y = tf.sin(X) + tf.random.normal(mean=0, stddev=sig, shape=(1000,))

This creates a dataset with noise increasing in feature X.

Image by the author.

The ground truth is still the same: it’s a sine wave, and the model should be able to capture this. However, the model should also learn that higher values for X mean higher uncertainty.

Spoiler: If you re-train the same model as above on the new dataset, this is exactly what you will see.

Image by the author.

Pretty sweet in my opinion.
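One way to check whether such uncertainty estimates are sensible is to count how often the labels fall within one predicted standard deviation of the prediction. For well-calibrated Gaussian noise this should happen for roughly 68% of the points. The sketch below is only an illustration: it uses the known ground truth of the toy dataset (sin(x) and 0.1·(x+1)) as stand-ins for the model outputs μ(x) and σ(x), and plain Python instead of Tensorflow. With a trained model, you would plug in its outputs instead.

```python
import math
import random

random.seed(0)
n = 20000

inside = 0
for _ in range(n):
    x = random.uniform(-1, 7)
    true_mu = math.sin(x)
    true_sig = 0.1 * (x + 1)              # same noise schedule as in the dataset
    y = true_mu + random.gauss(0, true_sig)
    # true_mu and true_sig stand in for the model outputs mu(x) and sigma(x)
    if abs(y - true_mu) <= true_sig:
        inside += 1

coverage = inside / n
print(f"fraction within one sigma: {coverage:.3f}")
```

A fraction far below 68% would suggest that the predicted σ is too small (overconfident), a fraction far above that it is too large.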

Conclusion

In this article, you have learned how to tweak a neural network so that it can output estimates for uncertainty together with its actual prediction. All it takes is an additional output neuron and a loss function that is only slightly more complicated than the MSE.

The good thing about uncertainty estimates is that they let you assess the model’s confidence in its predictions — you know whether you can trust the model’s predictions or not. They also allow you to report lower or upper bounds for estimates, something that is worth a lot when calculating best or worst-case scenarios.
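As a sketch of how such bounds could be computed: under the normal assumption used throughout this article, a central interval follows directly from μ and σ. The function name `prediction_interval` and the example numbers are made up; `statistics.NormalDist` is part of the Python standard library.

```python
from statistics import NormalDist

def prediction_interval(mu, sigma, level=0.95):
    """Central interval that contains `level` of the mass of N(mu, sigma^2)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for level=0.95
    return mu - z * sigma, mu + z * sigma

lower, upper = prediction_interval(mu=1000.0, sigma=50.0)
print(f"95% interval: [{lower:.0f}, {upper:.0f}]")
```

These bounds are only as trustworthy as the predicted σ and the normal assumption behind them.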

Another popular way of getting uncertainty estimates is using Bayesian inference. However, the math is more involved and it is much slower than the solution that I presented to you here. Also, I find the packages for (deep) Bayesian learning not as easy to use as Tensorflow or PyTorch at the moment, although this might change when Bayesian methods gain even more traction. Still, I love this topic, so check it out as well! 😉

What I have given you here is a simple tool that lets you circumvent the Bayesian hassle and does not require you to change much in your everyday behavior while still giving you a great benefit from the Bayesian world.

Bonus (thanks to the great input from Carlos Aya-Moreno): An additional way of getting uncertainty estimates is bootstrapping. Basically, this is what happens if you use random forests: you create b smaller datasets from your original dataset by subsampling, train a model on each of them, and then you get b different predictions for an input. The mean of these b predictions is your final prediction, while the standard deviation of these b predictions is a measure of uncertainty. For example, if all of the b models’ outputs are kind of the same, the uncertainty will be small.
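Here is a minimal sketch of this idea in plain Python. It assumes a deliberately trivial "model" (a least-squares line through toy data) and draws each dataset by resampling with replacement; all names and numbers are made up for illustration.

```python
import random
import statistics

random.seed(0)

# Toy data: y = 2x + 1 plus Gaussian noise
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + random.gauss(0, 1) for x in xs]

def fit_line(x_vals, y_vals):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    mx, my = statistics.fmean(x_vals), statistics.fmean(y_vals)
    sxx = sum((x - mx) ** 2 for x in x_vals)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_vals, y_vals))
    slope = sxy / sxx
    return slope, my - slope * mx

b = 100          # number of bootstrap models
x_new = 5.0      # input we want a prediction for
predictions = []
for _ in range(b):
    # Draw a resampled dataset of the same size, with replacement
    idx = [random.randrange(len(xs)) for _ in range(len(xs))]
    slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
    predictions.append(slope * x_new + intercept)

final_prediction = statistics.fmean(predictions)   # mean of the b predictions
uncertainty = statistics.stdev(predictions)        # spread of the b predictions
print(final_prediction, uncertainty)
```

Note that this spread reflects how much the fitted model varies across resamples, which is a different quantity from the label noise σ(x) estimated by the loss-based approach in this article.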

The problem with this approach is, however, that you need to train b different models, which can be quite expensive. In random forests, it works well because a single decision tree is fast and easy to fit. For neural networks, things look darker. There is also research by Jeremy Nixon et al. (https://www.gatsby.ucl.ac.uk/~balaji/why_arent_bootstrapped_neural_networks_better.pdf) suggesting that, even leaving the computational issue aside, bootstrapping neural networks might not be too beneficial.

I hope that you learned something new, interesting, and useful today. Thanks for reading!

If you have any questions, write to me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!
