Monte Carlo Dropout
Improve your neural network for free with one small trick, getting model uncertainty estimate as a bonus.
There ain’t no such thing as a free lunch, at least according to the popular adage. Well, not anymore! Not when it comes to neural networks, that is to say. Read on to see how to improve your network’s performance with an incredibly simple yet clever trick called the Monte Carlo Dropout.

Dropout
The magic trick we are about to introduce only works if your neural network has dropout layers, so let’s kick off with briefly introducing these. Dropout boils down to simply switching-off some neurons at each training step. At each step, a different set of neurons are switched off. Mathematically speaking, each neuron has some probability p of being ignored, called the dropout rate. The dropout rate is typically set to be between 0 (no dropout) and 0.5 (approximately 50% of all neurons will be switched off). The exact value depends on the network type, layer size, and the degree to which the network overfits the training data.

But why do this? Dropout is a regularization technique, that is, it helps prevent overfitting. With little data and/or a complex network, the model might memorize the training data and, as a result, work great on the data it has seen during training but deliver terrible results on new, unseen data. This is called overfitting, and dropout seeks to alleviate it.
How? There are two ways to understand why switching off some parts of the model might be beneficial. First, the information spreads out more evenly across the network. Think about a single neuron somewhere inside the network. There are a couple of other neurons that provide it with inputs. With dropout, each of these input sources can disappear at any time during training. Hence, our neuron cannot rely on one or two inputs only, it has to spread out its weights and pay attention to all inputs. As a result, it becomes less sensitive to input changes which results in the model generalizing better.
The other explanation of dropout’s effectiveness is even more important from the point of view of our Monte Carlo trick. Since in every training iteration you randomly sample the neurons to be dropped out in each layer (according to that layer’s dropout rate), a different set of neurons are being dropped out each time. Hence, each time the model’s architecture is slightly different and you can think of the outcome as an averaging ensemble of many different neural networks, each trained on one batch of data only.
A final detail: dropout is only used during training. At inference time, that is when we make predictions with our network, we typically don’t apply any dropout — we want to use all the trained neurons and connections.

Monte Carlo
Now that we have dropout out of the way, what is Monte Carlo? If you’re thinking about a neighborhood in Monaco, you’re right! But there is more to it.
In statistics, Monte Carlo refers to a class of computational algorithms that rely on repeated random sampling to obtain a distribution of some numerical quantity.

Monte Carlo Dropout: model accuracy
Monte Carlo Dropout, proposed by Gal & Ghahramani (2016), is a clever realization that the use of the regular dropout can be interpreted as a Bayesian approximation of a well-known probabilistic model: the Gaussian process. We can treat the many different networks (with different neurons dropped out) as Monte Carlo samples from the space of all available models. This provides mathematical grounds to reason about the model’s uncertainty and, as it turns out, often improves its performance.
How does it work? We simply apply dropout at test time, that's all! Then, instead of one prediction, we get many, one by each model. We can then average them or analyze their distributions. And the best part: it does not require any changes in the model’s architecture. We can even use this trick on a model that has already been trained! To see it working in practice, let’s train a simple network to recognize digits from the MNIST dataset.






