avatarKumar Shridhar

Summary

The provided content introduces an eight-post series on Bayesian Neural Networks, emphasizing the need for probabilistic approaches in deep learning to address overconfidence and overfitting, and outlines the practicality and methods for implementing Bayesian Convolutional Neural Networks.

Abstract

The article kicks off a comprehensive series on Bayesian Neural Networks (BNNs), particularly focusing on Bayesian Convolutional Neural Networks (BCNNs). It underscores the necessity for BNNs to mitigate the issues of overconfidence and overfitting prevalent in traditional neural networks. The series promises to cover essential background knowledge, recent advancements, and practical applications of BNNs, including uncertainty estimation and model pruning. The article highlights the limitations of point-estimate neural networks and advocates for the Bayesian approach to provide uncertainty measures and inherent regularization, making models more robust, especially in domains like healthcare where accurate uncertainty estimates are crucial. It also discusses the computational challenges of Bayesian methods, particularly the difficulty of performing posterior inference in deep neural networks, and suggests variational inference as a practical solution. The author plans to delve into the implementation of BCNNs using Bayes by Backprop, a technique that allows for efficient approximation of the true posterior, and will demonstrate the applicability of BCNNs to various tasks, architectures, and domains, including image super-resolution and generative adversarial networks.

Opinions

  • Traditional neural networks, particularly those with point-estimate weights, are prone to overfitting and making overconfident predictions, which is problematic in sensitive domains.
  • Bayesian Neural Networks offer a solution by representing weights as probability distributions, providing a measure of uncertainty and acting as a regularizer during training.
  • Despite the theoretical appeal of Bayesian methods, practical implementation has been challenging due to the computational complexity and the vast number of parameters in deep models, especially CNNs.
  • Variational inference, particularly the Bayes by Backprop algorithm, is presented as a promising approach to make Bayesian Neural Networks more computationally feasible.
  • The author suggests that Bayesian CNNs can learn richer representations and make more reliable predictions through model averaging, while also being able to estimate different types of uncertainties.
  • The series is anticipated to demonstrate the practicality of Bayesian CNNs by showcasing their performance and versatility across various applications and by comparing them to traditional point estimate networks.
  • The author expresses enthusiasm about the upcoming content in the series, inviting readers to follow along for a deeper understanding of Bayesian deep learning and its potential impact on the field.

Bayesian Neural Network Series Post 1: Need for Bayesian Neural Networks

Figure 1: Network with point-estimates as weights vs Network with probability distribution as weights. Source

This post is the first post in an eight-post series of Bayesian Convolutional Networks. The posts will be structured as follows:

  1. Need for Bayesian Neural Networks
  2. Background knowledge needed to understand Bayesian Neural Networks better
  3. Some recent work in the field of Bayesian Neural Networks
  4. Bayesian Convolutional Neural Networks using Variational Inference
  5. Build your own Bayesian Convolutional Neural Network in PyTorch
  6. Uncertainty estimation in a Bayesian Neural Networks
  7. Model Pruning in a Bayesian Neural Network
  8. Applications in other areas (Super Resolution, GANs and so on..)

The blogs will be released every month starting first-week January 2020. Stay tuned! Also, feel free to check the post on ‘Why the World needs a Bayesian perspective’.

Let’s start this series by understanding the need for Bayesian Neural Networks in this blog.

Problem Statement

Deep Neural Networks (DNNs), are connectionist systems that learn to perform tasks by learning on examples without having prior knowledge about the tasks. They easily scale to millions of data points and yet remain tractable to optimize with stochastic gradient descent.

Convolutional Neural Networks (CNN), a variant of DNNs, have already surpassed human accuracy in the realm of image classification. Due to the capacity of CNNs to fit on a wide diversity of non-linear data points, they require a large amount of training data. This often makes CNN and neural networks in general, prone to over-fitting on datasets with small numbers of training examples per class. The model tends to fit well to the training data but does not predict well for unseen data. This often makes neural networks incapable of correctly assessing the uncertainty in the training data and hence leads to overly confident decisions about the correct class, prediction or action.

If the terms in this blog seem to a bit advanced for you, it is recommended to refresh some Deep Learning basics from here.

To understand this, let’s think of training a binary classifier CNN on dogs and cats image classes. Now, when an image of a leopard is encountered in the test dataset, the model should ideally predict it neither dog nor cat (a 50 per cent probability for dog and 50 per cent probability for cat class). However, due to a softmax function at the output layer to achieve the probability score, it squishes one class output probability score and maximizes the other, leading to an overconfident decision for one class. This is one of the major problems of a point-estimate neural network.

Note that the term point-estimate is used for a neural network where weights are represented by a single point. On the other hand, a Bayesian neural network represents the weights in form of distribution as seen in Figure 1.

But do we really need a Bayesian neural network for it? Various regularization techniques for controlling over-fitting are used in practice, namely early stopping, weight decay, L1 or L2 regularizations and currently the most popular and empirically very effective technique, dropout.

If we can solve the issue of making overconfident decisions and prevent the over-fitting of models by regularizing the model, then the question remains as it is: Why do we need a Bayesian neural network?

In short, the answer is: a measure of uncertainty in the prediction is missing from the current neural networks architectures, but Bayesian neural networks incorporate this.

Current Situation

Deep neural networks have been successfully applied to many domains, including very sensitive domains like health-care, security, fraudulent transactions and many more. These domains rely heavily on the predictions accuracy of the model and even one overconfident decision can result in a big problem. Also, these domains have very imbalanced datasets (one in a million fraudulent transactions, nearly five per cent of all tests result in positive cancer results, less than one per cent email is spam), and this leads to the model being over-fitted to the over-sampled class.

From a probability theory perspective, it is unjustifiable to use single point-estimates as weights to base any classification on. Bayesian neural networks, on the other hand, are more robust to over-fitting, and can easily learn from small datasets. The Bayesian approach further offers uncertainty estimates via its parameters in the form of probability distributions (see Figure 1). At the same time, by using a prior probability distribution to integrate out the parameters, the average is computed across many models during training, which gives a regularization effect to the network, thus preventing over-fitting.

The practicality of Bayesian neural networks

Bayesian posterior inference over the neural network parameters is a theoretically attractive method for controlling over-fitting; however, modelling a distribution over the kernels (also known as filters) of a CNN has never been attempted successfully before, perhaps because of the vast number of parameters and extremely large models commonly used in practical applications.

Even with a small number of parameters, inferring model posterior in a Bayesian neural network is a difficult task. Approximations to the model posterior are often used instead, with the variational inference being a popular approach. Here, one would model the posterior using a simple variational distribution such as a Gaussian distribution, and try to fit the distribution’s parameters to be as close as possible to the true posterior. This is done by minimising the Kullback-Leibler divergence between this simple variational distribution and the true posterior. Many have followed this approach in the past for standard neural networks models. But the variational approach used to approximate the posterior in Bayesian NNs can be fairly computationally expensive — the use of Gaussian approximating distributions increases the number of model parameters considerably, without increasing model capacity by much. Blundell et al. (2015), for example, used Gaussian distributions for Bayesian NN posterior approximation and have doubled the number of model parameters, yet report the same predictive performance as traditional approaches using dropout. This makes the approach unsuitable in practice to use with CNN as the increase in the number of parameters is too costly.

How do we do it then?

There are many ways to build the Bayesian neural networks (we will ponder over a lot of them in Blog 3). However, in this series, we will focus on building a Bayesian CNN using Bayes by Backprop. The exact Bayesian inference on the weights of a neural network is intractable as the number of parameters is very large and the functional form of a neural network does not lend itself to exact integration. So, we approximate the intractable true posterior probability distributions p(w|D) with variational probability distributions q_θ(w|D), which comprise the properties of Gaussian distributions μ ∈ ℝ^d and σ ∈ ℝ^d, denoted N(θ|μ, σ²), where d is the total number of parameters defining a probability distribution. The shape of these Gaussian variational posterior probability distributions, determined by their variance σ², expresses an uncertainty estimation of every model parameter.

Graphical Intuition of the above as defined by Graves (2011). Source

If you did not understand what exactly was said in the previous paragraph, don't worry. In the next post, we will cover all the basics needed to understand a Bayesian neural network.

What to expect over the weeks

  1. We will see how Bayes by Backprop can be efficiently applied to a CNN. We will introduce the idea of applying two convolutional operations, one for the mean and one for the variance.
  2. We will see how the model learns richer representations and predictions from multiple cheap model averaging.
  3. We will see that the proposed generic and reliable variational inference method for Bayesian CNN can be applied to various CNN architectures without any limitations on their performances. We will code the model in PyTorch and will compare the results with point estimate networks.
  4. We will estimate the aleatoric and epistemic uncertainties in a Bayesian neural network. Furthermore, we will empirically show how uncertainty decreases, allowing the decisions made by the network to become more confident as the training accuracy increases.
  5. We will learn that our method typically only doubles the number of parameters yet trains an infinite ensemble using unbiased Monte Carlo estimates of the gradients.
  6. We will apply a L1 norm to the trained model parameters and prune the number of non-zero values. Further, we fine-tune the model to reduce the number of model parameters without a reduction in the model prediction accuracy.
  7. Finally, we will apply the concept of Bayesian CNN to tasks like Image Super-Resolution and Generative Adversarial Networks and we will compare the results with other prominent architectures in the respective domain.

Exciting! Right? Go to the next post.

For all the impatient masters, check the entire theory here and implementation in PyTorch here.

Suggested Reads:

  1. Weight Uncertainty in Neural Networks
  2. Practical Variational Inference for Neural Networks
  3. Bayesian Convolution Neural Networks using Variational Inference
  4. Bayes by Backprop

Implementation:

PyTorch

Feel free to provide your valuable comments and reach me here or on Twitter.

Machine Learning
Deep Learning
Neural Networks
Artificial Intelligence
AI
Recommended from ReadMedium
avatarLouis-François Bouchard
Is OpenAI o1 Good?

o1, Strawberry, scam?

6 min read