Ms Aerin

Summary

Bayesian inference is a statistical method that updates the probability of an event as more data is gathered, combining prior knowledge with new evidence to make more informed predictions.

Abstract

Bayesian inference is a powerful approach in statistics that allows for the updating of probabilities as new data becomes available. It involves combining prior distributions with likelihood functions to derive a posterior distribution, which represents the updated probabilities. This method is particularly useful for making predictions about uncertain events, such as predicting the percentage of people who will engage with a blog post based on past data. The process includes choosing appropriate probability distributions for the prior and likelihood, and then calculating the posterior distribution. The posterior distribution is considered smarter than the prior because it incorporates the new data. In practice, Bayesian inference can be applied to a wide range of problems, including click-through rates, conversion rates, and medical survival rates. The example provided demonstrates how to implement Bayesian inference using Python, where the Beta distribution is used as a prior for the probability of clapping on a blog post, and the Binomial distribution models the likelihood of the observed clapping data. The posterior distribution is then calculated and visualized, showing how the new data shifts and narrows the distribution, leading to more precise estimates.

Opinions

  • The author suggests that Bayesian inference is superior to classic maximum likelihood estimation (MLE) because it takes prior knowledge into account, making it more informative and adaptable to new data.
  • The choice of probability distributions for modeling parameters and data is crucial in Bayesian inference.
  • Visualizing the prior, likelihood, and posterior distributions helps in understanding how new data influences the updated probabilities.
  • The use of conjugate priors, such as the Beta distribution for Binomial data, simplifies the calculation of the posterior distribution and is computationally efficient.
  • The author emphasizes the importance of normalization in comparing posteriors from different models or calculating point estimates, but also notes that finding the maximum posterior does not strictly require normalization.
  • MAP estimation is presented as a regularized version of MLE, incorporating additional information through the prior, which can be seen as a form of regularization.
  • The iterative nature of Bayesian inference is highlighted, where the posterior becomes the new prior with the arrival of new data, facilitating continuous learning and updating of beliefs.

Bayesian Inference — Intuition and Example

with Python Code

Why did someone need to invent Bayesian inference?

In a nutshell, Bayesian inference was invented to update probabilities as we gather more data.

The essence of Bayesian Inference is to combine two different distributions (likelihood and prior) into one “smarter” distribution (posterior). The posterior is considered “smarter” because, unlike classic maximum likelihood estimation (MLE), it takes the prior into account. Once we calculate the posterior, we use it to find the optimum parameters that maximize the posterior probability, given the data. This process is called Maximum A Posteriori (MAP) estimation. The optimization techniques used in MAP are the same as the ones used in typical machine learning, such as gradient descent or Newton’s method.

Take a look at the famous Bayes’ rule equation. It’s pretty straightforward in an analytical sense. But how do we implement this equation with data?

P(θ|X) = P(X|θ) P(θ) / P(X)

One of the most famous equations in the world of Statistics: the posterior P(θ|X) is the likelihood P(X|θ) times the prior P(θ), divided by the evidence P(X).

In the real world, we will have a large number of data points X. How do we multiply the probability wrt X by the probability wrt θ?

The art of Bayesian inference seems to be in how you implement it.

Bayesian Inference in Practice:

Let’s consider an example: I have about 2,000 readers per day visiting my Medium blog. Some readers clap after reading the articles, and some don’t. I’d like to make predictions about the percentage of people who will engage and clap when I write a new blog post in the future.

This type of problem is prevalent and can be applied to various modeling tasks, such as the click-through rate (CTR) of an advertisement, the conversion rate of customers actually purchasing on your website, the percentage of people who would agree to go on a date with you, or the survival chance for women with breast cancer.

Generating Data:

First, let’s generate the data X. In real life, you have no control over X — this is what you are going to observe.

import numpy as np
np.set_printoptions(threshold=100)
# Generating 2,000 readers' responses 
# Assuming the claps follow a Bernoulli process - a sequence of binary (success/failure) random variables.
# 1 means clap. 0 means no clap.
# We pick a success rate of 30%.
clap_prob = 0.3
# IID (independent and identically distributed) assumption
clap_data = np.random.binomial(n=1, p=clap_prob, size=2000)

The clapping data is binary: 1 indicates a clap, and 0 means no clap.

The data X looks like:

In [1]: clap_data
Out[1]: array([0, 0, 0, ..., 0, 1, 0])
In [2]: len(clap_data)
Out[2]: 2000
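
As a quick sanity check (my addition, not in the original walkthrough), you can verify that the simulated data really does clap at roughly the 30% rate we chose:

# Sanity check: the empirical clap rate should be close to the
# true 30% we simulated with, up to sampling noise.
print(clap_data.mean())  # roughly 0.3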

Bayesian Inference has three steps.

Step 1. Prior P(θ): Choose a PDF to model your parameter θ, aka the prior distribution P(θ). This is your best guess about parameters *before* seeing the data X.

Step 2. Likelihood P(X|θ): Choose a PDF for P(X|θ). Basically, you are modeling what the data X will look like, given the parameter θ.

Step 3. Posterior P(θ|X): Calculate the posterior distribution P(θ|X) and select the θ that has the highest P(θ|X).

And the posterior has become the new prior! Repeat step 3 as you gather more data.

Step 1. Prior P(θ)

The first step is to choose the PDF to model the parameter θ.

What does the parameter θ represent?

It’s the clapping 👏 *probability*.

Then, in order to model a probability, what kinds of probability distributions should we use?

There are a few prerequisites for a probability distribution to be an appropriate contender for expressing a probability. First, its domain should be between 0 and 1. Second, the distribution should be continuous.

Then there are two well-known probability distributions that I can think of:

Beta and Dirichlet.

Dirichlet is for multivariate, and Beta is for univariate. Since we have only one thing to predict, which is a probability, let’s use the Beta distribution.

An interesting side note: It’s easy to make an arbitrary distribution over (0,1). Just take any function that doesn’t blow up anywhere between 0 and 1 and stays positive. Then, simply integrate it from 0 to 1 and divide the function by that result.
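
To make that side note concrete, here is a small sketch of my own (the function f is an arbitrary choice, not anything from the original post) that normalizes a positive function on (0, 1) into a valid density:

import numpy as np
from scipy.integrate import quad

# Any function that stays positive and finite on (0, 1) works; this f is arbitrary.
f = lambda t: np.exp(t) * (1 - t) ** 2

# Normalizing constant: the integral of f over (0, 1)
Z, _ = quad(f, 0, 1)

# density(t) = f(t) / Z now integrates to 1 over (0, 1), so it is a valid PDF
density = lambda t: f(t) / Z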

To use the Beta distribution, we need to determine two parameters, α and β. You can think of α as how many people clap (the number of successes) and β as how many people don’t clap (the number of failures). These parameters — how big or small α & β are — will determine the shape of the distribution.

Let’s say you have data from yesterday, and observed that 400 people out of 2,000 visitors clapped.

How would you write this using the beta distribution?

import scipy.stats as stats
import matplotlib.pyplot as plt
a = 400
b = 2000 - a
# domain θ
theta_range = np.linspace(0, 1, 1000)
# prior distribution P(θ)
prior = stats.beta.pdf(x = theta_range, a=a, b=b)

Now, let’s plot the prior distribution with respect to all θ values.

# Plotting the prior distribution
plt.rcParams['figure.figsize'] = [20, 7]
fig, ax = plt.subplots()
plt.plot(theta_range, prior, linewidth=3, color='palegreen')
# Add a title
plt.title('[Prior] PDF of "Probability of Claps"', fontsize=20)
# Add X and y Label
plt.xlabel('θ', fontsize=16)
plt.ylabel('Density', fontsize=16)
# Add a grid
plt.grid(alpha=.4, linestyle='--')
# Show the plot
plt.show()

Not familiar with the term “Density” on the Y-axis? → Read “PDF is NOT a probability”

As expected, the plot peaks at 20% (400 claps / 2000 readers). Two thousand data points seem to produce a strong prior. If we had fewer data points, say, 100 readers, the curve would be less spiky. You can try this with α = 20 & β = 80.
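
If you want to see the difference yourself, here is a small sketch (my addition, reusing theta_range, prior, stats, and plt from the code above) that plots the strong Beta(400, 1600) prior next to the weaker Beta(20, 80) one; both peak near 20%, but the second curve is much wider:

# (cont.) Strong prior (2,000 readers) vs. weak prior (100 readers)
weak_prior = stats.beta.pdf(x=theta_range, a=20, b=80)
fig, ax = plt.subplots()
plt.plot(theta_range, prior, linewidth=3, color='palegreen', label='Beta(400, 1600)')
plt.plot(theta_range, weak_prior, linewidth=3, color='orange', label='Beta(20, 80)')
plt.legend(fontsize=16)
plt.xlabel('θ', fontsize=16)
plt.ylabel('Density', fontsize=16)
plt.grid(alpha=.4, linestyle='--')
plt.show()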

For those wondering how probability density can be greater than 1: 👉 Read “Probability Density is NOT a probability”.

Step 2. Likelihood P(X|θ)

Choose a probability model for P(X|θ), the probability of seeing the data X given a particular parameter θ. Likelihood is also called a sampling distribution. To me, the term “sampling distribution” is much more intuitive than “likelihood”.

To choose which probability distribution to use to model the sampling distribution, we first need to ask:

What does our data X look like?

X is a binary array [0,1,0,1,...,0,0,0,1].

We also have the total number of visitors (n) and we want the probability of clapping (p).

OK, n & p. What do they scream at you?

Binomial distribution with n & p.

# The sampling distribution P(X|θ), evaluated at the prior's point estimate θ = a/(a+b) = 0.2
likelihood = stats.binom.pmf(k = np.sum(clap_data), n = len(clap_data), p = a/(a+b))

Is it likely that our prior assumption θ is true?

In [63]: likelihood
Out[63]: 4.902953768848812e-30

Nope.

Visualizing the Likelihood P(X|θ) for all possible θ

# Likelihood P(X|θ) for all θ's
likelihood = stats.binom.pmf(k = np.sum(clap_data), n = len(clap_data), p = theta_range)
# Create the plot
fig, ax = plt.subplots()
plt.plot(theta_range, likelihood, linewidth=3, color='yellowgreen')
# Add a title
plt.title('[Likelihood] Probability of Claps' , fontsize=20)
# Add X and y Label
plt.xlabel('θ', fontsize=16)
plt.ylabel('Probability', fontsize=16)
# Add a grid
plt.grid(alpha=.4, linestyle='--')
# Show the plot
plt.show()

Likelihood P(X|θ) given the data

Step 3. Posterior P(θ|X)

Finally, let’s answer the question we asked in the beginning:

In the real world, we will have a large number of data points X. How do we multiply the probability wrt X by the probability wrt θ?

Even though there are thousands of data points, we can convert them into a single scalar, the likelihood P(X|θ), by plugging the data into the model we chose (in this example, the binomial distribution).

Then, we calculate P(θ) and P(X|θ) for a specific θ and multiply them together. If you do this for every possible θ, you can pick the highest P(θ) * P(X|θ) among different θ’s.

Your initial guess about parameters was P(θ). Now you are upgrading a simple P(θ) into something more informative — P(θ|X) — as more data becomes available. P(θ|X) is still the probability of θ, just like P(θ). However, P(θ|X) is a smarter version of P(θ).

The following code demonstrates this process:

# (cont.)
theta_range_e = theta_range + 0.001 
prior = stats.beta.cdf(x = theta_range_e, a=a, b=b) - stats.beta.cdf(x = theta_range, a=a, b=b) 
# prior = stats.beta.pdf(x = theta_range, a=a, b=b)
likelihood = stats.binom.pmf(k = np.sum(clap_data), n = len(clap_data), p = theta_range) 
posterior = likelihood * prior # element-wise multiplication
normalized_posterior = posterior / np.sum(posterior)
In [74]: np.argmax(prior)
Out[74]: 199
In [75]: np.argmax(likelihood)
Out[75]: 306
In [76]: np.argmax(posterior)
Out[76]: 253
# Plotting all three together
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(20,7))
plt.xlabel('θ', fontsize=24)
axes[0].plot(theta_range, prior, label="Prior", linewidth=3, color='palegreen')
axes[0].set_title("Prior", fontsize=16)
axes[1].plot(theta_range, likelihood, label="Likelihood", linewidth=3, color='yellowgreen')
axes[1].set_title("Sampling (Likelihood)", fontsize=16)
axes[2].plot(theta_range, posterior, label='Posterior', linewidth=3, color='olivedrab')
axes[2].set_title("Posterior", fontsize=16)
plt.show()

When you look at the posterior graph (the third one), notice how the likelihood has been pulled toward the prior. The clapping probability under the prior was 20%. The clapping probability suggested by the data was 30%. Now, the posterior has its peak around 0.25 (25%).

Also, look at the width of the bell curves in the prior and likelihood graphs; the curve has narrowed in the posterior graph. This is because we incorporated more information through sampling, thus reducing the range of plausible parameter values.

The more data you gather, the more the posterior graph will resemble the likelihood graph and deviate from the prior graph. In other words, as you get more data, the original prior distribution matters less and less.
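
Here is a rough sketch of that effect (my addition, continuing the code above): simulate ten times as many readers and recompute the posterior; its peak should move from roughly 0.25 noticeably closer to the data's 30% clap rate.

# (cont.) Ten times more data: 20,000 simulated readers instead of 2,000
more_data = np.random.binomial(n=1, p=clap_prob, size=20000)
likelihood_big = stats.binom.pmf(k=np.sum(more_data), n=len(more_data), p=theta_range)
posterior_big = likelihood_big * prior  # same cdf-difference prior as before
# The peak now sits much closer to the likelihood's ~0.3 than to the prior's 0.2
print(theta_range[np.argmax(posterior_big)])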

Finally, we select the θ that gives the highest posterior probability, computed using numerical optimization methods such as gradient descent or Newton’s method. This iterative procedure is called Maximum A Posteriori (MAP) estimation.
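
For completeness, here is one way to find the MAP estimate numerically. It is a sketch of my own that uses scipy.optimize's bounded scalar minimizer instead of the gradient-descent or Newton routines mentioned above, and it minimizes the negative log posterior, which is numerically safer than multiplying tiny probabilities:

from scipy.optimize import minimize_scalar

# Negative log posterior (up to an additive constant): -log P(X|θ) - log P(θ)
def neg_log_posterior(theta):
    log_likelihood = stats.binom.logpmf(k=np.sum(clap_data), n=len(clap_data), p=theta)
    log_prior = stats.beta.logpdf(theta, a=a, b=b)
    return -(log_likelihood + log_prior)

result = minimize_scalar(neg_log_posterior, bounds=(0.001, 0.999), method='bounded')
theta_map = result.x  # should land near the grid-based peak of roughly 0.25
print(theta_map)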

Footnotes:

  1. We calculated the prior by subtracting two stats.beta.cdf values instead of using stats.beta.pdf because the likelihood stats.binom.pmf is a probability, while stats.beta.pdf returns a density. Even if we use the density to calculate the posterior, it won’t change the optimization result. However, if you want the units to match, converting density into probability is necessary. (Still have questions? Read “PDF is NOT a probability, while PMF IS a probability”.)
  2. MAP can be calculated using methods other than numerical optimization. A closed-form solution (for certain distributions only), a modified expectation-maximization algorithm, or a Monte Carlo method can also be used.
  3. Using a closed-form to compute a posterior is very convenient since it eliminates the need for costly calculations. Look at the integration over every possible θ in the denominator of the Bayes formula. Very expensive! In the given example, the Beta distribution is a conjugate prior of the Binomial likelihood. This means that we already know, during the modeling phase, that the posterior will also be a Beta distribution. Then, after carrying out more experiments, we can compute the posterior simply by adding the number of successes and failures to the current parameters α and β, respectively (see the short sketch after these footnotes). When can we use this simple and elegant closed-form to calculate the posterior? It is applicable when the prior has a closed-form (of course) AND is a conjugate prior. A conjugate prior is defined as a prior that, when multiplied by the likelihood and divided by the normalizing constant, stays in its original distribution family.
  4. Is the normalization (integrating over all possible θ values in the denominator) necessary for finding the maximum posterior? No. You can still find the maximum posterior without normalization. However, if you want to compare posteriors from different models or calculate point estimates, normalization is necessary.

  5. MAP estimation can be viewed as a regularized ML approach that incorporates additional information through the prior.

  6. Are you familiar with maximum likelihood estimation (MLE) but not quite with maximum a posteriori (MAP) estimation? MLE is just a special case of MAP with a uniform prior.

  7. While I wrote, “Prior is your best guess about parameters *before* seeing the data”, in practice, once the posterior is calculated, it becomes the new prior until a new batch of data comes in. This way, we can iteratively update our prior and posterior. Markov Chain Monte Carlo sampling methods are based on this idea.
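
To make footnote 3 (and the posterior-becomes-prior idea in footnote 7) concrete, here is a minimal sketch of my own showing the closed-form conjugate update for the Beta prior and Binomial likelihood used in this post:

# (cont.) Conjugate Beta-Binomial update: no integration over θ required.
successes = np.sum(clap_data)           # claps in the new batch of data
failures = len(clap_data) - successes   # non-claps in the new batch

# Closed-form posterior: add the new counts to the prior's α and β
a_post = a + successes
b_post = b + failures
posterior_pdf = stats.beta.pdf(x=theta_range, a=a_post, b=b_post)

# Posterior mode (the MAP estimate) of a Beta(α, β) with α, β > 1
theta_map = (a_post - 1) / (a_post + b_post - 2)

# When the next batch of claps arrives, Beta(a_post, b_post) becomes the new prior.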
