Amy Ma

Summary

The article delves into the concepts of likelihood, Maximum Likelihood Estimation (MLE), and Maximum A Posteriori (MAP) estimation, using the analogy of a cat's food preferences to clarify these statistical methods in the context of machine learning.

Abstract

The article "Courage to Learn ML: Decoding Likelihood, MLE, and MAP" simplifies complex machine learning concepts by discussing likelihood, MLE, and MAP through a conversational approach. It distinguishes likelihood from probability, emphasizing likelihood's role in parameter estimation given observed data. The author illustrates MLE's importance in machine learning by comparing it to the least squares method, highlighting MLE's probabilistic approach to model evaluation. The article further elaborates on MAP estimation, which incorporates prior beliefs about parameters, and contrasts it with MLE, discussing when one might be preferred over the other. The narrative is enriched with intuitive examples, such as a cat's preference for chicken over beef, to make these statistical concepts more relatable and understandable.

Opinions

  • The author suggests that understanding the nuances between likelihood, MLE, and MAP is crucial for grasping parameter estimation in machine learning models.
  • The article posits that while MLE is a powerful tool for model evaluation based solely on observed data, MAP provides a more comprehensive approach by integrating prior knowledge.
  • It is implied that the choice between MLE and MAP should be informed by the context of the problem, including the availability of prior information and computational considerations.
  • The author expresses that the use of intuitive examples, such as the cat food preference, can significantly enhance the learning process for complex statistical concepts.
  • The article hints at the computational advantages of MLE over MAP in certain scenarios, particularly when prior information is not readily available or when computational efficiency is paramount.
  • It is conveyed that a solid understanding of MLE and MAP can offer a fresh perspective on other machine learning topics, such as L1 and L2 regularization, which are to be explored in subsequent posts.

Courage to Learn ML: Decoding Likelihood, MLE, and MAP

With A Tail of Cat Food Preferences

Photo by Anastasiia Rozumna on Unsplash

Welcome to ‘Courage to Learn ML’. This series aims to simplify complex machine learning concepts, presenting them as a relaxed and informative dialogue, much like the engaging style of “The Courage to Be Disliked,” but with a focus on ML.

In this installment of our series, our mentor-learner duo dives into a fresh discussion on statistical concepts like MLE and MAP. This discussion will lay the groundwork for us to gain a new perspective on our previous exploration of L1 & L2 Regularization. For a complete picture, I recommend reading this post before reading the fourth part of ‘Courage to Learn ML: Demystifying L1 & L2 Regularization’.

This article is designed to tackle fundamental questions that might have crossed your path, in Q&A style. As always, if you find yourself asking similar questions, you’ve come to the right place:

  • What exactly is ‘likelihood’?
  • The difference between likelihood and probability
  • Why is likelihood important in the context of machine learning?
  • What is MLE (Maximum Likelihood Estimation)?
  • What is MAP (Maximum A Posteriori Estimation)?
  • The difference between MLE and Least square
  • The Links and Distinctions Between MLE and MAP

What exactly is ‘likelihood’?

Likelihood, or more specifically the likelihood function, is a statistical concept used to evaluate the probability of observing the given data under various sets of model parameters. It is called likelihood (function) because it’s a function that quantifies how likely it is to observe the current data for different parameter values of a statistical model.

Likelihood seems similar to probability. Is it a form of probability, and if not, how does it differ from probability?

The concepts of likelihood and probability are fundamentally different in statistics. Probability measures the chance of observing a specific outcome in the future, given known parameters or distributions. In this scenario, the parameters or the distribution are known, and we’re interested in predicting the probability of various outcomes. Likelihood, in contrast, measures how well a set of underlying parameters explains the observed outcomes. In this setting, the outcomes are already observed, and we seek to understand what underlying parameters or conditions could have led to these outcomes.

To illustrate this with an intuitive example, consider my cat Bubble’s preference for chicken over beef.

A photo of my cat, Bubble

When I buy cat food, I choose more chicken-flavored cans because I know there’s a higher probability she will enjoy them and finish them all. This is an application of probability, where I use my knowledge of Bubble’s preferences to predict future outcomes. However, Bubble’s preference is not something she explicitly communicates. I inferred it by observing her eating habits over the past six years. Noticing that she consistently eats more chicken than beef indicates a higher likelihood of her preferring chicken. This inference process is an example of using likelihood.

It’s important to note that, in statistics, likelihood is a function. This function quantifies how well a particular set of parameters explains the observed data. Unlike probability, the values of a likelihood function do not necessarily sum up to 1. Probability sums over all possible outcomes for a fixed set of parameters, and that sum must be 1; likelihood, in contrast, varies the parameters while holding the observed data fixed, and nothing constrains those values to sum to 1.
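To make this concrete, here is a small Python sketch. The Bernoulli/binomial model and the meal counts (Bubble picking chicken in 8 of 10 meals) are my own hypothetical illustration, not data from the article:

```python
import math

def binom_prob(theta, chicken, meals):
    """Probability of observing `chicken` chicken-meals out of `meals`
    under a binomial model with chicken-preference parameter `theta`."""
    return math.comb(meals, chicken) * theta**chicken * (1 - theta)**(meals - chicken)

# Hypothetical data: Bubble picked chicken in 8 of 10 meals.
chicken, meals = 8, 10

# Probabilities over all outcomes (theta fixed) sum to 1 ...
total_over_outcomes = sum(binom_prob(0.8, k, meals) for k in range(meals + 1))

# ... but likelihood values over a grid of thetas (data fixed) do not.
thetas = [i / 10 for i in range(11)]
total_over_thetas = sum(binom_prob(t, chicken, meals) for t in thetas)

print(round(total_over_outcomes, 6))  # 1.0
print(round(total_over_thetas, 6))    # not 1.0
```

The first sum walks over outcomes with the parameter held fixed (probability); the second walks over parameters with the data held fixed (likelihood), and only the first is guaranteed to total 1.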

Why is likelihood important in the context of machine learning?

Understanding the application of likelihood in the machine learning context requires us to consider how we evaluate model results. Essentially, we need a set of rules to judge between different sets of parameters. There are two primary approaches to measure how well a model, with its current parameters, explains the observed data:

The first method involves using a difference-based approach. We compare each true label with the corresponding prediction and attempt to find a set of model parameters that minimizes these differences. This is the basic idea behind the least squares method, which focuses on error minimization.

The second method is where likelihood, specifically Maximum Likelihood Estimation (MLE), comes into play. MLE seeks the set of parameters that makes the observed data most probable. In other words, given the data, we choose the parameters that maximize the likelihood of observing that data set. This approach goes beyond just minimizing error; it takes a probabilistic view, modeling the uncertainty in parameter estimation.

In Maximum Likelihood Estimation (MLE), the underlying assumption is that the optimal parameters for a model are those that maximize the likelihood of observing the given dataset.

In summary, while the least squares method and MLE differ in their approaches — one being error-minimizing and the other probabilistic — both are essential in the machine learning toolkit for parameter estimation and model evaluation. We will explore these methods further, discussing their differences and connections, in future posts.

Could you provide an intuitive example to contrast those two evaluation approaches (MLE vs. least squares)?

Considering my cat Bubble’s preference for food, let’s say I initially assume she likes chicken and beef equally. To test this using the least squares method, I would collect data by buying an equal number of chicken and beef flavored cans. As Bubble eats, I’d record how much of each she consumes. The least squares method would then help me adjust my initial assumption (parameters) by minimizing the difference (squared error) between my prediction (equal preference) and the actual consumption pattern (true labels).
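A quick sketch of the least-squares side of this story. The 0/1 meal records below are hypothetical; the point is that minimizing squared error over a preference parameter recovers the observed average:

```python
# Hypothetical meal log: 1 = Bubble ate chicken, 0 = Bubble ate beef.
observed = [1, 1, 0, 1, 1, 0, 1, 1]

def squared_error(p):
    """Total squared difference between a predicted preference p
    and each observed choice."""
    return sum((y - p) ** 2 for y in observed)

# Search a grid of candidate preferences for the error minimizer.
grid = [i / 100 for i in range(101)]
best = min(grid, key=squared_error)
print(best)  # 0.75, the mean of the observations
```

Starting from the "equal preference" guess of 0.5, the squared error pushes the estimate toward the observed fraction of chicken meals.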

For the MLE approach, instead of starting with an assumption about Bubble’s preference, I would first observe her eating habits over time. Based on this data, I’d use MLE to find the parameter values (in this case, preference for chicken or beef) that make the observed data most probable. For example, if Bubble consistently chooses chicken over beef, the MLE method would identify a higher probability for chicken preference.
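And the MLE side, under the same kind of hypothetical Bernoulli setup (8 chicken picks out of 10 meals, counts of my own invention): we search for the preference parameter that makes the observed choices most probable.

```python
def likelihood(theta, chicken=8, meals=10):
    """Likelihood of the observed chicken/beef counts under
    a Bernoulli model with chicken-preference theta."""
    return theta**chicken * (1 - theta)**(meals - chicken)

# Grid search for the theta that maximizes the likelihood.
thetas = [i / 100 for i in range(101)]
best = max(thetas, key=likelihood)
print(best)  # 0.8, the observed fraction of chicken meals
```

For this simple model the maximizer is exactly the observed fraction of chicken meals, which matches the intuition that consistent chicken-choosing implies a high chicken preference.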

So MLE uses likelihood to select parameters. What are their mathematical representations?

In Maximum Likelihood Estimation (MLE), the primary goal is to identify the set of parameters (θ) that most likely produces the observed data. This process involves defining the likelihood function, denoted as L(θ) or L(θ∣x), where x represents the observed data. The likelihood function calculates the probability of observing the given data x assuming the model parameters are θ.

The essence of MLE is to find the parameter values that maximize the likelihood function. Mathematically, assuming the observations x = (x₁, …, xₙ) are independent, this is written as:

θ̂_MLE = argmax_θ L(θ) = argmax_θ ∏ᵢ p(xᵢ∣θ)

The post will explore this equation in depth in subsequent questions.

Hold on a moment… we define likelihood as L(θ) = p(x|θ), signifying the probability of observing the data x given a set of parameters θ. But earlier, we mentioned that likelihood involves having a set of observations and then calculating the likelihood for a set of parameters. Shouldn’t it be L(θ) = p(θ|x) instead?

In understanding MLE, it’s crucial to distinguish between the likelihood function and probability. The likelihood function, denoted as L(θ), is not the same as the probability p(θ∣x).

While p(θ∣x) refers to the probability of the parameter values θ given the observed data x (a concept central to Bayesian inference), L(θ) is about the likelihood function, which evaluates how plausible different parameter values are in explaining the observed data.

For calculating the likelihood function, we use the probability of observing the data x given certain parameter values θ, denoted as p(x∣θ). This probability is used to assess how adequate different parameter settings are. Therefore, in MLE, we have L(θ) = p(x∣θ). It’s important to interpret this equation correctly: the equal sign here signifies that we calculate the likelihood L(θ) using the probability p(x∣θ); it does not imply a direct equivalence between L(θ) and p(x∣θ).

In summary, L(θ) quantifies how well the parameters θ explain the data x, while p(θ∣x) is about the probability of the parameters after observing the data. Understanding this distinction is fundamental to grasping the principles of MLE and its application in statistical modeling.

But wouldn’t using p(θ|x) provide a more direct evaluation of which parameter set is better, instead of relying on the likelihood function?

I’m glad you noticed this important distinction. Theoretically, calculating p(θ|x) for different parameter sets θ and choosing the one with the highest probability would indeed provide a direct evaluation of which set of parameters is better. This is achievable through Bayes’ theorem, which helps in computing the posterior probability p(θ|x).

To calculate this posterior, we consider three key elements:

  • Likelihood p(x|θ): This represents how probable the observed data is given a set of parameters. It’s the basis of MLE, focusing on how well the parameters explain the observed data.
  • Prior p(θ): This reflects our initial beliefs about the parameters before observing any data. It’s an essential part of Bayesian inference, where prior knowledge about the parameter distribution is factored in.
  • Marginal Likelihood, or Evidence, p(x): This measures how probable the observed data is under all possible parameter sets, essentially assessing the probability of observing the data without making specific assumptions about the parameters.

In practice, the marginal likelihood p(x) can often be ignored, especially when comparing different sets of parameters, as it remains constant and doesn’t influence the relative comparison.

With Bayes’ theorem, we find that the posterior p(θ|x) is proportional to the product of the likelihood and the prior, p(x|θ) · p(θ).

This means to compare different parameter sets, we must consider both our prior beliefs about the parameters and the likelihood, which is how the observed data modifies our beliefs. Like MLE, in MAP (Maximum A Posteriori Estimation), we seek to maximize the posterior to find the best set of model parameters, integrating both prior knowledge and observed data.

So, MAP incorporates an additional element, which is our prior belief about the parameter.

Correct. MAP indeed uses an extra piece of information, which is our prior belief about the parameters. Let’s use the example of my cat Bubble (again) to illustrate this. In the context of MAP, when determining Bubble’s preferred food flavor — beef or chicken — I would consider a hint from the breeder. The breeder mentioned that Bubble likes to eat boiled chicken breast, so this information forms my prior belief that Bubble may prefer chicken flavor. Consequently, when initially choosing her food, I would lean towards buying more chicken-flavored food. This approach of incorporating the breeder’s insight represents the ‘prior’ in MAP estimation.
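Here is how that breeder hint might look in code. Everything numeric below is hypothetical: I encode the hint as a Beta(4, 2)-shaped prior leaning toward chicken, observe a small sample of meals, and combine the two via Bayes’ rule:

```python
def likelihood(theta, chicken=3, meals=4):
    """Likelihood of a small sample: 3 of 4 observed meals were chicken."""
    return theta**chicken * (1 - theta)**(meals - chicken)

def prior(theta):
    """Beta(4, 2)-shaped prior (up to a constant) encoding the breeder's
    hint that Bubble leans toward chicken."""
    return theta**3 * (1 - theta)

# Grid over preference values, avoiding the endpoints 0 and 1.
thetas = [i / 100 for i in range(1, 100)]

# Posterior is proportional to likelihood * prior; p(x) only rescales,
# so the argmax is unchanged if we skip normalization.
unnormalized = [likelihood(t) * prior(t) for t in thetas]
map_estimate = thetas[unnormalized.index(max(unnormalized))]
print(map_estimate)  # 0.75
```

With so few observed meals, the prior contributes a lot; as more meals accumulate, the likelihood term dominates and the data speaks for itself.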

I understand that MAP and MLE are related, with MAP adding in our assumption about the parameter. Can you offer a more straightforward example to show me the difference and connections between those two methods?

To demonstrate the connection between MAP and MLE, I’ll introduce some mathematical formulas. While the goal of this discussion is to intuitively explain machine learning concepts through dialogue, showcasing these functions will help. Don’t fret over complexity; these formulas simply highlight the extra insights MAP offers compared to MLE for a clearer understanding.

Maximum Likelihood Estimation (MLE) focuses on identifying the parameter set θ that makes the observed data x most probable. It achieves this by maximizing the likelihood function p(x∣θ).

However, directly maximizing a product of probabilities, each typically less than 1, can be impractical due to computational underflow, a condition where numbers become too small to be represented accurately. To overcome this, we take logarithms, transforming the product into a sum. Since the logarithm is monotonically increasing, maximizing a function is equivalent to maximizing its logarithm. Thus, the MLE objective is usually written as a sum of log-probabilities: θ̂_MLE = argmax_θ Σᵢ log p(xᵢ∣θ).
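The underflow problem is easy to demonstrate. The per-observation probabilities below are made up; the point is only how floating-point arithmetic behaves:

```python
import math

# 100 hypothetical per-observation probabilities.
probs = [1e-5] * 100

# Multiplying them underflows: the true value is 1e-500, far below
# what a 64-bit float can represent, so the product collapses to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing their logarithms stays perfectly well-behaved.
log_sum = sum(math.log(p) for p in probs)  # = 100 * log(1e-5), finite

print(product)   # 0.0
print(log_sum)
```

This is why log-likelihoods, not raw likelihoods, appear in virtually every MLE implementation.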

On the other hand, Maximum A Posteriori (MAP) estimation aims to maximize the posterior probability. Applying Bayes’ theorem, maximizing the posterior is equivalent to maximizing the product of the prior p(θ) and the likelihood. As in MLE, we introduce logarithms to simplify computation, converting the product into a sum: θ̂_MAP = argmax_θ [ Σᵢ log p(xᵢ∣θ) + log p(θ) ].

The primary distinction between MLE and MAP lies in the inclusion of the prior P(θ) in MAP. This addition means that in MAP, the likelihood is effectively weighted by the prior, influencing the estimation based on our prior beliefs about the parameters. In contrast, MLE does not include such a prior and focuses solely on the likelihood derived from the observed data.
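In log space the two objectives differ by exactly one term, which a short sketch makes visible. The counts and the Beta(4, 2)-shaped prior are hypothetical; note how the prior pulls the estimate toward chicken when the observed sample disagrees with it:

```python
import math

# Hypothetical small sample: only 1 of 4 observed meals was chicken.
chicken, meals = 1, 4

def log_likelihood(theta):
    return chicken * math.log(theta) + (meals - chicken) * math.log(1 - theta)

def log_prior(theta):
    """Log of a Beta(4, 2)-shaped prior favoring chicken, up to a constant."""
    return 3 * math.log(theta) + math.log(1 - theta)

thetas = [i / 100 for i in range(1, 100)]  # avoid log(0) at the endpoints

mle = max(thetas, key=log_likelihood)
map_ = max(thetas, key=lambda t: log_likelihood(t) + log_prior(t))

print(mle)   # 0.25: the data alone
print(map_)  # 0.5: the prior pulls the estimate toward chicken
```

The only difference between the two `max` calls is the `log_prior(t)` term, which is precisely the primary distinction described above.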

It seems like MAP might be superior to MLE. Why don’t we always opt for MAP then?

MAP estimation incorporates our pre-existing knowledge about the parameter distribution, but that doesn’t inherently make it superior to MLE. There are several factors to consider:

  • An assumption about the parameter distribution isn’t always available. Moreover, when the prior is assumed to be uniform, MAP and MLE yield equivalent results.
  • The computational simplicity of MLE often makes it a more practical choice. While MAP provides a comprehensive Bayesian approach, it can be computationally intensive.
  • MAP’s effectiveness heavily relies on the selection of an appropriate prior. An inaccurately chosen prior can lead to increased computational costs for MAP to identify an optimal set of parameters.
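The first point in the list, that a uniform prior makes MAP collapse to MLE, can be checked directly. A uniform prior has a constant density, so its log is a constant and drops out of the argmax (counts again hypothetical):

```python
import math

# Hypothetical data: Bubble picked chicken in 8 of 10 meals.
chicken, meals = 8, 10

def log_likelihood(theta):
    return chicken * math.log(theta) + (meals - chicken) * math.log(1 - theta)

# A uniform prior's log-density is the same constant everywhere,
# so adding it cannot change which theta wins the argmax.
log_uniform_prior = 0.0

thetas = [i / 100 for i in range(1, 100)]
mle = max(thetas, key=log_likelihood)
map_ = max(thetas, key=lambda t: log_likelihood(t) + log_uniform_prior)

assert mle == map_
print(mle)  # 0.8
```

With no informative prior to contribute, MAP has nothing extra to say, which is exactly when the computational simplicity of MLE wins.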

In our next session, our mentor-learner team will return to delve deeper into L1 and L2 regularization. Armed with a solid understanding of MLE and MAP, we’ll be able to view L1 and L2 regularization from a fresh perspective. Looking forward to seeing you in the next post!

