Rayhaan Rasheed


Image from Seo et al. (2006)

Introduction to Bayesian Decision Theory

A Statistical Approach to Machine Learning

Introduction

Whether we are building machine learning models or making decisions in everyday life, we tend to choose the path with the least risk. As humans, we are hardwired to take actions that aid our survival; machine learning models, however, are not initially built with that understanding. These algorithms need to be trained and optimized to choose the option with the least risk. Additionally, it is important to remember that risky decisions can carry severe consequences when they turn out to be wrong.

Image by Forbes

Consider the problem of cancer detection. Based on a patient's computerized tomography (CT) scan, can a radiologist determine the presence of a tumor? If they believe there is a tumor in the patient, then the physician needs to figure out if the tumor is benign or malignant to determine the proper treatment. Since the purpose of this article is to describe the statistical approach for making these decisions, I will only focus on breaking down the first part of the problem: is there a tumor, yes or no?

Bayes’ Theorem

One of the most well-known equations in the world of statistics and probability is Bayes’ Theorem (see formula below). The basic intuition is that the probability of some class or event occurring, given some feature (i.e. attribute), is calculated based on the likelihood of the feature’s value and any prior information about the class or event of interest. This seems like a lot to digest, so I will break it down for you. First off, the case of cancer detection is a two-class problem. The first class, ω1, represents the event that a tumor is present, and ω2 represents the event that a tumor is not present.
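In the two-class notation used here, Bayes’ Theorem reads (for i = 1, 2):

```latex
P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}
```

where p(x), the denominator, is the evidence described later in this article.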

Prior

There are four parts to Bayes’ Theorem: prior, evidence, likelihood, and posterior. The priors, P(ω1) and P(ω2), define how likely it is for event ω1 or ω2 to occur in nature. It is important to realize that the priors vary depending on the situation. Since the objective is to detect cancer, it is safe to say that the probability of a tumor being present is quite low: P(ω1) < P(ω2).

Likelihood

At a high level, a CT scan is produced by rotating an x-ray source around the patient. One of the key metrics it produces is attenuation, a measurement of x-ray absorption. Denser objects have higher attenuation and vice versa. Therefore, a tumor is likely to have a higher attenuation than the surrounding lung tissue.

Suppose you only look at attenuation values to help make your decision between ω1 and ω2. Each class has a class-conditional probability density, p(x|ω1) and p(x|ω2), called a likelihood. The figure below shows a hypothetical class-conditional probability density for p(x|ω). These distributions are estimated from your training data; however, it is always good to have domain expertise to check the validity of the data.

Photo from Pattern Recognition by Duda, Hart & Stork
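The class-conditional densities can be sketched numerically. Below is a minimal example modeling each likelihood as a Gaussian; the means and spreads are invented for illustration (tumors are simply assumed to sit at a higher attenuation than lung tissue), not values from the article.

```python
from statistics import NormalDist

# Hypothetical class-conditional densities p(x | omega_1) and p(x | omega_2).
# All parameters are assumptions chosen only to illustrate the idea.
p_x_given_tumor = NormalDist(mu=40.0, sigma=10.0)     # p(x | omega_1): tumor present
p_x_given_tissue = NormalDist(mu=-700.0, sigma=50.0)  # p(x | omega_2): no tumor

x = 30.0  # an observed attenuation value
print(p_x_given_tumor.pdf(x))   # density of x under "tumor present"
print(p_x_given_tissue.pdf(x))  # density of x under "no tumor"
```

A high attenuation reading like this one is far more probable under the tumor density than under the tissue density, which is exactly the information the likelihood contributes to Bayes’ Theorem.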

Evidence

The best way to describe the evidence, p(x), is through the law of total probability. This law states that if you have mutually exclusive events (e.g. ω1 and ω2) whose probabilities sum to 1, then the probability of observing some feature value (e.g. attenuation) is the likelihood times the prior, summed across all of the mutually exclusive events.
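For the two-class case considered here, the law of total probability gives:

```latex
p(x) = p(x \mid \omega_1)\,P(\omega_1) + p(x \mid \omega_2)\,P(\omega_2)
```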

Posterior

The result of applying Bayes’ Theorem is called the posterior, P(ω1|x) and P(ω2|x). The posterior represents the probability that an observation falls into class ω1 or ω2 (i.e., tumor present or not) given the measurement x (e.g., attenuation). Each observation receives a posterior probability for every class, and the posteriors must add up to 1. For the cancer detection problem we are trying to solve, there are two posterior probabilities. The image below is a hypothetical scenario of how the posterior values could change with respect to a measurement x. The posteriors are shaped by the likelihoods, but they can also be heavily affected by the priors, P(ω1) and P(ω2).

Decision Rules

Now that we have a good understanding of Bayes’ theorem, it’s time to see how we can use it to make a decision boundary between our two classes. There are two methods for determining whether a patient has a tumor present or not. The first is a basic approach that only uses the prior probability values to make a decision. The second way utilizes the posteriors, which takes advantage of the priors and class-conditional probability distributions.

Using the Priors

Suppose we only make a decision based on the natural prior probabilities. This means we ignore all the other factors in Bayes’ Theorem. Since the probability of having a tumor, P(ω1), is far less than that of not having one, P(ω2), our model/system will always decide that a patient does not have a tumor. Even though the model/system will be correct most of the time, it will never identify the patients who actually have a tumor and need proper medical attention.

Using the Posteriors

Now let’s take a more comprehensive approach by using the posteriors, P(ω1|x) and P(ω2|x). Since the posteriors come from Bayes’ Theorem, the influence of the priors is tempered by the class-conditional probability densities, p(x|ω1) and p(x|ω2). If our model/system observes a region with higher attenuation than ordinary tissue, then the probability of a tumor being present increases despite the low natural prior. Suppose there is a 75% chance that a specific region contains a tumor; that would mean there is a 25% chance there is no tumor at all. That 25% is our probability of error, also known as risk.
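The whole pipeline, priors, likelihoods, evidence, posteriors, and the decision rule, can be sketched in a few lines. Every number below (the priors, the Gaussian parameters, the observed attenuation) is a hypothetical value chosen to illustrate the mechanics, not a clinical figure.

```python
from statistics import NormalDist

# Hypothetical priors: tumors are rare in nature.
prior_tumor = 0.01      # P(omega_1)
prior_no_tumor = 0.99   # P(omega_2)

# Hypothetical class-conditional densities (invented parameters).
likelihood_tumor = NormalDist(mu=40.0, sigma=10.0)       # p(x | omega_1)
likelihood_no_tumor = NormalDist(mu=-700.0, sigma=50.0)  # p(x | omega_2)

def posteriors(x: float) -> tuple[float, float]:
    """Return (P(omega_1 | x), P(omega_2 | x)) via Bayes' Theorem."""
    joint_tumor = likelihood_tumor.pdf(x) * prior_tumor
    joint_no_tumor = likelihood_no_tumor.pdf(x) * prior_no_tumor
    evidence = joint_tumor + joint_no_tumor  # p(x), law of total probability
    return joint_tumor / evidence, joint_no_tumor / evidence

x = 35.0  # a high attenuation reading
p_tumor, p_no_tumor = posteriors(x)
decision = "tumor" if p_tumor > p_no_tumor else "no tumor"
risk = min(p_tumor, p_no_tumor)  # probability of error for this observation
print(decision, risk)
```

Note how the rare-tumor prior is overridden once the measurement strongly favors the tumor density: the high attenuation makes p(x|ω1) dwarf p(x|ω2), so the posterior for ω1 dominates even though P(ω1) is small.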

Conclusion

What you have just learned is a simple, univariate application of Bayesian Decision Theory. It can be extended to a larger feature space by modeling the likelihoods with multivariate Gaussian distributions. Although this article focused on the problem of cancer detection, Bayes’ Theorem is used in a variety of disciplines, including investing, marketing, and systems engineering.

Resources

[1] Seo, Young-Woo (2006). "Cost-Sensitive Access Control for Illegitimate Confidential Access by Insiders." Proceedings of IEEE Intelligence and Security Informatics: 23–24 May 2006, vol. 3975, pp. 117–128. doi:10.1007/11760146_11.

[2] Duda, R. O., Hart, P. E., Stork, D. G. (2001). Pattern Classification. New York: Wiley. ISBN: 978-0-471-05669-0.

[3] Glatter, R. (2015). "Medicare To Cover Low-Dose CT Scans For Those At High Risk For Lung Cancer." Forbes.
