Tarun Gupta

Summary

The webpage content discusses the challenges of handling continuous data and the zero frequency problem in Naive Bayes Classifiers, presenting mathematical and conceptual solutions.

Abstract

The article delves into the intricacies of applying Naive Bayes Classifiers to continuous data, suggesting discretization or Gaussian distribution assumptions for probability estimation. It also addresses the zero frequency problem, where a zero probability assignment can skew classification results, and introduces the Laplace Estimator as a solution to mitigate this issue. The author provides a detailed example using a dataset with both ordinal and numerical attributes to illustrate the classification process and the application of the Laplace Estimator under a uniform distribution assumption. The piece concludes with an invitation for readers to support the author by becoming Medium members and offers additional resources for further reading.

Opinions

  • The author emphasizes the importance of understanding how to work with continuous data in Naive Bayes Classifiers.
  • There is an acknowledgment that the performance of Naive Bayes Classifiers can be significantly affected by the zero frequency problem.
  • The Laplace Estimator is presented as a valuable tool for addressing zero probabilities in classification tasks.
  • The article suggests that a uniform distribution assumption may be necessary when applying the Laplace Estimator.
  • The author encourages readers to engage with their content by offering a free eBook on consistency and inviting them to read more of their work.

Continuous Data and Zero Frequency Problem in Naive Bayes Classifier

How to handle it mathematically and conceptually

Photo by Kevin Ku on Unsplash

In the context of Supervised Learning (Classification), Naive Bayes, or rather Bayesian learning, serves as a gold standard for evaluating other learning algorithms and as a powerful probabilistic modelling technique in its own right. But working with Naive Bayes comes with some challenges.

  • It performs better on categorical data than on numeric data. So how do we perform classification with Naive Bayes when the data we have is continuous in nature?
  • If an instance in the test data set has a category that was not present during training, the classifier assigns it “zero” probability and cannot make a prediction. This is known as the zero frequency problem, and it skews the performance of the whole classification. Every machine learning practitioner should know how to tackle it when the situation arises.

In this post, we discuss how the Naive Bayes classifier works with numeric / continuous data and how it handles the zero frequency problem, so that both techniques can later be applied to a real-world dataset.

There are two ways to estimate the class-conditional probabilities for continuous attributes in naive Bayes classifiers:

  • We can discretize each continuous attribute and then replace the continuous attribute value with its corresponding discrete interval. This approach transforms the continuous attributes into ordinal attributes. The conditional probability P(X|Y=y), where Y is the target variable, is estimated by computing the fraction of training records belonging to class y that falls within the corresponding interval for X.

The estimation error depends on the discretization strategy as well as the number of discrete intervals. If the number of intervals is too large, there are too few training records in each interval to provide a reliable estimate for P(X|Y). On the other hand, if the number of intervals is too small, some intervals may aggregate records from different classes and we may miss the correct decision boundary. Hence, there is no rule of thumb for the discretization strategy. (A short code sketch of this approach follows Image 1 below.)

  • We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. A Gaussian distribution is usually chosen to represent the class-conditional probability for continuous attributes. The distribution is characterized by two parameters, its mean and variance.
Image 1: the Gaussian (normal) density used as the class-conditional probability, P(X = x | Y = y) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²)), where μ and σ² are the class-conditional mean and variance.
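To make both strategies concrete, here is a minimal Python sketch. It is not part of the original article: the bin edges in `discretize` are placeholder values chosen purely for illustration, and `gaussian_pdf` simply implements the density from Image 1.

```python
import math

def gaussian_pdf(x, mean, std):
    """Class-conditional density from Image 1: N(x; mean, std^2)."""
    return (1.0 / (math.sqrt(2 * math.pi) * std)) * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

def discretize(value, edges):
    """Strategy 1: map a continuous value to the index of its interval.
    `edges` are hypothetical bin boundaries picked by the practitioner."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)

# Temperature = 66 with placeholder bins, and with a Gaussian whose mean/std
# come from Image 2 (values assumed; see the later sketches)
print(discretize(66, edges=[60, 70, 80]))       # -> interval index 1
print(gaussian_pdf(66, mean=73.0, std=6.2))     # -> ~0.034
```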

Now that we have established the foundation of how to use a Gaussian distribution for continuous attributes, let's see how it can be used as a classifier in machine learning with an example:

Here is the dataset that we will use:

Data Mining by Witten, Frank and Hall

In this particular dataset, we have a total of 5 attributes: 4 independent variables (Outlook, Temperature, Humidity, Windy) and one dependent variable (Play) that we will predict. This is a binary classification problem because the dependent variable is boolean, containing either yes or no. The dataset is a mix of ordinal and numerical attributes: Temperature and Humidity are numerical, while Outlook and Windy are ordinal.
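For the calculations below to be reproducible, here is the dataset as code. The fourteen rows are assumed to be the classic numeric weather data from Witten, Frank and Hall (the original table is an image), so treat the exact values as an assumption; the class counts and conditional probabilities they yield match the ones used in this article.

```python
import pandas as pd

# Assumed 14-instance weather dataset (Witten, Frank and Hall); Play is the target.
data = [
    ("sunny",    85, 85, False, "no"),
    ("sunny",    80, 90, True,  "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy",    70, 96, False, "yes"),
    ("rainy",    68, 80, False, "yes"),
    ("rainy",    65, 70, True,  "no"),
    ("overcast", 64, 65, True,  "yes"),
    ("sunny",    72, 95, False, "no"),
    ("sunny",    69, 70, False, "yes"),
    ("rainy",    75, 80, False, "yes"),
    ("sunny",    75, 70, True,  "yes"),
    ("overcast", 72, 90, True,  "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy",    71, 91, True,  "no"),
]
df = pd.DataFrame(data, columns=["Outlook", "Temperature", "Humidity", "Windy", "Play"])
print(df["Play"].value_counts())   # yes: 9, no: 5  ->  priors 9/14 and 5/14
```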

Since this is a probabilistic rather than a deterministic approach, there is no iterative training phase for the model; every probability it needs is estimated directly from the data.

We are going to classify an instance

x = (Outlook = sunny, Temperature = 66, Humidity = 90, Windy = True)

For computing this, we need the prior probabilities of the target variable Play.

The total number of instances is 14; 9 of them have yes as the value and 5 of them have no.

p(yes) = 9/14

p(no) = 5/14

Grouped by the target variable, the distributions of the independent variables can be written as:

Image 2: per-class frequency counts for the ordinal attributes (Outlook, Windy) and per-class mean and standard deviation for the numeric attributes (Temperature, Humidity).
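A short sketch that rebuilds the contents of Image 2 from the DataFrame above; the exact means and standard deviations depend on the dataset values assumed earlier.

```python
# Per-class frequency counts for the ordinal attributes
print(pd.crosstab(df["Outlook"], df["Play"]))
print(pd.crosstab(df["Windy"], df["Play"]))

# Per-class mean and (sample) standard deviation for the numeric attributes
stats = df.groupby("Play")[["Temperature", "Humidity"]].agg(["mean", "std"])
print(stats)   # e.g. Temperature | yes: mean ~73.0, std ~6.2; Humidity | yes: mean ~79.1, std ~10.2
```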

In order to classify the instance x, we need to calculate the likelihood for both play = yes and play = no, and pick the larger one, as follows:

likelihood for play=yes

P(x/yes) * P(yes) = P(sunny/yes) * P(Temperature=66/yes) * P(Humidity=90/yes) * P(True/yes) * P(yes)

likelihood for play=no

P(x/no) * P(no) = P(sunny/no) * P(Temperature=66/no) * P(Humidity=90/no) * P(True/no) * P(no)

The attributes' individual probabilities are multiplied because of the naive (conditional) independence assumption.

For the attributes Temperature and Humidity, the probability can be computed using the Gaussian distribution formula in Image 1 by inserting the mean and variance values for those attributes from Image 2.
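As a sketch, plugging those statistics into `gaussian_pdf` from the earlier snippet (and using `stats` from the previous one) should reproduce the four Gaussian values listed below, up to rounding:

```python
mu, sigma = stats["Temperature"]["mean"], stats["Temperature"]["std"]
p_temp_yes = gaussian_pdf(66, mu["yes"], sigma["yes"])   # P(Temperature=66 | yes) ~0.034
p_temp_no  = gaussian_pdf(66, mu["no"],  sigma["no"])    # P(Temperature=66 | no)  ~0.028

mu, sigma = stats["Humidity"]["mean"], stats["Humidity"]["std"]
p_hum_yes = gaussian_pdf(90, mu["yes"], sigma["yes"])    # P(Humidity=90 | yes) ~0.022
p_hum_no  = gaussian_pdf(90, mu["no"],  sigma["no"])     # P(Humidity=90 | no)  ~0.038
```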

The values needed to calculate the above equations are:

P(sunny/yes) = 2/9

P(Temperature=66/yes) = 0.034

P(Humidity=90/yes) = 0.0221

P(True/yes) = 3/9

and

P(sunny/no) = 3/5

P(Temperature=66/no) = 0.0279

P(Humidity=90/no) = 0.0381

P(True/no) = 3/5

P(x/yes) * P(yes) = (2/9) * 0.034 * 0.0221 * (3/9) * (9/14) = 0.000036

P(x/no) * P(no) = (3/5) * 0.0279 * 0.0381 * (3/5) * (5/14) = 0.000137

0.000137 > 0.000036

Classification — NO
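Putting the pieces together for x = (sunny, 66, 90, True), a quick check of the comparison under the same assumptions:

```python
p_yes, p_no = 9 / 14, 5 / 14

like_yes = (2/9) * p_temp_yes * p_hum_yes * (3/9) * p_yes   # ~0.000036
like_no  = (3/5) * p_temp_no  * p_hum_no  * (3/5) * p_no    # ~0.000137

print("yes" if like_yes > like_no else "no")                # -> "no"
```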

Now that we have covered the handling of continuous / numeric data in the Naive Bayes classifier, let's dive into how to handle the zero frequency problem.

It occurs when a single condition with zero probability in the likelihood product drives the whole probability to zero. In such a case, something called the Laplace Estimator is used.

Image 3: the Laplace (m-)estimator, P(X = x | Y = y) = (nc + m · p) / (n + m)

where,

nc = number of instances where xi = x and yi = y,

n = number of instances where yi = y,

p = prior estimate; for example, when assuming a uniform distribution of attribute values, p = 1/m, with m being the number of different (unique) attribute values.

m = number of unique values for that attribute.

So, if a uniform distribution is assumed, the formula in Image 3 simplifies to the following:

Image 4: under the uniform assumption (p = 1/m), P(X = x | Y = y) = (nc + 1) / (n + m)
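A minimal sketch of the estimator, with the uniform-distribution shortcut of Image 4 as the default (p = 1/m); the counts in the demo call are hypothetical:

```python
def laplace_estimate(nc, n, m, p=None):
    """m-estimate of P(X = x | Y = y) from Image 3.

    nc: number of instances with X = x and Y = y
    n : number of instances with Y = y
    m : number of unique values of attribute X
    p : prior estimate for the value; defaults to the uniform 1/m,
        in which case the formula reduces to (nc + 1) / (n + m)  (Image 4)
    """
    if p is None:
        p = 1.0 / m
    return (nc + m * p) / (n + m)

# Hypothetical counts: an unseen value no longer gets a hard zero
print(laplace_estimate(nc=0, n=10, m=4))   # -> 1/14 ~ 0.071 instead of 0
```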

The formula in Image 3 can be a bit difficult to wrap your head around when you see it for the first time. Let's understand it better with the help of an example:

We are going to classify an instance using the same dataset and the distributions from Image 1 and Image 2.

x = (Outlook = overcast, Temperature = 66, Humidity = 90, Windy = True)

For computing this, we need the prior probabilities of the target variable Play.

The total number of instances is 14; 9 of them have yes as the value and 5 of them have no.

p(yes) = 9/14

p(no) = 5/14

The class-conditional distributions of the independent variables are the same as in Image 2.

In order to classify the instance x, we need to calculate the likelihood for both play = yes and play = no, and pick the larger one, as follows:

likelihood for play=yes

P(x/yes) * P(yes) = P(overcast/yes) * P(Temperature=66/yes) * P(Humidity=90/yes) * P(True/yes) * P(yes)

likelihood for play=no

P(x/no) * P(no) = P(overcast/no) * P(Temperature=66/no) * P(Humidity=90/no) * P(True/no) * P(no)

The new values needed to calculate the above equations are:

P(overcast/yes) = 4/9

and

P(overcast/no) = 0/5 = 0

The rest of the values needed to calculate the likelihoods are taken from the previous example.

P(x/yes) * P(yes) = (4/9) * 0.034 * 0.0221 * (3/9) * (9/14) = 0.000072

P(x/no) * P(no) = 0 * 0.0279 * 0.0381 * (3/5) * (5/14) = 0

0.000072 > 0

Classification — YES

Here, it can be seen that a single conditional probability, P(overcast/no) = 0, was the driving factor in the classification: it wiped out the entire likelihood for play = no. Now, let's see how we can employ the Laplace Estimator formula from Image 3 under the uniform distribution assumption (p = 1/m).

For Outlook = overcast, the new probability becomes

P(overcast/yes) = (4 + 3 * (1/3)) / (9 + 3) = 5/12

where,

nc = 4, since 4 instances where Outlook = overcast & play = yes,

n = 9, since total instances where play = yes,

m = 3, since the attribute Outlook has 3 unique values (sunny, overcast, rainy),

p = 1/m = 1/3, since the uniform distribution is assumed

Similarly,

P(overcast/no) = (0 + 3 * (1/3)) / (5 + 3) = 1/8

where,

nc = 0, since 0 instances where Outlook = overcast & play = no,

n = 5, since total instances where play = no,

m = 3, since the attribute Outlook has 3 unique values (sunny, overcast, rainy),

p = 1/m = 1/3, since the uniform distribution is assumed

Note: While applying the Laplace Estimator, make sure you apply it to all the ordinal attributes. You can't apply it only to the attribute where the zero frequency problem occurs.

Since the other ordinal attribute in our instance to classify is Windy, we need to apply the Laplace Estimator there as well. After applying it, the modified probabilities are:

For Windy = True, the new probability becomes

P(True/yes) = (3 + 2 * (1/2)) / (9 + 2) = 4/11

where,

nc = 3, since 3 instances where Windy = True & play = yes,

n = 9, since total instances where play = yes,

m = 2, since the attribute Windy has 2 unique values (True, False),

p = 1/m = 1/2, since the uniform distribution is assumed

Similarly,

P(True/no) = (3 + 2 * (1/2)) / (5 + 2) = 4/7

where,

nc = 3, since 3 instances where Windy = True & play = no,

n = 5, since total instances where play = no,

m = 2, since the attribute Windy has 2 unique values (True, False),

p = 1/m = 1/2, since the uniform distribution is assumed

P(x/yes) * P(yes) = (5/12) * 0.034 * 0.0221 * (4/11) * (9/14) = 0.0000731

P(x/no) * P(no) = (1/8) * 0.0279 * 0.0381 * (4/7) * (5/14) = 0.0000271

0.0000731 > 0.0000271

Classification — YES
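The same comparison as a sketch, reusing `laplace_estimate` and the Gaussian values from the earlier snippets:

```python
like_yes = laplace_estimate(4, 9, 3) * p_temp_yes * p_hum_yes * laplace_estimate(3, 9, 2) * (9/14)
like_no  = laplace_estimate(0, 5, 3) * p_temp_no  * p_hum_no  * laplace_estimate(3, 5, 2) * (5/14)

print(like_yes, like_no)                      # ~0.000073 vs ~0.000027
print("yes" if like_yes > like_no else "no")  # -> "yes"
```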

Even though the classification did not change, we now have sounder scientific reasoning behind our conclusion.

If you enjoy reading stories like these and want to support me as a writer, consider signing up to become a Medium member. It’s $5 a month, giving you unlimited access to stories on Medium. If you sign up using my link, I’ll earn a small commission at no extra cost to you.

I am giving away a free eBook on Consistency. Get your free eBook here.

Thank you for reading. I hope this cleared up the handling of continuous data and the zero frequency problem in the Naive Bayes classifier. Share it if you feel it can help others. You can read more of my posts here:

Data Science
Machine Learning
Computer Science
Towards Data Science
Classification