Handling Continuous features in Decision Trees

Summary

This article discusses methods for handling continuous features in decision trees, focusing on discretization techniques and splitting measures for optimal data partitioning.

Abstract

The article titled "[ML Shot of the Day]: Discretization of Continuous Attributes" delves into the strategies for managing continuous features in decision trees, a fundamental aspect of machine learning. It explains the process of choosing optimal splitting points for continuous attributes to improve the purity of subsets in decision tree nodes. The article covers popular splitting measures such as Information Gain, Gini Index, and Gain ratio, and illustrates how these measures calculate the impurity of child nodes relative to their parent nodes. It also addresses the challenge of handling continuous-valued attributes, contrasting binary splits using comparison operators with multiway splits using range buckets. The author details two discretization methods: equal width and equal frequency, and discusses the use of clustering algorithms for defining optimal categories. Additionally, the article explores a more optimized approach for binary attribute conversion by selecting midpoint candidates with different class labels to reduce computational complexity. The piece concludes with an invitation for readers to engage with the author on LinkedIn and a promotion for an AI service.

Opinions

The author emphasizes the importance of decision trees and their variants, such as Random Forests and XGBoost, in the machine learning community, including their use in competitions.
The article suggests that the field of AI/ML/DS is rapidly evolving, and there is a need to cover tricky concepts that might be overlooked, such as the discretization of continuous attributes.
The author encourages reader interaction and idea exchange, indicating a commitment to community engagement and continuous learning within the field.
By providing a cost-effective AI service recommendation, the author implies that there are more economical alternatives to popular AI services like ChatGPT Plus (GPT-4).

A Crash Course on Decision Trees and Splitting Measures:

Decision Trees and their variants (Random Forests, XGBoost, CatBoost) are widely used in the Machine Learning world, including in competitions.

Training a Decision Tree for a classification problem involves recursively splitting the data into smaller subsets until each node contains data belonging to a single class.

Different measures (Information Gain, Gini Index, Gain ratio) are used for determining the best possible split at each node of the decision tree.

Splitting Measures for growing Decision Trees:

Recursively growing a tree involves selecting an attribute and a test condition that divides the data at a given node into smaller but purer subsets.

The measures used for determining the best split compute the degree of impurity of the child nodes.

The reduction in impurity from the parent node to its child nodes is called the Gain. The higher the Gain (G), the better the split.

Let pₖ be the proportion of records belonging to class k at a given node. The impurity measures are given by:

Image by the Author

The Gain is computed as:

Image by the Author
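The formula images are not reproduced here, so the following is only a minimal sketch based on the standard textbook definitions, which may differ in detail from the author's figures: entropy as -Σ pₖ log₂ pₖ, the Gini index as 1 - Σ pₖ², and the Gain as the parent's impurity minus the size-weighted impurity of the children. The function names and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return p_k, the proportion of records in each class k."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def entropy(labels):
    """Entropy: -sum(p_k * log2(p_k)); zero for a pure node."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def gini(labels):
    """Gini index: 1 - sum(p_k ** 2); zero for a pure node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def gain(parent, children, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Toy labels (assumed for illustration only).
parent = ["A", "A", "A", "B", "B", "B"]
perfect_split = [["A", "A", "A"], ["B", "B", "B"]]
poor_split = [["A", "A", "B"], ["A", "B", "B"]]

print(gain(parent, perfect_split))           # Gini-based gain of a perfect split
print(gain(parent, poor_split))              # much smaller gain for a poor split
print(gain(parent, perfect_split, entropy))  # entropy-based gain
```

With entropy as the impurity measure, the Gain computed this way is what is usually called Information Gain.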

The curious case of Continuous Attributes:

Take some time to think about it (not long though, it's an ML shot).

The test condition for a continuous-valued attribute can either be expressed using a comparison operator (≥, ≤) or the attribute can be split into a finite set of range buckets. It is important to note that a comparison-based test condition gives us a binary split whereas range buckets give us a multiway split.

Image by the Author
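The difference between the two kinds of test condition can be sketched in a few lines. This is an illustrative example only: the attribute values, the threshold of 30, and the bucket edges are all made up.

```python
from bisect import bisect_right


def binary_test(value, threshold):
    """Comparison-based test: two possible outcomes (binary split)."""
    return value >= threshold


def bucket_test(value, edges):
    """Range-bucket test: index of the bucket containing value (multiway split).

    `edges` are the sorted interior boundaries; k edges produce k + 1 buckets,
    each closed on the left and open on the right.
    """
    return bisect_right(edges, value)


ages = [5, 18, 34, 52, 71]  # hypothetical attribute values

print([binary_test(a, 30) for a in ages])            # [False, False, True, True, True]
print([bucket_test(a, [18, 40, 65]) for a in ages])  # [0, 1, 1, 2, 3]
```

The binary test sends every record down one of two branches, while the bucket test sends it down one of four.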

Converting a continuous-valued attribute into a categorical attribute (multiway split):

An equal width approach converts the continuous data points into n categories, each of equal width. For instance, a continuous-valued attribute with a range of 0 to 50 can be converted into 5 categories of equal width: [0, 10), [10, 20), [20, 30), [30, 40), [40, 50]. The number of categories is a hyper-parameter.

The equal width approach is sensitive to outliers: a single extreme value stretches the overall range, so most data points crowd into a few categories while the others stay nearly empty.
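A minimal sketch of equal width binning, with an outlier added to show the sensitivity. The helper name `equal_width_bins` and the sample values are assumptions made for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equally wide intervals over [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # All values are identical, so a single bin is enough.
        return [0] * len(values)
    # The maximum would fall into bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


values = [0, 4, 12, 19, 23, 31, 38, 44, 50]
print(equal_width_bins(values, 5))        # [0, 0, 1, 1, 2, 3, 3, 4, 4]

# One outlier stretches the range and pushes everything else into bin 0.
with_outlier = values + [1000]
print(equal_width_bins(with_outlier, 5))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
```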

The equal frequency approach converts the continuous-valued attribute into n categories such that each category contains approximately the same number of data points.
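Equal frequency binning can be sketched by ranking the values and cutting the ranks into equal chunks. This is a simplified illustration, and the helper name is an assumption; because it works on ranks, identical values can end up in different categories, which a quantile-edge implementation would avoid.

```python
def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds roughly len(values) / n_bins points."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


# The outlier (1000) no longer distorts the categories.
values = [0, 4, 12, 19, 23, 31, 38, 44, 1000]
print(equal_frequency_bins(values, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```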

More sophisticated methods involve the use of unsupervised clustering algorithms to define the optimal categories.
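The article does not name a specific clustering algorithm. As one possible illustration, the sketch below assumes one-dimensional k-means (Lloyd's algorithm) with a simple deterministic initialization; the function names, the sample values, and the choice of k = 3 are all hypothetical.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm in one dimension; returns the sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic initialization: spread the starting centers over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Keep the previous center if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        if updated == centers:
            break  # converged
        centers = updated
    return sorted(centers)


def cluster_bins(values, centers):
    """Assign each value to the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: abs(v - centers[j])) for v in values]


values = [1, 2, 3, 20, 21, 22, 70, 71, 75]
centers = kmeans_1d(values, 3)
print(centers)                        # [2.0, 21.0, 72.0]
print(cluster_bins(values, centers))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Unlike the equal width and equal frequency approaches, the category boundaries here follow the natural gaps in the data.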

Converting a continuous-valued attribute into a binary attribute (two-way split):

A comparison-based test condition of the form attribute >= v requires determining the threshold v.

A brute-force approach that tries every single value of the continuous attribute as a threshold is computationally expensive.

A better way for identifying the split candidates involves sorting the values of the continuous attribute and taking the midpoint of the adjacent values in the sorted array.

As seen in the figure below, the potential candidates for the split can be narrowed down to -15, -9, 0, 12, and 21.

Image by the Author
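The midpoint step can be sketched as follows. The attribute values are hypothetical: they are not the author's data and were chosen only so that the midpoints match the candidates quoted in the text.

```python
def midpoint_candidates(values):
    """Midpoints between adjacent distinct values of the sorted attribute."""
    ordered = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


# Hypothetical values, picked so the midpoints match the candidates in the text.
values = [16, -10, 26, -20, 8, -8]
print(midpoint_candidates(values))  # [-15.0, -9.0, 0.0, 12.0, 21.0]
```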

Even so, evaluating every midpoint can still be computationally expensive: n distinct values yield n - 1 candidates, and each one requires computing the Gain.

A more optimized version keeps only the midpoints between adjacent sorted values whose class labels differ. This narrows the candidates down to -9 and 12, which is a significant improvement over the brute force approach.
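A minimal sketch of the label-based filtering, followed by picking the threshold with the highest Gini-based Gain. The values and labels are hypothetical, chosen only to reproduce the candidates -9 and 12 from the text. The sketch assumes all attribute values are distinct; repeated values with mixed labels would need extra handling.

```python
def boundary_candidates(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [
        (v1 + v2) / 2
        for (v1, y1), (v2, y2) in zip(pairs, pairs[1:])
        if y1 != y2 and v1 != v2
    ]


def gini(labels):
    """Gini index: 1 - sum(p_k ** 2)."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Score `value >= v` for each boundary candidate; return (v, gain) of the best."""
    best = None
    for v in boundary_candidates(values, labels):
        left = [y for x, y in zip(values, labels) if x < v]
        right = [y for x, y in zip(values, labels) if x >= v]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        score = gini(labels) - weighted
        if best is None or score > best[1]:
            best = (v, score)
    return best


# Hypothetical data: class A at both ends, class B in the middle.
values = [-20, -10, -8, 8, 16, 26]
labels = ["A", "A", "B", "B", "A", "A"]

print(boundary_candidates(values, labels))  # [-9.0, 12.0]
print(best_threshold(values, labels))
```

Only two of the five midpoints are scored here, so the Gain is computed twice instead of five times.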
