Summary

The provided content offers a visual guide to understanding and implementing the DecisionTreeRegressor from scikit-learn in Python, focusing on the impact of hyperparameters like max_depth and min_samples_leaf without delving into mathematical complexities.

Abstract

The article presented on the website is a comprehensive tutorial on the use of Decision Tree Regressor models in machine learning. It illustrates how these models predict continuous target variables by segmenting feature spaces into distinct regions, each with a unique prediction. The author emphasizes a non-mathematical approach, relying instead on visual aids to demonstrate the model's behavior with varying hyperparameters. The discussion includes the effect of max_depth on the complexity of the decision tree, the role of min_samples_leaf in determining the minimum number of observations in the terminal nodes, and the model's performance on both non-linear and linear datasets. The article also extends the analysis to datasets with two continuous features, using 3D visualization to explore the decision boundaries created by the model. The conclusion underscores the importance of visualization in understanding machine learning models and recommends the author's other writings on machine learning and data science.

Opinions

The author believes that visualizing model predictions on simple datasets is crucial for understanding how models work, particularly for decision trees.
The article suggests that decision trees are intuitive models, and their interpretability can be enhanced by visualizing the decision rules as a tree.
It is implied that decision trees are versatile and can handle both linear and non-linear datasets effectively, although they may not always be the most suitable model for linear data.
The author endorses the No Free Lunch Theory, indicating that one cannot predict the best model for a dataset in advance and that empirical testing is necessary.
The author advocates for the practical application of multiple models, including linear regression and decision trees, to determine the most appropriate one for the data.
There is a subtle endorsement for the use of Plotly for creating interactive 3D plots, which can provide a deeper understanding of the model's behavior in multi-dimensional feature spaces.
The author promotes their Medium membership, suggesting that readers can gain valuable insights from their detailed explanations of complex machine learning concepts.
A cost-effective AI service, ZAI.chat, is recommended as an alternative to ChatGPT Plus (GPT-4), highlighting the author's view on the importance of accessible AI tools for learning and application.

Decision Tree Regressor — A Visual Guide with Scikit Learn

Understanding Decision Trees without Math

In this article, we will implement the DecisionTreeRegressor from scikit-learn in python to visualize how this model works. We will not use any mathematical terms, but we will use visualization to demonstrate how a decision tree regressor works, and the impact of some hyperparameters.

For the context, a Decision Tree Regressor tries to predict a continuous target variable by cutting the feature variables into small zones, and each zone will have one prediction. We will begin with one continuous variable and then two continuous variables. We will not use categorical variables since for decision trees, continuous variables are ultimately treated like categorical data when the splits are made.

We will mainly study the impact of two hyperparameters that can be quite intuitive to understand: max depth and min_samples_leaf. Other hyperparameters are similar, the main idea is to limit the size of the rules.

One Non-Linear variable

Let’s use some simple data, with only feature variable x.

# Import the required libraries
from sklearn.tree import DecisionTreeRegressor
import numpy as np
# Define the dataset
X = np.array([[1], [3], [4], [7], [9], [10], [11], [13], [14], [16]])
y = np.array([3, 4, 3, 15, 17, 15, 18, 7, 3, 4])

And we can visualize the data in an (x,y) plot

import matplotlib.pyplot as plt

# Plot the dataset with the decision tree splits
plt.figure(figsize=(10,6))
plt.scatter(X, y, color='blue')
plt.show()

And in the following image, for the first cut, we can visually guess two possible splits as shown below:

Decision tree regressor visualization — image by author

Now, the Decision Tree Regressor model determines exactly which split is better. Let’s specify the argument max_depth=1, to get only one split:

from sklearn.tree import DecisionTreeRegressor

# Fit the decision tree model
model = DecisionTreeRegressor(max_depth=1)
model.fit(X, y)

# Generate predictions for a sequence of x values
x_seq = np.arange(0, 17, 0.1).reshape(-1, 1)
y_pred = model.predict(x_seq)

If we decide to get 4 regions, we can try max_depth=2, and we get:

We can then visualize how the max_depth hyperparameter impacts the final model with the image below:

We can also plot the same with another hyperparameter min_samples_leaf, which is the minimum number of observations that should be in the final regions (that we call leaves because at the end of a tree’s ramification, we find leaves!)

One “Linear” Feature

Decision Tree Regressor is a non-linear regressor. We can see how it behaves/models the data from the previous examples. What happens to “linear” data?

Let’s take this simple example of perfectly linear data:

import numpy as np

X=np.arange(1,13,1).reshape(-1,1)
y=np.concatenate((np.arange(1,12,1),12), axis=None)

plt.scatter(X,y)

You can see that the relationship is very simple: y = x!

If we use the classic decision tree visualization, you may immediately see how the model behaves.

Now, we can create the same plot below. And sometimes people are surprised to SEE that a Decision Tree divides the perfect line into regions and gives a few values for the predictions, even for a linear dataset.

Then people would comment by saying that the model is not adapted at all for this dataset, and we should not use it.

Now, the truth is that you can not know the behavior of the dataset in advance. And that is what the No Free Lunch Theory is all about.

And in practice, you can apply several models such as linear regression and decision trees. If there is one model that is significant more performant than another, then you can conclude about the linear vs. non-linear behavior of the data.

Two continuous features

For two continuous variables, we have to create a 3D plot.

First, let’s generate some data.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import plotly.graph_objs as go
from plotly.subplots import make_subplots

# Define the data
X = np.array([[1, 2], [3, 4], [4, 5], [7, 2], [9, 5], [10, 4], [11, 3], [13, 5], [14, 3], [16, 1],
              [10, 10], [16, 10], [12, 10]])
y = np.array([3, 4, 3, 15, 17, 15, 18, 7, 3, 4,8,10,13])

Then can create the model:

# Fit the decision tree model
model = DecisionTreeRegressor(max_depth=3)
model.fit(X, y)

Finally, we can create the 3D plot with plotly

# Create an interactive 3D plot with Plotly
fig = make_subplots(rows=1, cols=1, specs=[[{'type': 'surface'}]])

fig.add_trace(go.Surface(x=x_seq, y=y_seq, z=z_seq, colorscale='Viridis', showscale=True,opacity = 0.5),
              row=1, col=1)

fig.add_trace(go.Scatter3d(x=X[:, 0], y=X[:, 1], z=y, mode='markers', marker=dict(size=5, color='red')),
              row=1, col=1)

fig.update_layout(title='Decision Tree with Max Depth = {}'.format(max_depth),
                  scene=dict(xaxis_title='x1', yaxis_title='x2', zaxis_title='Predicted Y'))

fig.show()

We can compare different values of depth as shown in the image below:

If you create a plot with python, you can manipulate it to see the visualization from different angles.

Conclusion

Visualizing the prediction of a model for simple datasets is an excellent way to understand how the models work.

In the case of decision trees, they already are quite intuitive to understand with the visualization of the rules in form of a tree. The classic visualization with x,y (and z) can be complementary.

We also can SEE that the model is highly non-linear. And the dataset does not need any scaling.

I write about machine learning and data science and I explain complex concepts in a clear way. Please follow me with the link below and get full access to my articles: https://medium.com/@angela.shi/membership