Understanding Decision Trees
My notes on Decision Trees from the course — Analytics Edge
Introduction
In his book Data Science from Scratch, Joel Grus uses a very interesting example to help his readers understand the concept of Decision Trees. Since the example fits so well, I shall quote it here. He says: as children, how many of you remember playing the game of twenty questions? In this game, one child would think of an animal, a place, a famous personality, etc., and the others would ask questions to guess it. The game would go something like this:
“I am thinking of an animal.”
“Does it have more than five legs?”
“No”
“Is it delicious?”
“No”
“Does it appear on the back of the Australian 5 cent coin?”
“Yes”
“Is it an echidna?”
“Yes, it is!”
Now let's create a slightly more elaborate graph for the "Guess the Animal" game we just played.

This is exactly how we would create a Decision Tree for any data science problem. Now let us study the math behind it in detail.
The following article consists primarily of the notes I made while taking the course Analytics Edge on edX.
What is a Decision Tree?
A Decision Tree is a supervised learning predictive model that uses a set of binary rules to calculate a target value. It is used for either classification (categorical target variable) or regression (continuous target variable). Hence, it is also known as CART (Classification & Regression Trees). Some real-life applications of Decision Trees include:
- Credit scoring models in which the criteria that cause an applicant to be rejected need to be clearly documented and free from bias
- Marketing studies of customer behavior such as satisfaction or churn, which will be shared with management or advertising agencies
- Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression
Structure of a Decision Tree
Decision trees have three main parts:
- Root Node: The node that performs the first split. In the "Guess the Animal" example above, the root node would be the first question, "Does it live in water?"
- Terminal Nodes/Leaves: Nodes that predict the outcome. Likewise, in the example above, the terminal nodes would be the animals themselves: bull, cow, lion, tiger, etc.
- Branches: arrows connecting nodes, showing the flow from question to answer.
The root node is the starting point of the tree, and both the root and the internal nodes contain questions or criteria to be answered. Each node typically has two or more nodes extending from it. For example, if the question in the first node requires a "yes" or "no" answer, there will be one child node for a "yes" response and another for "no."

The Algorithm behind Decision Trees
The decision tree algorithm works by repeatedly partitioning the data into multiple sub-spaces so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning. The result is a set of rules used for predicting the outcome variable, which can be either:
- a continuous variable, for regression trees
- a categorical variable, for classification trees
The decision rules generated by the CART (Classification & Regression Trees) predictive model are generally visualized as a binary tree.
Let’s look at an example to understand it better. The plot below shows sample data for two independent variables, x and y, and each data point is colored by the outcome variable, red or grey.

CART tries to split this data into subsets so that each subset is as pure or homogeneous as possible. The first three splits that CART would create are shown here.



If a new observation fell into any of these subsets, its prediction would be decided by the majority of the observations in that particular subset.

Let us now see how a Decision Tree algorithm generates a TREE. The tree for the splits we just generated is shown below.

- The first split tests whether the variable x is less than 60. If yes, the model says to predict red, and if no, the model moves on to the next split.
- Then, the second split checks whether or not the variable y is less than 20. If no, the model says to predict grey, but if yes, the model moves on to the next split.
- The third split checks whether or not the variable x is less than 85. If yes, the model says to predict red, and if no, the model says to predict grey.
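To make these rules concrete, here is a minimal sketch in R of the same decision logic, written as a plain function (the function name and the test points are made up for illustration):
# Hypothetical function applying the three splits described above
classify_point <- function(x, y) {
  if (x < 60) return("red")    # first split: x < 60 -> red
  if (y >= 20) return("grey")  # second split: y >= 20 -> grey
  if (x < 85) return("red")    # third split: x < 85 -> red
  "grey"                       # otherwise -> grey
}
classify_point(40, 10)   # returns "red"   (x < 60)
classify_point(70, 30)   # returns "grey"  (y >= 20)
classify_point(90, 10)   # returns "grey"  (x >= 85 and y < 20)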
Advantages of Decision Trees
- They are quite interpretable and easy to understand.
- They can also be used to identify the most significant variables in your dataset.
Predictions from Decision Trees
In the above example, we discussed Classification Trees, i.e., trees where the output is a factor/category: red or grey. Trees can also be used for regression, where the output at each leaf of the tree is no longer a category but a number. These are called Regression Trees.
Classification Trees:
With Classification Trees, the default prediction at each leaf is the majority outcome in that leaf. However, instead of just taking the majority outcome to be the prediction, we can compute the percentage of data of each outcome type in the leaf and use a threshold on it.
Let us understand it through the same example that we used above.

The above dataset has been split into four subsets.
Predictions for Subset 1:
- Red data = 7, Grey data = 2
- % of Red data = 7/(7+2) ~ 78% and % of Grey data ~22%. This means 78% of the data is Red.
- Now just like in Logistic Regression, we can use a threshold value to obtain our prediction.
- A Threshold of 0.5/50% corresponds to picking the most frequent outcome, which would be Red.
- But if we increase that threshold to 0.9/90%, we would predict Grey, as the sketch below shows.
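Here is a minimal sketch of this threshold logic in R (the counts are the ones from Subset 1 above; the function and variable names are made up):
# Counts for Subset 1
n_red <- 7
n_grey <- 2
p_red <- n_red / (n_red + n_grey)   # ~0.78
# Predict Red when the proportion of Red meets the threshold
predict_label <- function(p_red, threshold = 0.5) {
  if (p_red >= threshold) "Red" else "Grey"
}
predict_label(p_red, 0.5)   # "Red":  0.78 >= 0.5
predict_label(p_red, 0.9)   # "Grey": 0.78 < 0.9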
Regression Trees:
Since the output variable is continuous in this case, we report the average of the values at each leaf as the prediction. For example, if a leaf contained the values 3, 4, and 5, we would predict their average, i.e., 4.
Let us see it graphically.

In the above graph:
y = the outcome/target variable, i.e., the variable we are trying to predict
x = the independent variable
First, let's fit a linear regression to this data set. By doing so, we obtain a line.

As is quite evident, linear regression does not do very well on this data set.

However, we notice a very interesting feature: the data lies in three distinct groups. If we draw dividing lines, we see that x is either less than 10, between 10 and 20, or greater than 20.
We recall that Decision Trees can fit this kind of problem easily. So if the splits are at:
- x ≤ 10, the output would be the average of the y values in that group.
- 10 < x ≤ 20, the output would be the average of the y values in that group.
- 20 < x ≤ 30, the output would be the average of the y values in that group.
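As a minimal sketch of this in R, we can bin a hypothetical x at the split points and report the mean of y within each bin, which is exactly what the leaves of a regression tree would predict (the data below is made up to resemble the three groups in the plot):
# Hypothetical data resembling the three groups in the plot
x <- c(2, 5, 8, 12, 15, 18, 22, 26, 29)
y <- c(10, 11, 9, 25, 27, 26, 14, 15, 16)
# Bin x at the split points and compute the average of y in each bin
bins <- cut(x, breaks = c(-Inf, 10, 20, 30))
tapply(y, bins, mean)   # one leaf average per bin: 10, 26, 15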
Measures Used for Split
There are different measures used to decide how the data should be split at each node.
1. Gini Index: It is a measure of the inequality of a distribution. If we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure.
- It works with a categorical target variable, such as "Success" or "Failure."
- It performs only binary splits.
- The lower the value of Gini, the higher the homogeneity.
- CART uses the Gini method to create binary splits.
The Gini measure is calculated as:
Gini = 1 − Σ P(j)^2
where P(j) is the probability of class j and the sum is taken over all classes j.
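As a minimal sketch, the Gini measure can be computed in R from a vector of class labels (the example labels and the function name are made up):
# Gini index: 1 minus the sum of squared class probabilities
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
gini(c("Success", "Success", "Success", "Success"))   # 0   (pure node)
gini(c("Success", "Success", "Failure", "Failure"))   # 0.5 (equally mixed, two classes)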
2. Entropy: Entropy is a way to measure impurity.
Less impure nodes require less information to describe them, and more impure nodes require more information. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided between two classes, its entropy is one.
Entropy = − Σ P(j) log2 P(j)
3. Information Gain: Information Gain is simply a mathematical way to capture the amount of information one gains (or the reduction in randomness) by splitting on a particular attribute.
In a decision tree algorithm, we start at the tree root and split the data on the feature that results in the largest information gain (IG). In other words, IG tells us how important a given attribute is.
The Information Gain (IG) can be defined as follows:
IG = I(D(p)) − (N(left)/N(p)) × I(D(left)) − (N(right)/N(p)) × I(D(right))
where I can be the entropy or the Gini index; D(p), D(left), and D(right) are the datasets of the parent, left child, and right child nodes; and N(p), N(left), and N(right) are the corresponding numbers of observations.
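Here is a minimal sketch of these two quantities in R, using entropy as the impurity measure I (the example parent and child nodes are made up):
# Entropy: -sum(p * log2(p)) over the classes present in the node
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}
# Information gain: parent impurity minus the size-weighted child impurities
info_gain <- function(parent, left, right) {
  n <- length(parent)
  entropy(parent) -
    (length(left) / n) * entropy(left) -
    (length(right) / n) * entropy(right)
}
parent <- c(rep("red", 5), rep("grey", 5))
left   <- c(rep("red", 4), "grey")    # mostly red
right  <- c("red", rep("grey", 4))    # mostly grey
entropy(parent)                       # 1, since the parent is equally divided
info_gain(parent, left, right)        # ~0.28, the information gained by this split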
In R, a parameter that controls the number of splits is minbucket, the minimum number of observations allowed in each leaf. The smaller it is, the more splits will be generated. However, if it is too small, overfitting will occur; and if it is too large, the model will be too simple and accuracy will be poor.
Decision Trees in R
We will be working on the famous Boston housing dataset. The data comes from a paper, "Hedonic Housing Prices and the Demand for Clean Air," exploring the relationship between house prices and clean air in the late 1970s. We are interested in building a model of how prices vary by location across a region, with the aid of trees.
Dataset
We will explore the boston.csv data set with the aid of trees. Download this file from here to follow along. Each entry of the dataset corresponds to a census tract, so the region is covered by many census tracts:
LON and LAT are the longitude and latitude of the center of the census tract.
MEDV is the median value of owner-occupied homes, measured in thousands of dollars.
CRIM is the per capita crime rate.
ZN is related to how much of the land is zoned for large residential properties.
INDUS is the proportion of the area used for industry.
CHAS is 1 if a census tract is next to the Charles River and 0 otherwise.
NOX is the concentration of nitrous oxides in the air, a measure of air pollution.
RM is the average number of rooms per dwelling.
AGE is the proportion of owner-occupied units built before 1940.
DIS is a measure of how far the tract is from centres of employment in Boston.
RAD is a measure of closeness to important highways.
TAX is the property tax per $10,000 of value.
PTRATIO is the pupil to teacher ratio by town.
Here MEDV is the output/target variable, i.e., the house price to be predicted. Since the output variable is continuous, this is an example of a regression tree.
Working
1. Analyzing the Data
- Loading data into R console:
> boston = read.csv('boston.csv')
> str(boston)
'data.frame': 506 obs. of 16 variables:
$ TOWN : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
$ TRACT : int 2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
$ LON : num -71 -71 -70.9 -70.9 -70.9 ...
$ LAT : num 42.3 42.3 42.3 42.3 42.3 ...
$ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
$ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
$ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ RM : num 6.58 6.42 7.18 7 7.15 ...
$ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
$ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
$ TAX : int 296 242 242 222 222 222 311 311 311 311 ...
$ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
There are 506 observations corresponding to 506 census tracts in the Greater Boston area. We are interested in building a model of how prices vary by location across the region. So, let's first see how the points are laid out by plotting the latitude and longitude of each census tract.
# Plot observations
> plot(boston$LON, boston$LAT)

The dense central core of points corresponds to the city of Boston and the other nearby urban areas. Since we also have the Charles River attribute (CHAS), let's also show, in blue, all the tracts that lie along the Charles River. We can do this with the points command.
# Tracts alongside the Charles River
> points(boston$LON[boston$CHAS==1], boston$LAT[boston$CHAS==1], col="blue", pch=19)

Now we have plotted the tracts in Boston that lie along the Charles River.
What other things can we do?
Well, this data set was originally constructed to investigate questions about how air pollution affects prices. The air pollution variable in the data is NOX. Let’s have a look at the distribution of NOX.
# Plot pollution/NOX
> summary(boston$NOX)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3850 0.4490 0.5380 0.5547 0.6240 0.8710
The minimum value is 0.385 and the maximum is 0.871. The median and the mean are about 0.54 and 0.55, so let's use 0.55 as our threshold, as it is roughly the central value.
Let’s look at the tracts that have above-average pollution.
> points(boston$LON[boston$NOX>=0.55], boston$LAT[boston$NOX>=0.55], col="green", pch=20)

All the tracts with above-average pollution are now colored green. This makes sense, since the most densely polluted area is also the most densely populated.
Now let us look at how the prices vary over the area as well. We can do this with the help of the MEDV variable using the same methodology as done when plotting the pollution.
# Plot prices
> plot(boston$LON, boston$LAT)
> summary(boston$MEDV)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 17.02 21.20 22.53 25.00 50.00
> points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)

What we see now are all the census tracts with above-median housing prices in red. The tracts with above-median and below-median prices are mixed in with each other, but there are some patterns. For example, look at the dense black area in the middle: it corresponds to most of the city of Boston, especially its southern parts. So there is definitely some structure to the prices, but it is certainly not a simple function of latitude and longitude.
2. Applying Linear Regression to the problem
Since this is a regression problem (the target to be predicted, the house price, is continuous), it is natural to first try the linear regression algorithm. However, we saw in the last plot that house prices are distributed over the area in an interesting and certainly non-linear way, so we suspect linear regression is not going to work very well here. Let's back up our intuition with facts.
Here we plot the relationship between latitude and house prices, and between longitude and house prices; both look pretty nonlinear.
# Linear Regression using LAT and LON
> plot(boston$LAT, boston$MEDV)
> plot(boston$LON, boston$MEDV)


Linear Regression Model
> latlonlm = lm(MEDV ~ LAT + LON, data=boston)
> summary(latlonlm)
Call:
lm(formula = MEDV ~ LAT + LON, data = boston)
Residuals:
Min 1Q Median 3Q Max
-16.460 -5.590 -1.299 3.695 28.129
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3178.472 484.937 -6.554 1.39e-10 ***
LAT 8.046 6.327 1.272 0.204
LON -40.268 5.184 -7.768 4.50e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.693 on 503 degrees of freedom
Multiple R-squared: 0.1072, Adjusted R-squared: 0.1036
F-statistic: 30.19 on 2 and 503 DF, p-value: 4.159e-13
- R-squared is around 0.1, which is not great.
- The latitude is not significant, which means north-south differences in location are essentially not used at all. This seems unlikely to reflect reality.
- Longitude is significant and negative, which means that house prices decrease linearly as we go towards the east, which also seems unlikely.
Let's see how this linear regression model looks on a plot. We shall plot the census tracts again and mark the tracts with above-median prices with red dots; the red dots show where in Boston houses are actually expensive. We shall then overlay the tracts where the linear regression predicts an above-median price, using blue $ signs.
# Visualize regression output
> plot(boston$LON, boston$LAT)
> points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)
> latlonlm$fitted.values
> points(boston$LON[latlonlm$fitted.values >= 21.2], boston$LAT[latlonlm$fitted.values >= 21.2], col="blue", pch="$")

The linear regression model has plotted a dollar sign for every census tract where it predicts a value above the median. The boundary the linear regression defines is almost a sharp, nearly vertical line, since the latitude variable was not very significant in the regression. The blue $ signs and the red dots do not overlap well, especially in the east.
It turns out that the linear regression model isn't really doing a good job here, and it has completely ignored everything on the right side of the picture.
3. Applying Regression Trees to the problem
We'll first load the rpart library, and also install and load the rpart.plot plotting library.
# Load CART packages
> library(rpart)
# Install the rpart.plot package
> install.packages("rpart.plot")
> library(rpart.plot)
We will build a regression tree in the same way we would build a classification tree, using the rpart command. We will predict MEDV as a function of latitude and longitude, using the boston dataset.
# CART model
> latlontree = rpart(MEDV ~ LAT + LON, data=boston)
# Plot the tree using prp command defined in rpart.plot package
> prp(latlontree)

The leaves of the tree are important.
- In a classification tree, the leaves would be the classification that we assign.
- In regression trees, we instead predict the number. That number here is the average of the median house prices in that bucket.
Now, let us visualize the output. We’ll again plot the points with above-median prices just like in Linear Regression.
# Visualize output
> plot(boston$LON, boston$LAT)
> points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)
The above plot shows the actual known prices; it is the same plot of red dots that we observed before. Now we want to see which tracts the tree predicts to be above the median house price. So we obtain the fitted values with the predict command applied to the tree we just built.
> fittedvalues = predict(latlontree)
> points(boston$LON[fittedvalues>=21.2], boston$LAT[fittedvalues>=21.2], col="blue", pch="$")

Wherever the fitted value is at least 21.2, we plot a blue $ character to signify the price.
The regression tree has done a much better job, and its predictions largely overlap the red dots. It has left out the low-value area in Boston and has correctly managed to classify some of the points in the bottom right and top right.
But the tree we obtained was very complicated and overfitted. How do we avoid overfitting? By increasing the minbucket size. So let's build a new tree using the rpart command again.
Simplifying Tree by increasing minbucket
> latlontree = rpart(MEDV ~ LAT + LON, data=boston, minbucket=50)
> plot(latlontree)
> text(latlontree)

Here we have far fewer splits, and it’s far more interpretable.
We have seen that regression trees can do things we would never expect linear regression to do. Now let's see how regression trees can help us build a predictive model for house prices.
4. Prediction with Regression Trees
We’re going to try to predict house prices using all the variables we have available to us.
Steps:
- Split the data into training and testing sets using the caTools library. We shall then set the seed so our results are reproducible.
# Split the data
> library(caTools)
> set.seed(123)
> split = sample.split(boston$MEDV, SplitRatio = 0.7)
> train = subset(boston, split==TRUE)
> test = subset(boston, split==FALSE)
Our training data is the subset of the Boston data where split is TRUE, and the testing data is the subset of the Boston data where split is FALSE.
- Making a Regression Tree Model
# Create a CART model
> tree = rpart(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data=train)
> prp(tree)

Results:
- Latitude and longitude do not turn out to be important here.
- The number of rooms is the most important split.
- Pollution appears twice, so the model is, in some sense, nonlinear in the amount of pollution: whether it is greater or less than a certain amount leads to different predictions.
- The model is also very nonlinear in the number of rooms.
Regression Tree Predictions
> tree.pred = predict(tree, newdata=test)
> tree.sse = sum((tree.pred - test$MEDV)^2)
> tree.sse
4328.988
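For comparison, we could compute the sum of squared errors of a linear regression model built on the same training set with the same variables. This is only a sketch: the object names lin.reg, lin.reg.pred, and lin.reg.sse are our own, and the exact value will depend on the data split.
# Linear regression on the same training data, evaluated on the test set
> lin.reg = lm(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data=train)
> lin.reg.pred = predict(lin.reg, newdata=test)
> lin.reg.sse = sum((lin.reg.pred - test$MEDV)^2)
> lin.reg.sse
A lower SSE means better predictions on the test set, so this gives a direct way to compare the regression tree with linear regression.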
Conclusion
Even though Decision Trees appear to work very well under certain conditions, they come with their own perils: the model has a very high chance of overfitting. In fact, this is the key challenge with Decision Trees. If no limit is set, in the worst case the tree will end up putting each observation into its own leaf node.
Some techniques help improve the performance of Decision Trees, and we will learn about them in the next article. We will learn to improve the results of our Decision Trees using the "cp" parameter. We will also try to implement cross-validation, a technique to avoid overfitting in predictive models. Finally, we will dive into ensembling, i.e., combining the results of multiple models to solve a given prediction or classification problem.
So stay tuned for the next part.