Parul Pandey

Understanding Decision Trees

My notes on Decision Trees from the course — Analytics Edge

Photo by Fabrice Villard on Unsplash

Introduction

In his book Data Science from Scratch, Joel Grus uses a very interesting example to help his readers understand the concept of Decision Trees. Since the example illustrates the idea so well, I shall quote it here. He says: as children, how many of you remember playing the game of twenty questions? In this game, one child would think of an animal, a place, a famous personality, etc. The others would ask questions to guess it. The game would go something like this:

“I am thinking of an animal.”

“Does it have more than five legs?”

“No”

“Is it delicious?”

“No”

“Does it appear on the back of the Australian 5 cent coin?”

“Yes”

“Is it an echidna?”

“Yes, it is!”

Now let’s draw a slightly more elaborate graph for the “Guess the Animal” game we just played.

Source: Data Science from Scratch: First Principles with Python

This is exactly how we would create a Decision Tree for any data science problem. Now let us study the math behind it in detail.

The following article is primarily the notes I made while taking the course titled Analytics Edge on edX.

What is a Decision Tree?

A Decision Tree is a supervised learning predictive model that uses a set of binary rules to calculate a target value. It is used for either classification (categorical target variable) or regression (continuous target variable). Hence, it is also known as CART (Classification & Regression Trees). Some real-life applications of Decision Trees include:

  • Credit scoring models in which the criteria that cause an applicant to be rejected need to be clearly documented and free from bias
  • Marketing studies of customer behavior such as satisfaction or churn, which will be shared with management or advertising agencies
  • Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression

Structure of a Decision Tree

Decision trees have three main parts:

  • Root Node: The node that performs the first split. In the above “Guess the Animal” example, the root node would be the question “lives in water?”
  • Terminal Nodes/Leaves: Nodes that predict the outcome. Likewise, for the example above, the terminal nodes would be the animals themselves: bull, cow, lion, tiger, etc.
  • Branches: arrows connecting nodes, showing the flow from question to answer.

The root node is the starting point of the tree; the root and the internal nodes contain questions or criteria to be answered, while the terminal nodes hold the predictions. Each question node typically has two or more nodes extending from it. For example, if the question in the first node requires a “yes” or “no” answer, there will be one child node for a “yes” response and another for “no.”

PC: Analytics Vidhya

The Algorithm behind Decision Trees

The decision tree algorithm works by repeatedly partitioning the data into multiple sub-spaces so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning. The result is a set of rules used for predicting the outcome variable, which can be either:

  • a continuous variable, for regression trees
  • a categorical variable, for classification trees

The decision rules generated by the CART (Classification & Regression Trees) predictive model are generally visualized as a binary tree.

Let’s look at an example to understand it better. The plot below shows sample data for two independent variables, x and y, and each data point is colored by the outcome variable, red or grey.

CART tries to split this data into subsets so that each subset is as pure or homogeneous as possible. The first three splits that CART would create are shown here.

If a new observation fell into any of these subsets, its prediction would be decided by the majority outcome of the observations in that particular subset.

Let us now see how a Decision Tree algorithm generates a tree. The tree for the splits we just generated is shown below.

  • The first split tests whether the variable x is less than 60. If yes, the model says to predict red, and if no, the model moves on to the next split.
  • Then, the second split checks whether or not the variable y is less than 20. If no, the model says to predict gray, but if yes, the model moves on to the next split.

  • The third split checks whether or not the variable x is less than 85. If yes, the model says to predict red, and if no, the model says to predict grey.
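
To make the rule concrete, here is a minimal sketch in R that encodes the three splits above as an explicit prediction rule (the split values come from the figure; the function name is ours):

# A minimal sketch: the three splits above written as an explicit rule.
predict_color <- function(x, y) {
  if (x < 60) return("red")    # first split: x < 60 -> red
  if (y >= 20) return("grey")  # second split: y not less than 20 -> grey
  if (x < 85) return("red")    # third split: x < 85 -> red
  "grey"                       # otherwise grey
}

predict_color(50, 10)  # "red"
predict_color(70, 25)  # "grey"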

Advantages of Decision Trees

  • It is quite interpretable and easy to understand.
  • It can also be used to identify the most significant variables in your dataset.

Predictions from Decision Trees

In the above example, we discussed Classification trees, i.e., when the output is a factor/category: red or gray. Trees can also be used for regression where the output at each leaf of the tree is no longer a category but a number. They are called Regression Trees.

Classification Trees:

With Classification Trees, instead of just taking the majority outcome at each leaf of our tree to be the prediction, we can compute the percentage of data of each type of outcome in that subset.

Let us understand it through the same example that we used above.

The above dataset has been split into four subsets.

Predictions for Subset 1:

  • Red data = 7, Grey data = 2
  • % of Red data = 7/(7+2) ≈ 78%, and % of Grey data ≈ 22%. This means 78% of the data in this subset is Red.
  • Now just like in Logistic Regression, we can use a threshold value to obtain our prediction.
  • A Threshold of 0.5/50% corresponds to picking the most frequent outcome, which would be Red.
  • But if we increase that threshold to 0.9/90%, we would predict Grey.
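
As a small sketch of this calculation in R (the leaf contents are the Subset 1 counts from above; the variable names are ours):

# Class proportions at a leaf, turned into a prediction via a threshold.
leaf <- c(rep("Red", 7), rep("Grey", 2))

prop_red <- mean(leaf == "Red")  # 7/9, about 0.78

# With a 0.5 threshold we predict the majority class:
ifelse(prop_red >= 0.5, "Red", "Grey")  # "Red"

# Raising the threshold to 0.9 flips the prediction:
ifelse(prop_red >= 0.9, "Red", "Grey")  # "Grey"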

Regression Trees:

Since in this case the output variable is continuous, we report the average of the values at that leaf to make a prediction. For example, if we had the values 3, 4, and 5 at one of the leaves, we would take the average, i.e., 4.

Let us see it graphically.

In the above graph:

y = the outcome/target variable, i.e., the variable we are trying to predict

x = the independent variable

First, let’s fit a linear regression to this data set. By doing so, we obtain a line.

As is quite evident, linear regression does not do very well on this data set.

However, we notice a very interesting feature: the data lies in three different groups. If we draw lines here, we see that x is either less than 10, between 10 and 20, or greater than 20.

We recall that Decision Trees can fit this kind of problem easily (a quick sketch follows the list below). So if the splits are at:

  • x ≤ 10, the output would be the average of the y values in that group.
  • 10 < x ≤ 20, the output would be the average of the y values in that group.
  • 20 < x ≤ 30, the output would be the average of the y values in that group.
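
A minimal sketch of this comparison in R, on synthetic data with three flat groups (the data values and group means are made up purely for illustration):

library(rpart)

# Three flat groups of y values, as in the graph above.
set.seed(1)
x <- runif(90, 0, 30)
y <- ifelse(x <= 10, 5, ifelse(x <= 20, 15, 2)) + rnorm(90, sd = 0.5)
df <- data.frame(x, y)

linmod  <- lm(y ~ x, data = df)     # a single straight line: fits poorly
treemod <- rpart(y ~ x, data = df)  # piecewise-constant leaf averages

# The tree's predictions recover (roughly) the three group averages:
predict(treemod, data.frame(x = c(5, 15, 25)))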

Measures Used for Split

There are different measures used to decide where the splits are made.

  1. Gini Index: It is a measure of the inequality of a distribution. It says that if we select two items from a population at random, the probability that they are of the same class is 1 if the population is pure.
  • It works with a categorical target variable, e.g., “Success” or “Failure.”
  • It performs only binary splits.
  • The lower the value of Gini, the higher the homogeneity.
  • CART uses the Gini method to create binary splits.

The Gini measure for a node is calculated as

Gini = 1 − Σ_j P(j)²

where P(j) is the probability of class j at that node.
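
A one-function sketch of this measure in R (the function name is ours):

# Gini impurity of a vector of class labels: 1 - sum of squared class shares.
gini <- function(labels) {
  p <- table(labels) / length(labels)  # P(j) for each observed class j
  1 - sum(p^2)
}

gini(c("Red", "Red", "Red"))           # 0   -> pure node
gini(c("Red", "Red", "Grey", "Grey"))  # 0.5 -> maximally mixed (two classes)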

2. Entropy: Entropy is another way to measure impurity. For a node it is defined as

Entropy = − Σ_j P(j) log₂ P(j)

Less impure nodes require less information to describe them, and more impure nodes require more information. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided between two classes, it has an entropy of one.

3. Information Gain: Information Gain is simply a mathematical way to capture the amount of information one gains(or reduction in randomness) by picking a particular attribute.

In a decision algorithm, we start at the tree root and split the data on the feature that results in the largest information gain (IG). In other words, IG tells us how important a given attribute is.

The Information Gain (IG) of a split can be defined as follows:

IG = I(D(p)) − N(left)/N(p) · I(D(left)) − N(right)/N(p) · I(D(right))

where I can be entropy or the Gini index; D(p), D(left), and D(right) are the datasets of the parent, left, and right child nodes; and N(p), N(left), and N(right) are the numbers of samples in each.
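
A short sketch of this in R, reusing the gini() helper defined above (the function names are ours):

# Entropy of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of a split: parent impurity minus the
# sample-weighted impurities of the two children.
info_gain <- function(parent, left, right, impurity = entropy) {
  n <- length(parent)
  impurity(parent) -
    (length(left)  / n) * impurity(left) -
    (length(right) / n) * impurity(right)
}

parent <- c(rep("Red", 7), rep("Grey", 5))
info_gain(parent, parent[1:7], parent[8:12])  # a perfect split: maximal gain
info_gain(parent, parent[c(1:4, 8)], parent[c(5:7, 9:12)], impurity = gini)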

In R, a parameter that controls this is minbucket, the minimum number of observations allowed in each leaf. The smaller it is, the more splits will be generated. However, if it is too small, overfitting will occur; and if it is too large, the model will be too simple and its accuracy will be poor.

Decision Trees in R

We will be working on the famous Boston housing dataset. This data comes from the paper “Hedonic Housing Prices and the Demand for Clean Air,” which explored the relationship between house prices and clean air in the late 1970s. We are interested in building a model of how prices vary by location across a region.

Dataset

We will explore the boston.csv data set with the aid of trees. Download this file from here to follow along. Each entry of the dataset corresponds to a census tract, and the variables are:

LON and LAT are the longitude and latitude of the center of the census tract.
MEDV is the median value of owner-occupied homes, measured
in thousands of dollars.
CRIM is the per capita crime rate.
ZN is related to how much of the land is zoned for large residential properties.
INDUS is the proportion of the area used for industry.
CHAS is 1 if a census tract is next to the Charles River, else 0.
NOX is the concentration of nitrous oxides in the air, a measure of air pollution.
RM is the average number of rooms per dwelling.
AGE is the proportion of owner-occupied units built before 1940.
DIS is a measure of how far the tract is from centres of employment in Boston.
RAD is a measure of closeness to important highways.
TAX is the property tax per $10,000 of value.
PTRATIO is the pupil to teacher ratio by town.

Here MEDV is the output/target variable, i.e., the price of the house to be predicted. Since the output variable is continuous, this is a regression problem, and we will be building a regression tree.

Working

1. Analyzing the Data

  • Loading data into R console:
> boston = read.csv('boston.csv')
> str(boston)
'data.frame': 506 obs. of  16 variables:
 $ TOWN   : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
 $ TRACT  : int  2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
 $ LON    : num  -71 -71 -70.9 -70.9 -70.9 ...
 $ LAT    : num  42.3 42.3 42.3 42.3 42.3 ...
 $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
 $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ RM     : num  6.58 6.42 7.18 7 7.15 ...
 $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ TAX    : int  296 242 242 222 222 222 311 311 311 311 ...
 $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

There are 506 observations corresponding to 506 census tracts in the Greater Boston area. We are interested in building a model of how prices vary by location across a region. So, let’s first see how the points are laid out: using the plot command, we can plot the latitude and longitude of each of our census tracts.

# Plot observations
> plot(boston$LON, boston$LAT)

The dense central core of points corresponds to the city of Boston and other urban centers. Since we also have the Charles River attribute (CHAS), we want to show all the points that lie along the Charles River in blue. We can do this with the points command.

# Tracts alongside the Charles River
> points(boston$LON[boston$CHAS==1], boston$LAT[boston$CHAS==1], col="blue", pch=19)

Now we have plotted the tracts in Boston that lie along the Charles River.

What other things can we do?

Well, this data set was originally constructed to investigate questions about how air pollution affects prices. The air pollution variable in the data is NOX. Let’s have a look at the distribution of NOX.

# Plot pollution/NOX 
> summary(boston$NOX)
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3850  0.4490  0.5380  0.5547  0.6240  0.8710

The minimum value is 0.385, and the maximum value is 0.871. The median and the mean are about 0.54 and 0.55, respectively. So let’s use 0.55 as our cutoff, as it is the central value.

Let’s look at the tracts that have above-average pollution.

> points(boston$LON[boston$NOX>=0.55], boston$LAT[boston$NOX>=0.55], col="green", pch=20)

All the points that have above-average pollution are colored green. This makes sense, since the most densely polluted area is also the most densely populated one.

Now let us look at how the prices vary over the area as well. We can do this with the help of the MEDV variable, using the same methodology we used when plotting the pollution.

# Plot prices
> plot(boston$LON, boston$LAT)
> summary(boston$MEDV)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.00   17.02   21.20   22.53   25.00   50.00
> points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)
Tracts with above-median house prices

So what we see now are all the census tracts with above-average housing prices in red. The above-average and below-average census tracts are mixed in among each other, but there are some patterns. For example, look at that dense black patch in the middle: it corresponds to most of the city of Boston, especially its southern parts. So there’s definitely some structure to the prices, but it’s certainly not a simple function of latitude and longitude.

2. Applying Linear Regression to the problem

Since this is a regression problem (the target value to be predicted, the house price, is continuous), it is natural to first try the Linear Regression algorithm. But we saw in the last graph that house prices are distributed over the area in an interesting and decidedly nonlinear way, so we suspect Linear Regression is not going to work very well here. Let's back up our intuition with facts.

Here we plot the relationship between latitude and house prices and between longitude and house prices; both look pretty nonlinear.

# Linear Regression using LAT and LON
> plot(boston$LAT, boston$MEDV)
> plot(boston$LON, boston$MEDV)

Linear Regression Model

> latlonlm = lm(MEDV ~ LAT + LON, data=boston)
> summary(latlonlm)
Call:
lm(formula = MEDV ~ LAT + LON, data = boston)
Residuals:
    Min      1Q  Median      3Q     Max 
-16.460  -5.590  -1.299   3.695  28.129
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3178.472    484.937  -6.554 1.39e-10 ***
LAT             8.046      6.327   1.272    0.204    
LON           -40.268      5.184  -7.768 4.50e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.693 on 503 degrees of freedom
Multiple R-squared:  0.1072, Adjusted R-squared:  0.1036 
F-statistic: 30.19 on 2 and 503 DF,  p-value: 4.159e-13
  • R-squared is around 0.1, which is not great.
  • Latitude is not significant, which means north-south location differences aren’t really being used at all. This seems unlikely to be right.
  • Longitude is significant but negative, which means that house prices decrease linearly as we go east, which is also unlikely.

Let’s see how this linear regression model looks on a plot. We shall plot the census tracts again and mark the above-median house prices with bright red dots; the red dots show the actual locations in Boston where houses are expensive. We shall then overlay the Linear Regression predictions as blue $ signs.

# Visualize regression output
> plot(boston$LON, boston$LAT)
> points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)
> latlonlm$fitted.values
> points(boston$LON[latlonlm$fitted.values >= 21.2], boston$LAT[latlonlm$fitted.values >= 21.2], col="blue", pch="$")

The linear regression model has plotted a dollar sign every time it thinks the census tract is above the median value. It’s almost a sharp line that the linear regression defines. Also, the shape is almost vertical since the latitude variable was not very significant in the regression. The blue $ and the red dots do not overlap, especially in the east.

It turns out that the linear regression model isn’t really doing a good job: it has completely ignored everything on the right side of the picture. So that’s interesting, and pretty wrong.
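
One way to quantify this mismatch (a quick sketch, reusing the boston and latlonlm objects from above) is to cross-tabulate actual versus predicted above-median tracts:

# Confusion table: actual above-median tracts vs. the model's fitted values.
actual    <- boston$MEDV >= 21.2
predicted <- latlonlm$fitted.values >= 21.2
table(actual, predicted)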

3. Applying Regression Trees to the problem

We’ll first load the rpart library, and also install and load the rpart.plot plotting library.

# Load CART packages
> library(rpart)
# install rpart package
> install.packages("rpart.plot")
> library(rpart.plot)

We build a regression tree in the same way we would build a classification tree, using the rpart command. We predict MEDV as a function of latitude and longitude, using the boston dataset.

# CART model
> latlontree = rpart(MEDV ~ LAT + LON, data=boston)
# Plot the tree using prp command defined in rpart.plot package
> prp(latlontree)
Regression Tree

The leaves of the tree are important.

  • In a classification tree, the leaves would be the classification that we assign.
  • In regression trees, we instead predict a number: here, the average of the median house prices in that bucket.

Now, let us visualize the output. We’ll again plot the points with above-median prices just like in Linear Regression.

# Visualize output
> plot(boston$LON, boston$LAT)
> points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)

The above plot shows the actual known prices; it is the same plot with red dots that we observed before. Now we want to see what the tree thinks is above the median house price, so we compute fitted values using the predict command on the tree we just built.

> fittedvalues = predict(latlontree)
> points(boston$LON[fittedvalues>=21.2], boston$LAT[fittedvalues>=21.2], col="blue", pch="$")

Wherever the fitted value is at least 21.2, we plot a blue $ character to signify price.

The regression tree has done a much better job: its predictions largely overlap the red dots. It has left out the low-value area in Boston, and has correctly managed to classify some of those points in the bottom right and top right.

But the tree obtained was very complicated and overfitted. How do we avoid overfitting? By increasing the minbucket size. So let’s build a new tree using the rpart command again.

Simplifying Tree by increasing minbucket

> latlontree = rpart(MEDV ~ LAT + LON, data=boston, minbucket=50)
> plot(latlontree)
> text(latlontree)
tree with minbucket = 50

Here we have far fewer splits, and it’s far more interpretable.

We have seen that regression trees can do things we would never expect linear regression to do. Now let’s see how regression trees can help us build a predictive model for house prices.

4. Prediction with Regression Trees

We’re going to try to predict house prices using all the variables we have available to us.

Steps:

  • Set a seed, so our results are reproducible, and split the data into a Training and a Testing set using the caTools library.
# Split the data
> library(caTools)
> set.seed(123)
> split = sample.split(boston$MEDV, SplitRatio = 0.7)
> train = subset(boston, split==TRUE)
> test = subset(boston, split==FALSE)

Our training data is the subset of the Boston data where split is TRUE, and the testing data is the subset where split is FALSE.

  • Making a Regression Tree Model
# Create a CART model
> tree = rpart(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data=train)
> prp(tree)

Results:

  • Latitude and Longitude aren’t significant
  • The rooms are the most important split.
  • Pollution appears in there twice, so it’s, in some sense, nonlinear on the amount of pollution, i.e., if it’s greater than a certain amount or less than a certain amount, it does different things.
  • Very nonlinear on the number of rooms.
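
To back these observations up with numbers, we can inspect the importance score that rpart records for each variable (a quick sketch, using the tree object we just built):

# Named vector of variables, sorted by their contribution to the splits.
tree$variable.importance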

Regression Tree Predictions

> tree.pred = predict(tree, newdata=test)
> tree.sse = sum((tree.pred - test$MEDV)^2)
> tree.sse
4328.988
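
For comparison, we could compute the out-of-sample SSE of a linear regression trained on the same split (a sketch, not part of the original course output; the object names are ours):

# Linear regression on the same predictors and training set.
lin.reg  <- lm(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM +
               AGE + DIS + RAD + TAX + PTRATIO, data = train)
lin.pred <- predict(lin.reg, newdata = test)
lin.sse  <- sum((lin.pred - test$MEDV)^2)
lin.sse  # compare with tree.sse above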

Conclusion

Even though Decision Trees appear to work very well under certain conditions, they come with their own perils: the model has a very high chance of “over-fitting.” In fact, this is the key challenge with Decision Trees. If no limit is set, in the worst case the tree will end up putting each observation into its own leaf node.

Some techniques help improve the performance of Decision Trees, and we will learn about them in the next article. We will learn to improve the results of our Decision Trees using the “cp” parameter. Also, we will implement Cross-Validation, a technique to avoid overfitting in predictive models. Finally, we will dive into Ensembling, i.e., combining the results of multiple models to solve a given prediction or classification problem.

So stay tuned for the next part.

Machine Learning
Data Science
R Language
Decision Tree