Random Forests — An Intuitive Understanding

It is intriguing to see how simply and easily a Random Forest can yield extremely useful results. Random Forest is a Supervised Machine Learning algorithm. Examples of other supervised learning algorithms are Linear Regression, Logistic Regression and Neural Networks.

Random Forests are essentially an ensemble of Decision Trees.

Decision trees are

  • easy to create,
  • easy to implement and
  • easy to interpret

But on their own they rarely prove very accurate in practice. Nevertheless, to understand Random Forests one must know the basic intuition behind Decision Trees.

In its simplest form a Decision tree is a sequence of choices.

A simple decision tree — Haroldsplanet.com

When it comes to prediction accuracy, single Decision Trees are often quite inaccurate. Even one misstep in the choice of the next node can lead you to a completely different end. Choosing the right branch instead of the left could take you to the far end of the tree. You would be off by a huge margin!

They work great with the data used to create them, a.k.a. the Training Set, but they often perform poorly on previously unseen data (in other words, they overfit). This new data is generally called the Validation Set or the Test Set.

A huge number of decision trees, created randomly from the input data, make up the Random Forest. The randomness of creation combined with the simplicity of a Decision Tree lends Random Forests their awesomeness!

Random Forests = Simplicity of DT + Accuracy Through Randomness

Random Forest : The woods are lovely, dark and deep, But I have promises to keep, And miles to go before I sleep.

Random forests (Breiman, 2001) were originally conceived as a method of combining several CART (Classification And Regression Trees) (Breiman et al.,1984) style decision trees using bagging (Breiman, 1996). In the years since their introduction, random forests have grown from a single algorithm to an entire framework of models (Criminisi et al., 2011), and have been applied to great effect in a wide variety of fields — Computer Vision, Medical Image Analysis, Drug Discovery, Chemoinformatics etc.

Without going into further praise of Random Forests, let’s see how to work with them:

  • How to build a RF?
  • Why does it work?
  • How to test a RF?

1) Building a Random Forest

The Random Forest algorithm is a classifier based primarily on two methods. We will focus mainly on the first one, Bagging.

  • Bootstrapping/Bagging
  • Random subspace method.

Step 1: Bootstrapping/Bagging. To begin building a random forest we first need to create “Bootstrapped Datasets” from our original dataset. Bootstrapping is a statistical technique for drawing samples randomly from our dataset. In doing so we are allowed to pick the same sample twice, thrice or any number of times. In statistical jargon this is known as sampling with replacement.

Generally, if there are S trees in the Random Forest, we create S Bootstrapped Datasets.

Source: Wikipedia

Example

Let’s say we have Olympics 2020 coming up. Our Random Forest has to predict whether a player will win a medal this year or not. It takes the following 4 variables into account:

  1. Has the player won Gold Medal previously?
  2. Has the player had any recent injury?
  3. Is there sponsorship available for the player?
  4. Player’s current world rank.

Also, for example’s sake, let’s say our dataset has only 4 players. Each row is a record/datapoint. In this dataset the players ranked 49 and 50 are recorded as not winning any medal; only the players ranked 2 and 7 are recorded as winners.

Our Sweet Little Original Dataset

Now let’s create the bootstrapped dataset as step 1 of understanding what Random Forests are!

Let’s begin by choosing one row from the above table randomly. We end up picking row 4. It goes as row/record 1 in the Bootstrapped Table.

1st Sample: Pick Up the 4th Data Row from original dataset

Next we pick row 1 from the original table. Note that the bootstrapped dataset has the same number of datapoints/rows as the original dataset.

2nd Sample: Randomly chose row 1 from original dataset

Two more samples later we have our bootstrapped dataset. Note that row 1 is picked twice at random (it goes in as rows 2 and 4 of the bootstrapped table) and row 2 is never picked!

Random sampling ‘with replacement’ causes Row 1 from Orig. Dataset to be picked twice.
The whole process!
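
The sampling walk-through above can be sketched in a few lines of Python. This is only a minimal illustration, not the article's own code: the function name `bootstrap_indices` and the fixed seed are assumptions made for the example, and each row is represented just by its row number (1 to 4).

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row numbers (1-based) at random, with replacement."""
    return [rng.randint(1, n_rows) for _ in range(n_rows)]

rng = random.Random(42)  # fixed seed (an assumption) only so the run is repeatable
sample = bootstrap_indices(4, rng)

# Rows that were never drawn are left out of this bootstrapped dataset.
left_out = sorted(set(range(1, 5)) - set(sample))

print("bootstrapped rows:", sample)
print("rows never picked:", left_out)
```

Because every draw is made from the full table, the same row number can show up more than once, and some row numbers may not show up at all.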

Step 2: Create a Decision Tree. Once we have a bootstrapped dataset we create a decision tree from it. Remember, the most important thing while creating a decision tree is knowing which question is the best one to ask. As explained in the Decision Tree article, Gini Impurity gives a measure of which question separates the data best.

  • Select a random subset (=say 2) of the 4 variables (Previous Gold, Injury, Sponsorship status and world ranking) as candidates for the root node.
  • Compute the Gini Impurity of each candidate and pick the better of the 2, i.e. the candidate that best separates the data.
  • For the next node, again select a random subset of the remaining 3 variables and repeat the process until no more variables are left. In jargon terms, picking a random subset of variables like this is called the random subspace method (also known as feature bagging).
Step 2 in play
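
The node-selection steps above could be sketched as follows. Treat this as a rough illustration under stated assumptions: the four rows are hypothetical stand-ins for the table in the figure (only the medal outcomes come from the article), world ranking is simplified to a yes/no `top_10` flag, and all function names are made up for this example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature):
    """Weighted Gini impurity of the two groups produced by a yes/no feature."""
    yes_side = [y for row, y in zip(rows, labels) if row[feature]]
    no_side = [y for row, y in zip(rows, labels) if not row[feature]]
    n = len(labels)
    return len(yes_side) / n * gini(yes_side) + len(no_side) / n * gini(no_side)

def best_candidate(rows, labels, features, k, rng):
    """Pick k features at random and return the one with the lowest impurity."""
    candidates = rng.sample(features, k)
    return min(candidates, key=lambda f: split_impurity(rows, labels, f))

# Hypothetical feature values; the real ones are in the article's figure.
rows = [
    {"prev_gold": True,  "injury": False, "sponsor": True,  "top_10": True},
    {"prev_gold": False, "injury": False, "sponsor": True,  "top_10": True},
    {"prev_gold": False, "injury": True,  "sponsor": False, "top_10": False},
    {"prev_gold": True,  "injury": True,  "sponsor": False, "top_10": False},
]
labels = ["medal", "medal", "no medal", "no medal"]
features = ["prev_gold", "injury", "sponsor", "top_10"]

root = best_candidate(rows, labels, features, k=2, rng=random.Random(0))
print("root node asks about:", root)
```

A lower weighted impurity means the question separates medal winners from non-winners more cleanly, which is why `min` is used to choose between the candidates.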

Step 3: Rinse and Repeat. Once you have 1 decision tree, start the whole process all over again. By creating a new bootstrapped dataset and considering only a subset of variables at each step, you can come up with a wide variety of Decision Trees. Keep repeating steps 1 and 2. Ideally you would need 100s of trees, but for illustration we show just 6 trees below. Note that each tree comes from a different bootstrapped dataset.

Create trees by considering random subsets of columns at each node
  • This randomness in the tree creation process is what lends flexibility and high accuracy to Random Forest
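
To make the repetition concrete, here is a simplified sketch of steps 1 to 3 together. It is not a full Random Forest: each "tree" asks only one question (a decision stump) instead of growing further, the rows are hypothetical stand-ins for the figure's table, and every name in the code is an assumption made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(sample, feature):
    """Weighted Gini impurity after asking one yes/no question."""
    total = 0.0
    for answer in (True, False):
        side = [label for row, label in sample if row[feature] == answer]
        total += len(side) / len(sample) * gini(side)
    return total

def build_tree(data, features, k, rng):
    """Steps 1 and 2 for a single one-question tree (a 'stump')."""
    sample = [rng.choice(data) for _ in data]      # Step 1: bootstrap the rows
    candidates = rng.sample(features, k)           # Step 2: random subset of variables
    best = min(candidates, key=lambda f: split_impurity(sample, f))
    return {"question": best, "sample": sample}

def build_forest(data, features, n_trees, k, seed=0):
    """Step 3: repeat steps 1 and 2 once per tree."""
    rng = random.Random(seed)
    return [build_tree(data, features, k, rng) for _ in range(n_trees)]

# Hypothetical (row, outcome) pairs; only the outcomes follow the article.
data = [
    ({"prev_gold": True,  "injury": False, "sponsor": True,  "top_10": True},  "medal"),
    ({"prev_gold": False, "injury": False, "sponsor": True,  "top_10": True},  "medal"),
    ({"prev_gold": False, "injury": True,  "sponsor": False, "top_10": False}, "no medal"),
    ({"prev_gold": True,  "injury": True,  "sponsor": False, "top_10": False}, "no medal"),
]
features = ["prev_gold", "injury", "sponsor", "top_10"]

forest = build_forest(data, features, n_trees=6, k=2)
for number, tree in enumerate(forest, start=1):
    print(f"tree {number} asks about: {tree['question']}")
```

Since each tree gets its own bootstrapped sample and its own random pair of candidate variables, different trees can end up asking different questions even though they all start from the same 4 rows.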

2) Why does it work?

Consider a company that is hiring for the position of CTO. A few people are assigned the role of interviewing the candidates, according to their own strengths.

  • CEO — Judges the leadership qualities
  • Tech. Director — Judges the technological awareness
  • HR Director — Judges the personality and cultural fit
  • Board Members — Judge the managerial qualities
  • A few star employees — Judge depth of knowledge in specific segments

Each of these interviewers has their own strengths, acquired over many years of experience. But these strengths also become biases! Each interviewer is inclined to choose a candidate who is strong in the interviewer’s own area of expertise.

If it were up to just the Star Employees, they would end up hiring a CTO who works great as an independent contributor but might not be a visionary leader (which was the CEO’s role to judge).

Random Forest works similarly. Each tree is an interviewer. Each tree is biased to give results favouring a particular variable. Their combined results, however, turn out to be pretty accurate. The whole company is the Random Forest, and it churns out more accurate results than an average employee, a.k.a. a Decision Tree.

Going back to the example dataset used before, let’s say a new player registers for the race. His variables are as follows:

  • Previous Gold Winner : No
  • Any Recent Injury : No
  • Sponsorship Available : No
  • World Ranking : 7

The question is: will the player win a medal in this race?

Let’s see what the Random Forest predicts, and how.

Yes the player will win a medal. The Random Forest predicts a victory for this player.

The process given above is called Bagging (short for Bootstrap Aggregating).

Bootstrapping the data & using an aggregate of ‘individual results’ to obtain the ‘final result’ is called Bagging
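
The aggregation half of Bagging is just a majority vote. A minimal sketch, assuming six hypothetical tree votes for the new player (the actual votes are shown only in the animation, so these values are invented for illustration):

```python
from collections import Counter

def forest_predict(votes):
    """Return the class that the largest number of trees voted for."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from 6 trees for the new player.
votes = ["medal", "medal", "no medal", "medal", "medal", "no medal"]

print(forest_predict(votes))  # → medal
```

With an even number of trees a tie is possible; in this sketch a tie goes to whichever class was seen first, which is one reason an odd number of trees is a convenient choice for two-class problems.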

3) How do we test the Random Forest?

How do we know if the results that our random forest gives us are correct? How do we trust and test it?

We use a subset of our original data to check if the Random Forest’s predictions match with the actual data.

With most Machine Learning algorithms we divide our whole dataset into a Training Set (say 90%) and a Test Set (say 10%). The models are trained on the Training Data and the Test Data is never shown to them. Once a model is trained, its real accuracy is measured on the Test Set.

Trivia

In a lot of cases we also use a Validation Set. This is used to fine-tune the settings (hyperparameters) of the model. The split in this case is typically Training Set 60%, Validation Set 20%, Test Set 20%.

In Random Forest we don’t need to set aside some data for a test set. Thanks to bootstrapping we automatically get a subset of the original data that can be used for testing!

Remember, while creating the bootstrapped dataset some rows are never chosen! They are left out because during the bootstrapping process we sample with replacement, which lets the same data point be included multiple times. E.g. we let row 1 be included twice in the bootstrapped dataset, and as a result row 2 was left out completely. This left-out data forms our test set. And it has a special name too: the Out-Of-Bag-Dataset.

Each bootstrap sample leaves out about 37% of the examples — Leo Breiman
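
Where does the 37% come from? A short derivation, assuming each of the N draws picks every row with equal probability 1/N:

```latex
P(\text{row } i \text{ is missed in one draw}) = 1 - \frac{1}{N}
\quad\Longrightarrow\quad
P(\text{row } i \text{ is never drawn in } N \text{ draws})
  = \left(1 - \frac{1}{N}\right)^{N}
  \xrightarrow[\,N \to \infty\,]{} e^{-1} \approx 0.368
```

For our tiny 4-row dataset the same formula gives (3/4)^4 = 81/256 ≈ 0.316, so the 37% figure is the limit for large datasets rather than an exact value for small ones.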

A study by Leo Breiman, a distinguished statistician at the University of California, Berkeley, gave empirical evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.

  • These left-out rows in our original dataset form the “Out-Of-Bag-Dataset”.
  • Since we already know the true outcome for each Out-Of-Bag row, we can run it through the trees that never saw it and see how many examples are predicted correctly.

Testing the RF

Let’s say

  • The total number of datapoints in our original dataset = N
  • Number of points in Bootstrapped Dataset = N
  • Number of unique points in Bootstrapped Dataset = 70% of N (for easy arithmetic we assume the remaining 30% are repeats, so 30% of the original rows are left out; the actual figure is usually close to 37%)

Therefore, the size of Out-Of-Bag-Dataset = 30% of N. Let’s call this quantity M.

Now the procedure for testing the accuracy of our Random Forest is as follows:

  • Run each of these M datapoints through our Random Forest.
  • Each tree casts its vote for every data point, just as in the prediction example above.
  • For each data point the Random Forest will predict the result based on majority voting of the trees.
  • Let’s say the number of mistakes that our Random Forest makes = C
  • So the number of correct predictions = M-C
  • Therefore, accuracy % of our Random Forest = 100*(M-C)/M

As an example let’s say our original data had 15 samples. So N = 15. Therefore M = 30% of 15 = 4.5, which we round up to 5. Let’s say number of mistakes C = 1.

=> Accuracy = 100*(5-1)/5 = 80%

The number of mistakes C that our Random Forest makes on the Out-Of-Bag-Dataset gives us the Out-Of-Bag-Error. It is usually reported as the fraction C/M, which here is 1/5 = 20%.
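
The accuracy formula above can be checked with a few lines of Python. This is a small sketch: the function name and the five label pairs are hypothetical, chosen so that M = 5 and C = 1 as in the worked example.

```python
def oob_accuracy_percent(true_labels, predicted_labels):
    """Accuracy % = 100 * (M - C) / M over the out-of-bag rows."""
    m = len(true_labels)  # M: number of out-of-bag rows
    c = sum(t != p for t, p in zip(true_labels, predicted_labels))  # C: mistakes
    return 100 * (m - c) / m

# Hypothetical outcomes for M = 5 out-of-bag rows, with C = 1 mistake (third row).
truth     = ["medal", "no medal", "no medal", "medal", "no medal"]
predicted = ["medal", "no medal", "medal",    "medal", "no medal"]

print(oob_accuracy_percent(truth, predicted))  # → 80.0
```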

As an example, say the Out-Of-Bag-Dataset has 5 data rows. The Random Forest makes a prediction for each of them, and at the end we calculate its accuracy.

Random Forest is put to test!

TL;DR

The Random Forest algorithm is a classifier based primarily on two methods:

  • Bagging
  • Random subspace method.

We explain the bagging method in this article and show how to compute a Random Forest’s accuracy.

Suppose we decide to have S number of trees in our forest then we first create S datasets of "same size as original" created from random resampling of data in T with-replacement (n times for each dataset). This will result in {T1, T2, ... TS} datasets. Each of these is called a bootstrap dataset. Due to "with-replacement" every dataset Ti can have duplicate data records and Ti can be missing several data records from original datasets. This is called Bootstrapping.

After creating the classifiers (S trees), for each (Xi,yi) in the original training set i.e. T, select all Tk which do not include (Xi,yi). This subset, pay attention, is a set of bootstrap datasets which do not contain a particular record from the original dataset. This set is called out-of-bag examples.

StackOverFlow
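
The quoted procedure could be sketched like this. It is only an illustration under assumptions: the three "trees" are hand-written rules rather than trained models, the row order and bag contents are invented, and the names `oob_predictions` and `trees` are hypothetical. Only the ranks (2, 7, 49, 50) and outcomes follow the article's dataset.

```python
from collections import Counter

def oob_predictions(n_rows, trees):
    """For each row, take a majority vote among the trees whose bag lacks that row."""
    predictions = {}
    for i in range(n_rows):
        votes = [predict(i) for in_bag, predict in trees if i not in in_bag]
        if votes:  # a row that sits in every bag gets no out-of-bag prediction
            predictions[i] = Counter(votes).most_common(1)[0][0]
    return predictions

ranks = [2, 7, 49, 50]                              # world rankings from the article
labels = ["medal", "medal", "no medal", "no medal"]  # known outcomes

# Each tree = (set of row indices in its bootstrapped dataset, prediction rule).
trees = [
    ({0, 3},    lambda i: "medal" if ranks[i] <= 10 else "no medal"),
    ({0, 1, 2}, lambda i: "medal" if ranks[i] <= 5 else "no medal"),
    ({1, 2, 3}, lambda i: "no medal"),
]

predictions = oob_predictions(len(labels), trees)
mistakes = sum(1 for i, guess in predictions.items() if guess != labels[i])

print("out-of-bag predictions:", predictions)
print("mistakes:", mistakes, "out of", len(predictions))
```

The key design point is the `if i not in in_bag` filter: a row is only ever judged by trees that did not see it during training, which is what lets the out-of-bag rows play the role of a test set.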

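References

  • http://proceedings.mlr.press/v32/denil14.pdf
  • https://www.youtube.com/watch?v=J4Wdy0Wc_xQ
  • https://stackoverflow.com/users/83602/manoj-awasthi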

X8 aims to organize and build a community for AI that not only is open source but also looks at the ethical and political aspects of it. More such simplified AI concepts will follow. If you liked this or have some feedback or follow-up questions please comment below.

Thanks for Reading!
