Dehan Chia

K-Nearest-Neighbors in 6 steps

With scikit-learn in python

This aims to be an applied guide to using the K-Nearest-Neighbors (KNN) method to solve business problems in Python. The most popular use case for KNN is classification. Interestingly, though, it can be applied to regression as well.

Photo by fabio on Unsplash

The Concept

Let’s begin with the foundations of KNN classifier models. A KNN classifier works in 3 broad steps to predict labels for unseen feature values (ones that are not in the training data).

  1. It memorizes the whole training set: specifically, which feature values resulted in which y label.
  2. It finds the K nearest (most similar) training instances, where K is a user-defined integer. For a given data point, it looks at the closest feature vectors and their respective labels.
  3. It predicts the new label as a function of the nearest neighbors’ labels. Usually, this is a majority vote.
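
The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not how scikit-learn implements it: the function name knn_predict, the Euclidean distance and the toy data are all my own assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Step 1: "memorizing" simply means keeping train_X and train_y around.
    # Step 2: rank the training points by Euclidean distance to the query.
    ranked = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in ranked[:k]]
    # Step 3: majority vote among the k nearest labels.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up toy data: two features per point, two classes.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # query near the "a" cluster
print(knn_predict(train_X, train_y, (4.9, 5.0), k=3))  # query near the "b" cluster
```

Note that there is no real training phase here: all of the work happens at prediction time, which is why KNN is often called a lazy learner.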

Circling back to KNN regression: the difference is that a KNN regression model predicts values from a continuous distribution for unseen feature values. Conceptually, it arrives at the predicted value in much the same way as a KNN classifier, except that it takes the average value of its K nearest neighbors.
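
To make the regression variant concrete, here is a small plain-Python sketch. The function name knn_regress and the numbers are hypothetical, and I am assuming an unweighted mean over the K nearest neighbors.

```python
import math

def knn_regress(train_X, train_y, query, k):
    """Predict a continuous value as the mean target of the k nearest training points."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    nearest_targets = [target for _, target in ranked[:k]]
    return sum(nearest_targets) / k

# Made-up one-feature data with a continuous target.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

# The three nearest neighbors of 2.1 are x = 2.0, 3.0 and 1.0,
# so the prediction is the mean of 20, 30 and 10.
prediction = knn_regress(train_X, train_y, (2.1,), k=3)
print(prediction)
```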

K-Nearest-Neighbors Classifier

The packages

Let’s first import the required packages:

  1. numpy and pandas: data and array manipulation in Python
  2. the pyplot module from the matplotlib library: data visualisation
  3. sklearn modules for creating train-test splits, scaling features, and creating the KNN object

# Packages
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

The data

The data set has 59 rows and 7 columns, with the first 10 rows shown below. To keep things simple, we’ll not use all the columns; our aim is to use height, width, mass and color_score to predict the label, fruit_label.

# import data
fruits = pd.read_table('readonly/fruit_data_with_colors.txt')
feature_names_fruits = ['height', 'width', 'mass', 'color_score'] # x variable names
X_fruits = fruits[feature_names_fruits] # feature columns
y_fruits = fruits['fruit_label'] # label column
target_names_fruits = ['apple', 'mandarin', 'orange', 'lemon'] # potential classes
fruits.head(10) # show the first 10 rows

1. Feature engineering

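Here we discard unneeded features and labels. The extent of feature engineering you would perform on a dataset depends largely on the business context that you are operating in. Note that the two-feature subset below is not used again in this article; the following steps train on all four features in X_fruits.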

# setting up the 2 dimensional array of height and width predictors. other x vars discarded
X_fruits_2d = fruits[['height', 'width']]
y_fruits_2d = fruits['fruit_label'] #labels

2. Train-Test Splitting

Now we’ll split the 59 entries with a 75/25 train-test split, where 75% of the data is used to train the KNN classification model and the remaining 25% is ‘hidden’ from the model and used to validate the results. A 75/25 split is the default split performed by train_test_split.

#75 / 25 train test split
X_train, X_test, y_train, y_test = train_test_split(X_fruits, y_fruits, random_state=0)
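
Since 59 does not divide evenly by four, it is worth checking how many rows land on each side of the split. The arithmetic below assumes that scikit-learn rounds the test count up when test_size is a fraction, which is my understanding of its behaviour.

```python
import math

n_samples = 59
test_size = 0.25  # the default when neither test_size nor train_size is passed

n_test = math.ceil(test_size * n_samples)  # 14.75 rounds up to 15
n_train = n_samples - n_test               # the remaining 44 rows

print(n_train, n_test)
```

So the model is trained on 44 fruits and validated on 15, which is a small test set: a single misclassified fruit moves the test accuracy by almost 7 percentage points.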

3. Feature Min-Max Scaling

Looking at the mass and width features, we notice that they are on different scales: mass values range into the double and triple digits, while width values are typically in the single digits. If feature ranges differ too widely, distance-based methods such as KNN may not work properly, because features with larger values can inadvertently exert a greater influence on the prediction.

Min-Max Scaling is a relatively simple method, similar in spirit to applying Z-score normalization to a distribution. Values are rescaled relative to each feature’s minimum and maximum values; with MinMaxScaler’s default settings, the training values end up in the range 0 to 1.

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# apply the scaling computed on the training set to the test set (no refitting)
X_test_scaled = scaler.transform(X_test)
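
To see what the scaler does under the hood, here is a plain-Python sketch of the formula (x - min) / (max - min). The helper names fit_min_max and apply_min_max and the numbers are made up for illustration, and the sketch assumes that no column is constant (otherwise the denominator would be zero).

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]

def apply_min_max(rows, mins, maxs):
    """Rescale each column with (x - min) / (max - min), using the learned values."""
    return [
        [(x - lo) / (hi - lo) for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Made-up (mass, width) values for illustration only.
train = [[100.0, 6.0], [150.0, 7.0], [200.0, 8.0]]
test = [[250.0, 6.5]]

mins, maxs = fit_min_max(train)                    # learned from the training set only
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)      # reuses the training min and max

print(train_scaled)
print(test_scaled)
```

Because the minimum and maximum come from the training set alone, a test value that is larger than anything seen in training (the mass of 250 here) is scaled to a value above 1. That is expected, and it is the reason the test set is transformed rather than fitted again.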

4. Creating a fitted KNN Object

We create an ‘empty’ KNN classifier model with the first line. The n_neighbors argument controls our ‘K’ value. Next, the model is fit against the scaled X features and their corresponding y labels from the training dataset.

knn = KNeighborsClassifier(n_neighbors = 5) #setting up the KNN model to use 5NN
knn.fit(X_train_scaled, y_train) #fitting the KNN

5. Assess performance

Similar to how the R-squared metric is used to assess the goodness of fit of a simple linear model, we can use accuracy to assess the KNN classifier. The classifier’s score method returns the mean accuracy: the fraction of labels that the model predicts correctly. We can observe that the model predicts labels correctly 95% of the time on the training data and 100% of the time on the held-out test dataset.

#Checking performance on the training set
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train_scaled, y_train)))
#Checking performance on the test set
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test_scaled, y_test)))
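
For a scikit-learn classifier, score returns the mean accuracy, so the numbers printed above are simply the share of correct predictions. A plain-Python sketch of that calculation, with a hypothetical accuracy helper and made-up labels, looks like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: 7 of the 8 predictions are correct.
y_true = [1, 1, 2, 3, 3, 4, 4, 4]
y_pred = [1, 1, 2, 3, 1, 4, 4, 4]

print('Accuracy: {:.2f}'.format(accuracy(y_true, y_pred)))
```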

6. Making a prediction

Finally, we want to use the model to make predictions. Given a fruit with height, width, mass and color_score of 5.5, 2.2, 10 and 0.70 respectively (the same column order the scaler was fitted on), what fruit is this? After the appropriate Min-Max scaling, the model predicts that it is a mandarin.

example_fruit = [[5.5, 2.2, 10, 0.70]]
example_fruit_scaled = scaler.transform(example_fruit)
# making a prediction based on x values
print('Predicted fruit type for ', example_fruit, ' is ', 
          target_names_fruits[knn.predict(example_fruit_scaled)[0]-1])
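
The -1 in the last line deserves a quick note. The indexing only works if fruit_label starts at 1 while Python lists start at 0, which is what I am assuming here. The helper label_to_name and the predicted value below are hypothetical stand-ins for the model output.

```python
target_names_fruits = ['apple', 'mandarin', 'orange', 'lemon']

def label_to_name(fruit_label):
    """Map a 1-based fruit_label to its name in the 0-based list."""
    return target_names_fruits[fruit_label - 1]

predicted_label = 2  # hypothetical value standing in for knn.predict(...)[0]
print(label_to_name(predicted_label))
```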

7. Plotting

Let’s look at visualizing the data using the matplotlib library. We can also examine how the model behaves for varying values of K. Below I have plotted K = 1, 5 and 11.

Generally, we can observe that a lower value of K results in more overfitting of the training data. With lower K values, the model tries to fit every individual training point. As a result, the boundaries between classification regions are jagged and change drastically with local changes in the data.

Looking at the KNN with K = 11, we can see that the border between classification regions is relatively smoother. This KNN model is better at capturing the global trend, which should make it generalise better to a held-out test set.

# adspy_shared_utilities is a separate helper module, not part of scikit-learn
from adspy_shared_utilities import plot_two_class_knn
# X_C2, y_C2: a two-class dataset that is not defined in this article
X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2,
                                                    random_state=0)
plot_two_class_knn(X_train, y_train, 1, 'uniform', X_test, y_test)
plot_two_class_knn(X_train, y_train, 5, 'uniform', X_test, y_test)
plot_two_class_knn(X_train, y_train, 11, 'uniform', X_test, y_test)
K = 1 , 5 , 11
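
The same effect can be reproduced without any plotting. In the plain-Python sketch below, a single class-1 point sits inside a cluster of class-0 points. The data and the knn_predict helper are made up for illustration and are not taken from the dataset plotted above.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# A cluster of class 0 around the origin with one stray class-1 point inside it,
# plus a separate cluster of class 1 further away.
train_X = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2), (-0.2, -0.1),
           (0.05, 0.05),
           (3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]
train_y = [0, 0, 0, 0, 0,
           1,
           1, 1, 1]

query = (0.06, 0.06)  # right next to the stray point
prediction_k1 = knn_predict(train_X, train_y, query, k=1)
prediction_k5 = knn_predict(train_X, train_y, query, k=5)
print(prediction_k1, prediction_k5)
```

With K = 1 the prediction follows the single stray point, while with K = 5 the surrounding class-0 points outvote it. That is the jagged versus smooth behaviour from the plots in miniature.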

Endnotes

I was able to learn this through the MOOC “Applied Machine Learning in Python” by the University of Michigan, hosted by Coursera.

Do feel free to reach out to me on LinkedIn (https://www.linkedin.com/in/dehan-c-948045177/) if you have questions or would like to discuss the post-Covid-19 world!

I hope that I was able to help you in learning about data science methods in one way or another!

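Here’s another data science article for you: Linear regressions with scikit-learn (https://readmedium.com/linear-regressions-with-scikitlearn-a5d54efe898f)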
