Salma El Shahawy

Hands-on Guide to Plotting a Decision Surface for ML in Python

Utilize matplotlib to visualize decision boundaries for classification algorithms in Python

Photo by Jan Canty on Unsplash

Introduction

For a while, I struggled to visualize the output of a classification model. I relied only on the classification report and the confusion matrix to weigh the model's performance.

However, visualizing the classification results has its charm and makes them easier to interpret. So I built a decision surface, and when I succeeded, I decided to write about it, both as a learning exercise and for anyone who might be stuck on the same issue.

Tutorial content

In this tutorial, I will start with the built-in datasets package in the Sklearn library to focus on the implementation steps. After that, I will use a pre-processed dataset (without missing data or outliers) to plot the decision surface after applying the standard scaler.

  • Decision Surface
  • Importing important libraries
  • Dataset generation
  • Generating decision surface
  • Applying to real data

Decision Surface

Classification in machine learning means training a model to assign labels to input examples.

Each input feature defines an axis of the feature space. With two input features, the feature space is a plane, and each input example is a point in that plane. With three input variables, the feature space would be a three-dimensional volume.

The ultimate goal of classification is to separate the feature space so that labels are assigned to points in the feature space as correctly as possible.

The resulting partition is called a decision surface or decision boundary, and plotting it is a useful way to explain a model on a classification predictive modeling task. If you have more than two input features, you can create a decision surface for each pair of input features.
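With more than two features, one surface is drawn per feature pair. Here is a minimal sketch of enumerating those pairs, assuming the features are simply identified by their column index (the helper name feature_pairs is hypothetical and not part of any library):

```python
from itertools import combinations


def feature_pairs(n_features):
    """Return every unordered pair of feature column indexes."""
    return list(combinations(range(n_features), 2))


# four features give six pairs, so six separate decision surfaces
pairs = feature_pairs(4)
print(pairs)
```

Each pair can then be fed through the same plotting steps used in the rest of this tutorial.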

Importing important libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

Generate dataset

I will use the make_blobs() function from the datasets module of the Sklearn library to generate a custom dataset. This lets us focus on the implementation rather than on cleaning data; the steps are the same for real data and follow a typical pattern. Let’s start by defining the dataset with 1000 samples, two features, two centers, and a standard deviation of 3 for simplicity’s sake.

X, y = datasets.make_blobs(n_samples = 1000, 
                           centers = 2, 
                           n_features = 2, 
                           random_state = 1, 
                           cluster_std = 3)

Once the dataset is generated, we can draw a scatter plot to see how the two classes are distributed.

# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = np.where(y == class_value)
    # create scatter of these samples
    plt.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
plt.show()

Here we looped over the class labels and plotted the two columns of X against each other, with the points of each class drawn in their own color. In the next step, we need to build a predictive classification model to predict the class of unseen points. Logistic regression works here since we have only two categories.

scatter_plot_1
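The per-class selection in the loop can also be written with a boolean mask instead of np.where(). The sketch below uses a tiny made-up array in place of the generated X and y, so the numbers are for illustration only:

```python
import numpy as np

# toy data standing in for the generated X and y (values are made up)
X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([0, 1, 0, 1])

for class_value in np.unique(y):
    mask = y == class_value        # boolean array, one entry per sample
    class_points = X[mask]         # rows belonging to this class, shape (n, 2)
    print(class_value, class_points[:, 0], class_points[:, 1])
```

The two columns of class_points are what would be passed to plt.scatter() for that class.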

Develop the logistic regression model

regressor = LogisticRegression()
# fit the regressor into X and y
regressor.fit(X, y)
# apply the predict method 
y_pred = regressor.predict(X)

The predictions in y_pred can be evaluated using the accuracy_score function from the sklearn library.

accuracy = accuracy_score(y, y_pred)
print('Accuracy: %.3f' % accuracy)
## Accuracy: 0.972
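For intuition, accuracy is just the share of predictions that match the true labels. The sketch below reproduces that calculation with plain numpy on made-up labels (not the values from the model above). Note that in this section the model is scored on the same data it was fit on, so the figure tends to be optimistic.

```python
import numpy as np

# made-up labels for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# accuracy is the share of predictions that match the true label
accuracy = np.mean(y_true == y_hat)
print('Accuracy: %.3f' % accuracy)
```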

Generating decision surface

matplotlib provides a handy function called contourf(), which fills the regions between contour levels with color. However, as the documentation suggests, we first need to define a grid of points covering the feature space. The starting point is to find the minimum and maximum value of each feature, then pad each by one to make sure the whole space is covered.

min1, max1 = X[:, 0].min() - 1, X[:, 0].max() + 1 #1st feature
min2, max2 = X[:, 1].min() - 1, X[:, 1].max() + 1 #2nd feature

Then we can define the scale of the coordinates using the arange() function from the numpy library with a 0.1 resolution to get the scale range.

x1_scale = np.arange(min1, max1, 0.1)
x2_scale = np.arange(min2, max2, 0.1)

The next step would be converting x1_scale and x2_scale into a grid. The function meshgrid() within the numpy library is what we need.

x_grid, y_grid = np.meshgrid(x1_scale, x2_scale)
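To see what meshgrid() returns, here is a small sketch with two tiny scales. The values are arbitrary and chosen only so the output is easy to read:

```python
import numpy as np

# tiny scales so the result is easy to read (values are arbitrary)
x1_demo = np.arange(0, 3, 1)      # [0, 1, 2]; the stop value is excluded
x2_demo = np.arange(10, 12, 1)    # [10, 11]

xx, yy = np.meshgrid(x1_demo, x2_demo)
print(xx)        # every row is a copy of x1_demo
print(yy)        # every column is a copy of x2_demo
print(xx.shape)  # (len(x2_demo), len(x1_demo))
```

Pairing xx[i, j] with yy[i, j] gives the coordinates of one grid point, which is why both arrays have the same shape.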

The generated x_grid and y_grid are 2-D arrays. To feed them to the model, we need to reduce each to a one-dimensional array using the flatten() method of numpy arrays, and then reshape each into a single column.

# flatten each grid to a vector
x_g, y_g = x_grid.flatten(), y_grid.flatten()
x_g, y_g = x_g.reshape((len(x_g), 1)), y_g.reshape((len(y_g), 1))

Finally, we stack the vectors side by side as columns of an input dataset, like the original dataset but at a much higher resolution.

grid = np.hstack((x_g, y_g))
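The flatten, reshape, and hstack steps can be collapsed into one call. The sketch below, on a small arbitrary grid, checks that np.column_stack() produces the same array as the longer form:

```python
import numpy as np

xx, yy = np.meshgrid(np.arange(0, 3), np.arange(10, 12))

# long form, as in the walkthrough: flatten, reshape to columns, hstack
a = xx.flatten().reshape(-1, 1)
b = yy.flatten().reshape(-1, 1)
grid_long = np.hstack((a, b))

# shorter form: column_stack turns 1-D arrays into columns directly
grid_short = np.column_stack((xx.ravel(), yy.ravel()))

print(grid_short.shape)
print(np.array_equal(grid_long, grid_short))
```

Either form gives one row per grid point and one column per feature, which is the layout predict() expects.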

Now, we can feed the grid into the model to predict values.

# make predictions for the grid
y_pred_2 = regressor.predict(grid)
# predict the probability
p_pred = regressor.predict_proba(grid)
# keep just the probabilities for class 0
p_pred = p_pred[:, 0]
# reshape the results to match the grid
pp_grid = p_pred.reshape(x_grid.shape)
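To make the probability grid less of a black box, here is a sketch of what a binary logistic model computes at each grid point. The weights w and intercept b are hypothetical values picked for illustration; they are not the parameters learned above.

```python
import numpy as np

# hypothetical fitted parameters, chosen only for illustration
w = np.array([1.0, -2.0])
b = 0.5

xx, yy = np.meshgrid(np.arange(-2, 2, 0.5), np.arange(-2, 2, 0.5))
points = np.column_stack((xx.ravel(), yy.ravel()))

score = points @ w + b                     # linear score for every grid point
p_class1 = 1.0 / (1.0 + np.exp(-score))    # sigmoid gives P(class 1)
p_class0 = 1.0 - p_class1                  # binary case: the two sum to 1

labels = (p_class1 >= 0.5).astype(int)     # 0.5 threshold <=> score >= 0
pp = p_class0.reshape(xx.shape)            # back to grid shape for contourf
print(pp.shape)
```

The boundary between the two classes is where the probability equals 0.5, which for this kind of model is the straight line where the linear score is zero.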

Now, a grid of coordinates and the predicted probability of class 0 across the feature space has been generated.

Subsequently, we will plot the grid as a contour plot using contourf(). The contourf() function needs a separate grid per axis. To achieve that, we use x_grid and y_grid together with the predictions reshaped to the same shape (pp_grid).

# plot the grid of x, y and z values as a surface
surface = plt.contourf(x_grid, y_grid, pp_grid, cmap='Pastel1')
plt.colorbar(surface)
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = np.where(y == class_value)
    # create scatter of these samples (each call gets the next default color)
    plt.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
plt.show()
decision_surface for two features

Apply to real data

Now it is time to apply the previous steps to real data to connect everything. As I mentioned earlier, this dataset is already cleaned with no missing points. The dataset represents car purchase history for a sample of people according to their age and salary per year.

dataset = pd.read_csv('../input/logistic-reg-visual/Social_Network_Ads.csv')
dataset.head()
Social_Network_Ads dataset

The dataset has two features, Age and EstimatedSalary, and one dependent variable, purchased, as a binary column. A value of 0 means that a person of that age and salary did not purchase a car, while 1 means that the person did. The next step is to separate the features from the dependent variable as X and y.

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(
                                               X, y, 
                                               test_size = 0.25,
                                               random_state = 0)

Feature scaling

We need this step because age and salary are not on the same scale.

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
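Standard scaling subtracts the mean of each column and divides by its standard deviation, with both statistics learned from the training rows only. Here is a plain numpy sketch of the same idea on made-up [age, salary] rows (not the real dataset):

```python
import numpy as np

# made-up [age, salary] rows standing in for the real split
train = np.array([[20.0, 20000.0], [30.0, 40000.0],
                  [40.0, 60000.0], [50.0, 80000.0]])
test = np.array([[35.0, 50000.0]])

mean = train.mean(axis=0)   # learned from the training rows only
std = train.std(axis=0)     # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(train_scaled.mean(axis=0))
print(test_scaled)
```

This is why the code above calls fit_transform() on the training set but only transform() on the test set.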

Build the logistic model and fit the training data

classifier = LogisticRegression(random_state = 0)
# fit the classifier into train data
classifier.fit(X_train, y_train)
# predicting the value of y 
y_pred = classifier.predict(X_test)
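Before plotting, the test predictions can be checked with the accuracy_score and confusion_matrix functions imported at the top. The sketch below uses made-up label lists in place of y_test and y_pred, so the numbers do not describe the real dataset:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up labels standing in for y_test and y_pred
y_true_demo = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0, 0, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
# rows are true classes, columns are predicted classes
print(cm)
print(accuracy_score(y_true_demo, y_pred_demo))
```

With the real data, the same two calls would take y_test and y_pred as arguments.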

Plot the decision surface — training results

#1. reverse the standard scaler on X_train to get the original units
X_set, y_set = sc.inverse_transform(X_train), y_train
#2. Generate decision surface boundaries
min1, max1 = X_set[:, 0].min() - 10, X_set[:, 0].max() + 10 # for Age
min2, max2 = X_set[:, 1].min() - 1000, X_set[:, 1].max() + 1000 # for salary
#3. Set the coordinate step (resolution) for each axis
#   note: a 0.25 step over the salary range gives a very large grid;
#   use a larger salary step if memory is limited
x_scale, y_scale = np.arange(min1, max1, 0.25), np.arange(min2, max2, 0.25)
#4. Build the coordinate grids
X1, X2 = np.meshgrid(x_scale, y_scale)
#5. Flatten X1 and X2 and stack them as the two rows of a numpy array
X_flatten = np.array([X1.ravel(), X2.ravel()])
#6. Scale the grid points the same way as the training data
X_transformed = sc.transform(X_flatten.T)
#7. Predict on the grid and reshape the result to the shape of X1
Z_pred = classifier.predict(X_transformed).reshape(X1.shape)
#8. set the plot size
plt.figure(figsize=(20,10))
#9. plot the filled contour
plt.contourf(X1, X2, Z_pred,
                     alpha = 0.75, 
                     cmap = ListedColormap(('#386cb0', '#f0027f')))
#10. setting the axes limit
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
#11. scatter plot of the points ([age, salary], colored by actual class in the training set)
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], 
                X_set[y_set == j, 1], 
                color = ListedColormap(('red', 'green'))(i), 
                label = j)
    
#12. plot labels and adjustments
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
decision surface — training set
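Because salary spans a far wider range than age, the step size has a big effect on how many grid points get predicted. The sketch below counts the points before any grid is built. The padded ranges are hypothetical round numbers, not values read from the dataset, and grid_points is a helper name I made up for this example.

```python
import math


def grid_points(lo1, hi1, lo2, hi2, step1, step2):
    """Number of points np.meshgrid would produce for the two ranges."""
    n1 = math.ceil((hi1 - lo1) / step1)   # length of np.arange(lo1, hi1, step1)
    n2 = math.ceil((hi2 - lo2) / step2)   # length of np.arange(lo2, hi2, step2)
    return n1 * n2


# hypothetical padded ranges: age 8..70, salary 14000..151000
fine = grid_points(8, 70, 14000, 151000, 0.25, 0.25)
coarse = grid_points(8, 70, 14000, 151000, 0.25, 500)
print(fine, coarse)
```

Under these assumed ranges, a salary step of 500 cuts the grid from over a hundred million points to tens of thousands, and a step that small relative to the salary range should leave the picture looking much the same.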

Decision plot for test set

The code is exactly the same as before, but with the test set in place of the training set.

Decision plot — test set
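Since the training and test plots share the same code, the steps can be wrapped in one function. This is a sketch under my own assumptions rather than the code used for the figures: the function name, the predict_fn argument, the padding, and the coarser salary step of 500 are all choices made for this example, and the stand-in data and rule at the bottom are made up so the block runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap


def plot_decision_surface(predict_fn, X_set, y_set, title,
                          steps=(0.25, 500.0), pads=(10, 1000)):
    """Draw predicted regions and the samples in original feature units.

    predict_fn: callable mapping an (n, 2) array to n class labels.
    """
    min1, max1 = X_set[:, 0].min() - pads[0], X_set[:, 0].max() + pads[0]
    min2, max2 = X_set[:, 1].min() - pads[1], X_set[:, 1].max() + pads[1]
    X1, X2 = np.meshgrid(np.arange(min1, max1, steps[0]),
                         np.arange(min2, max2, steps[1]))
    grid = np.column_stack((X1.ravel(), X2.ravel()))
    Z = predict_fn(grid).reshape(X1.shape)

    fig, ax = plt.subplots(figsize=(20, 10))
    ax.contourf(X1, X2, Z, alpha=0.75,
                cmap=ListedColormap(('#386cb0', '#f0027f')))
    # this sketch handles two classes only, one color per class
    for color, label in zip(('red', 'green'), np.unique(y_set)):
        ax.scatter(X_set[y_set == label, 0], X_set[y_set == label, 1],
                   color=color, label=label)
    ax.set_title(title)
    ax.set_xlabel('Age')
    ax.set_ylabel('Estimated Salary')
    ax.legend()
    return fig, Z


# stand-in data and rule so the sketch runs on its own (both are made up)
X_demo = np.array([[25.0, 30000.0], [35.0, 60000.0],
                   [45.0, 90000.0], [55.0, 120000.0]])
y_demo = np.array([0, 0, 1, 1])
fig, Z = plot_decision_surface(lambda pts: (pts[:, 0] > 40).astype(int),
                               X_demo, y_demo, 'Demo rule (made-up data)')
print(Z.shape)
```

With the objects from this tutorial, the test-set call would look like plot_decision_surface(lambda pts: classifier.predict(sc.transform(pts)), sc.inverse_transform(X_test), y_test, 'Logistic Regression (Test set)').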

Conclusion

Finally, I hope this boilerplate helps in visualizing the results of a classification model. I recommend applying the same steps using another classification model, for example, an SVM with more than two features. Thanks for reading. I am looking forward to any constructive comments.

References

  1. Sklearn.datasets API
  2. Utilizing pandas to transform data
  3. matplotlib.contour() API
  4. numpy.meshgrid() API
  5. Plot the decision surface of a decision tree on the iris dataset — sklearn example
  6. Full working Kaggle notebook
  7. GitHub repo