Naina Chaturvedi

Summary

The web content outlines a tutorial on using Yellowbrick for machine learning visualization, emphasizing the importance of visual tools for model evaluation, and provides a sneak peek into upcoming projects and resources in data science and machine learning.

Abstract

The provided web content is part of a series on data science and machine learning, focusing on Day 52 of the series. It introduces Yellowbrick, a visualization library that integrates with scikit-learn to create visualizations for the machine learning workflow. The article guides readers through installing Yellowbrick, loading a tabular data set, and using various visualization tools to assess feature importance, target distribution, model performance, cross-validation scores, and learning curves. The author encourages reader participation, regardless of technical skill level, and teases upcoming content such as machine learning pipelines, recurrent neural networks, clustering geolocation data, facial expression recognition, hyperparameter tuning, and custom Keras layers. The content concludes with a motivational quote, reinforcing the passion required for success in tech endeavors.

Opinions

  • The author values the role of visualization in understanding and evaluating machine learning models, as evidenced by the detailed walkthrough of Yellowbrick's features.
  • There is an emphasis on community engagement, with open invitations for contributions such as bug reports, user testing, and feature requests for Yellowbrick.
  • The author suggests that Yellowbrick's visualizations are accessible to individuals with varying levels of technical expertise, indicating an inclusive approach to data science education.
  • The mention of a tech newsletter and the encouragement to subscribe imply that the author sees ongoing communication and learning as crucial components of the tech community.
  • By providing a roadmap of future topics, the author indicates a commitment to continuous learning and sharing knowledge within the field of data science and machine learning.
  • The inclusion of a quote by Steve Jobs at the end of the content reflects the author's belief in the importance of passion and perseverance in technological pursuits.

Day 52: 60 days of Data Science and Machine Learning Series

Yellowbrick combines scikit-learn with matplotlib: it extends the scikit-learn API with visualizers that produce plots for each stage of the machine learning workflow. A good reference point to understand the breadth of Yellowbrick and how to use it:


You can install yellowbrick using the command below —

$ pip install yellowbrick

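In this post, we will use Yellowbrick to evaluate a regression model on a tabular data set of concrete mixtures, where the target column is strength. We will look at feature importances, the distribution of the target, prediction error, cross-validation scores, and learning curves.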

Let’s dive in!

Import necessary libraries

# Notebook magic: render figures inline (Jupyter/IPython only)
%matplotlib inline

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams

from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import KFold, train_test_split

# Yellowbrick 1.0+ exposes FeatureImportances from model_selection
# (older releases used yellowbrick.features.importances)
from yellowbrick.model_selection import CVScores, FeatureImportances, LearningCurve
from yellowbrick.regressor import PredictionError
from yellowbrick.target import BalancedBinningReference

# Plot styling, default figure size, and quieter output
sns.set(style="ticks", color_codes=True)
rcParams['figure.figsize'] = 15, 10
warnings.simplefilter('ignore')

Load Data

df = pd.read_csv('Path to the data file/data.csv')
df.info()

Output —

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
cement      1030 non-null float64
slag        1030 non-null float64
ash         1030 non-null float64
water       1030 non-null float64
splast      1030 non-null float64
coarse      1030 non-null float64
fine        1030 non-null float64
age         1030 non-null int64
strength    1030 non-null float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB

Data Preprocessing

# Feature columns and the regression target
f = ['cement', 'ash', 'slag', 'water', 'coarse', 'splast', 'fine', 'age']
target = 'strength'
X = df[f]
y = df[target]

Pairwise Scatterplot

sns.pairplot(df)

Output —

Feature Importances

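fig = plt.figure()
ax = fig.add_subplot()
# Title-case the column names for the axis labels
labels = [name.title() for name in f]
# relative=False plots the raw Lasso coefficients
v = FeatureImportances(Lasso(), ax=ax, labels=labels, relative=False)
v.fit(X, y)
v.show()  # poof() was renamed to show() in Yellowbrick 1.0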

Output —
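To make the relative flag more concrete, here is a minimal plain-Python sketch of the rescaling that relative=True (the default) is documented to apply: each importance is expressed as a percentage of the strongest one. The coefficient values and the helper name relative_importances are made up for illustration; they are not fitted on this data set and not part of Yellowbrick.

```python
# Hypothetical Lasso coefficients, for illustration only.
# These are NOT values fitted on the concrete data set.
coefs = {"Cement": 0.12, "Water": -0.15, "Age": 0.11, "Fine": 0.0}


def relative_importances(coefficients):
    """Scale each coefficient to a percentage of the largest absolute one."""
    largest = max(abs(c) for c in coefficients.values())
    if largest == 0:
        # All coefficients are zero, so there is nothing to rescale.
        return {name: 0.0 for name in coefficients}
    return {name: 100.0 * c / largest for name, c in coefficients.items()}


relative = relative_importances(coefs)

# Print from most negative to most positive, like the bars in the plot.
for name, value in sorted(relative.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: {value:7.1f}")
```

With relative=False, as used above, the bars show the raw coefficients instead, which keeps their original units and signs.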

Visualize the target

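# Histogram of the target with reference lines at suggested bin edges
v = BalancedBinningReference()
v.fit(y)
v.show()  # poof() was renamed to show() in Yellowbrick 1.0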

Output —
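The idea behind balanced binning is to pick bin edges so that each bin holds a similar number of samples. The sketch below shows one common way to do that, using quantiles from the standard library. It is an illustration of the concept on made-up values (y_demo), not a claim about how Yellowbrick computes its reference lines.

```python
import bisect
import random
import statistics

rng = random.Random(0)

# Made-up target values standing in for `y`; not the real strength column.
y_demo = [rng.gauss(35.0, 16.0) for _ in range(100)]

# Three quartile cut points split the values into four bins.
edges = statistics.quantiles(y_demo, n=4)

# Count how many samples fall into each bin.
counts = [0] * (len(edges) + 1)
for value in y_demo:
    counts[bisect.bisect_right(edges, value)] += 1

print("bin edges:", [round(e, 2) for e in edges])
print("samples per bin:", counts)
```

Equal-width bins on a skewed target would leave some bins nearly empty; quantile edges avoid that at the cost of bins with different widths.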

Evaluate Lasso Regression

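# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
v = PredictionError(Lasso(), size=(600, 400))
v.fit(X_train, y_train)
v.score(X_test, y_test)  # draws predicted vs. actual values and returns the R2 score
v.show()  # finalizes and displays the figure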

Output —
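The score call above returns the R2 (coefficient of determination) of the fitted model on the test split. As a rough sketch of what that number means, here is R2 computed by hand in plain Python on made-up values; the helper name r2_score mirrors scikit-learn's function of the same name, but this is a simplified stand-in rather than the library code.

```python
def r2_score(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets, for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]

perfect = r2_score(y_true, y_true)         # predictions match exactly
baseline = r2_score(y_true, [25.0] * 4)    # always predict the mean
off = r2_score(y_true, [12.0, 18.0, 33.0, 37.0])

print(perfect, baseline, round(off, 3))
```

A model that always predicts the mean of the target scores 0, a perfect model scores 1, and a model worse than the mean baseline scores below 0.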

Cross Validation Scores

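_, ax = plt.subplots()
# 12-fold cross-validation on the training split, scored with R2
cv = KFold(12)
v = CVScores(Lasso(), ax=ax, cv=cv, scoring='r2')
v.fit(X_train, y_train).show()  # poof() was renamed to show() in Yellowbrick 1.0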

Output —
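Each bar in the cross-validation plot is the score on one held-out fold. The sketch below shows, in plain Python, how an unshuffled k-fold split partitions row indices into consecutive folds, following the rule scikit-learn documents for KFold (the first n % k folds get one extra sample). The function name kfold_indices and the 721-row size are assumptions for illustration; 721 is roughly what a 70% split of 1030 rows leaves for training.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for consecutive, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    # Spread the remainder over the first folds, one extra sample each.
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Hypothetical training-set size, used only for this illustration.
folds = list(kfold_indices(721, 12))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

With 12 folds each model is trained on about 11/12 of the training rows, so the scores are based on small test folds and can vary noticeably from bar to bar.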

Learning Curves

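# Ten training-set sizes, from 30% to 100% of the available training rows
s = np.linspace(0.3, 1.0, 10)
v = LearningCurve(LassoCV(), train_sizes=s, scoring='r2')
v.fit(X, y).show()  # poof() was renamed to show() in Yellowbrick 1.0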

Output —
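The train_sizes fractions are relative to the largest training set available inside each cross-validation split, not to the full data set. Here is a hedged sketch of that conversion, assuming 5-fold cross-validation on 1030 rows (so at most 824 training rows per split). The helper train_size_counts is hypothetical, and it uses exact fractions, so results right at a rounding boundary may differ from scikit-learn's floating-point arithmetic.

```python
from fractions import Fraction


def train_size_counts(start, stop, num, n_max):
    """Turn evenly spaced fractions into absolute training-set sizes."""
    step = (stop - start) / (num - 1)
    fractions = [start + i * step for i in range(num)]
    # Truncate to whole samples.
    return [int(frac * n_max) for frac in fractions]


# Assumption: 1030 rows with 5 folds leave 824 rows for training per split.
n_max_train = 1030 - 1030 // 5

counts = train_size_counts(Fraction(3, 10), Fraction(1), 10, n_max_train)
print(counts)
```

Each point on the learning curve is then a model trained on that many rows and scored on the held-out fold, which is why the x-axis of the plot shows sample counts rather than fractions.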

Learnings —

How to evaluate regression models with Yellowbrick's visual tools: feature importances, target binning, prediction error, cross-validation scores, and learning curves

Day 53: Coming soon!

Follow and Stay tuned. Keep coding :)

For other projects, tune to —

Build Machine Learning Pipelines( With Code)

Recurrent Neural Network with Keras

Clustering Geolocation Data in Python using DBSCAN and K-Means

Facial Expression Recognition using Keras

Hyperparameter Tuning with Keras Tuner

Custom Layers in Keras

That’s it fellas. Peace out and keep coding :)

Stay tuned, and of course let me end this post with a quote by Steve Jobs ;)

“You have to be burning with an idea, or a problem, or a wrong that you want to right. If you’re not passionate enough from the start, you’ll never stick it out.”
