Benjamin Obi Tayo Ph.D.

Image by Benjamin O. Tayo

Data Science

Linear Regression Basics for Absolute Beginners

Tutorial on simple and multiple regression analysis using NumPy, Pylab, and Scikit-learn

1. Introduction

Regression models are among the most widely used machine learning models. They predict target variables on a continuous scale and find applications in almost every field of study. This article discusses the basics of linear regression and is intended for beginners in the field of data science.

Using the cruise ship dataset cruise_ship_info.csv, we will demonstrate simple and multiple regression analysis using NumPy, Pylab, and Scikit-learn. Because this is an introductory tutorial, we make no distinction between inliers and outliers (outliers can be handled using more robust methods such as RANSAC regression).

2. Data Analysis

2.1 Import Necessary Libraries

import numpy as np
import pandas as pd
import pylab
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Pipeline used in Section 4: standardize the features, then fit a linear regression
pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('lr', LinearRegression())])

2.2 Read dataset and display columns

df = pd.read_csv("cruise_ship_info.csv")
df.head()
Table 1. First 5 rows of the dataset.

2.3 Calculate the covariance matrix

cols = ['Age', 'Tonnage', 'passengers', 'length',
        'cabins', 'passenger_density', 'crew']

# Standardize the columns so that the covariance matrix approximates the correlation matrix
stdsc = StandardScaler()
X_std = stdsc.fit_transform(df[cols].values)
cov_mat = np.cov(X_std.T)
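The heatmap title in the next section refers to correlation coefficients even though the code computes a covariance matrix. The sketch below illustrates why that works: after standardization, the covariance matrix equals the correlation matrix up to a factor of n / (n - 1). The data here is synthetic and stands in for df[cols].values; the column construction is purely an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for df[cols].values: 50 rows, 3 columns, two of them correlated.
base = rng.normal(size=(50, 1))
data = np.hstack([base + 0.1 * rng.normal(size=(50, 1)),
                  2.0 * base + rng.normal(size=(50, 1)),
                  rng.normal(size=(50, 1))])

# Standardize each column the way StandardScaler does (population std, ddof=0).
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

cov_mat = np.cov(data_std.T)      # sample covariance, divides by n - 1
corr_mat = np.corrcoef(data.T)    # Pearson correlation matrix of the raw data

n = data.shape[0]
# The two matrices agree once the n / (n - 1) factor is removed.
max_diff = np.abs(cov_mat * (n - 1) / n - corr_mat).max()
print(max_diff < 1e-12)
```

For a dataset with many rows the factor n / (n - 1) is close to 1, so the annotated values in the heatmap can be read as correlation coefficients to two decimal places.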

2.4 Generate a heatmap for visualizing the covariance matrix

plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cov_mat,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 12},
                 yticklabels=cols,
                 xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients')
plt.tight_layout()
plt.show()
Figure 1. Covariance matrix plot.

3. Simple Linear Regression

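In simple linear regression, there is only one predictor variable. Our goal is to predict the crew variable, and Figure 1 shows that cabins is the variable most strongly correlated with crew. Hence our simple regression model can be expressed in the form:

$$\hat{y} = mX + c$$

where X is the cabins variable, \hat{y} is the predicted crew, m is the slope or regression coefficient, and c is the intercept. The model will be evaluated using the R2 score metric, which can be calculated as follows:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where y_i are the actual values, \hat{y}_i are the predicted values, and \bar{y} is the mean of the actual values.

For a least-squares fit evaluated on its own training data, the R2 score takes values between 0 and 1 (on held-out data it can become negative when the model predicts worse than the mean of the actual values). When R2 is close to 1, the predicted values agree closely with the actual values. When R2 is close to zero, the predictive power of the model is very poor.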
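As a minimal sketch of the R2 calculation, the helper below (the name r2_from_formula is hypothetical) computes one minus the ratio of the residual sum of squares to the total sum of squares, which is the same expression used in the code later in this article. The toy arrays are made up. A least-squares fit with an intercept stays between 0 and 1 on its training data, while arbitrary predictions that are worse than the mean score below zero.

```python
import numpy as np

def r2_from_formula(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])              # made-up target values

perfect = r2_from_formula(y_true, y_true)                        # exact predictions
mean_only = r2_from_formula(y_true, np.full(4, y_true.mean()))   # always predict the mean
worse = r2_from_formula(y_true, y_true[::-1])                    # reversed predictions

print(perfect, mean_only, worse)
```

Exact predictions give 1, always predicting the mean gives 0, and the reversed predictions give a negative score.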

Let’s now define and plot our independent and dependent variables:

X = df['cabins']
y = df['crew']
plt.scatter(X,y,c='steelblue', edgecolor='white', s=70)
plt.xlabel('cabins')
plt.ylabel('crew')
plt.title('scatter plot of crew vs. cabins')
plt.show()
Figure 2. Scatter plot of crew vs. cabins.

3.1 Simple linear regression using NumPy

z = np.polyfit(X,y,1)
p = np.poly1d(z)
print(p)

Output: 0.745 x + 1.216

This shows that the fitted slope is m = 0.745, and the intercept is c = 1.216.
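To make the slope and intercept less of a black box, here is a sketch of the closed-form least-squares estimates for a straight line, checked against np.polyfit. The data is synthetic (the coefficients 0.75 and 1.2 are arbitrary choices, not the cruise ship values), so the numbers differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=40)                    # hypothetical predictor
y = 0.75 * x + 1.2 + rng.normal(0, 0.5, size=40)   # hypothetical noisy target

# Closed-form least-squares estimates for y = m*x + c:
# slope = sum of cross-deviations / sum of squared x-deviations
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# the fitted line passes through the point (mean of x, mean of y)
c = y.mean() - m * x.mean()

# np.polyfit returns the highest-degree coefficient first: [slope, intercept]
m_np, c_np = np.polyfit(x, y, 1)

print(np.allclose([m, c], [m_np, c_np]))
```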

y_pred_numpy = p(X)
R2_numpy = 1 - ((y-y_pred_numpy)**2).sum()/((y-y.mean())**2).sum()
print(R2_numpy)

Output: R2_numpy = 0.9040636287611352

print(r2_score(y, y_pred_numpy))

Output: 0.9040636287611352

Let’s now plot the actual and predicted values:

plt.figure(figsize=(10,7))
plt.scatter(X,y,c='steelblue', edgecolor='white', s=70, 
             label='actual')
plt.plot(X,y_pred_numpy, color='black', lw=2, label='predicted')
plt.xlabel('cabins')
plt.ylabel('crew')
plt.title('actual and fitted plots')
plt.legend()
plt.show()
Figure 3. Actual and fitted plots for crew vs. cabins.

3.2 Simple linear regression using Pylab

degree = 1
model = pylab.polyfit(X,y,degree)
print(model)

Output: array([0.7449974 , 1.21585013]). We see again that the slope is m = 0.745, and the intercept is c = 1.216.

y_pred_pylab = pylab.polyval(model,X)
R2_pylab = 1 - ((y-y_pred_pylab)**2).sum()/((y-y.mean())**2).sum()
print(R2_pylab)

Output: R2_pylab = 0.9040636287611352

print(r2_score(y, y_pred_pylab))

Output: 0.9040636287611352

3.3 Simple linear regression using scikit-learn

lr = LinearRegression()
lr.fit(X.values.reshape(-1,1),y)
print(lr.coef_)
print(lr.intercept_)

Output: [0.7449974], 1.2158501299368671. We see again that the slope is m = 0.745, and the intercept is c = 1.216.

y_pred_sklearn = lr.predict(X.values.reshape(-1,1))
R2_sklearn = 1 - ((y-y_pred_sklearn)**2).sum()/((y-y.mean())**2).sum()
print(R2_sklearn)

Output: R2_sklearn = 0.9040636287611352

print(r2_score(y, y_pred_sklearn))

Output: 0.9040636287611352

We observe that all 3 methods for basic linear regression (NumPy, Pylab, and Scikit-learn) gave consistent results.

4. Multiple Linear Regression with Scikit-Learn

From the covariance matrix plot above (Figure 1), we see that the “crew” variable correlates strongly (correlation coefficient ≥ 0.6) with 4 predictor variables: “Tonnage”, “passengers”, “length”, and “cabins”. We can, therefore, build a multiple regression model of the form:

$$\hat{y} = w_0 + \sum_{i=1}^{4} w_i X_i$$

where X is the features matrix with columns X_1 through X_4, w_0 is the intercept, and w_1, w_2, w_3, and w_4 are the regression coefficients.
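The sketch below shows one way the intercept and the four coefficients of such a model can be estimated and then used for prediction, using only NumPy. It is not the scikit-learn code used in the following sections; the features matrix, the coefficient values, and the noise-free target are all made up so that the recovered weights can be compared with the ones that generated the data.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))                # hypothetical features matrix (4 predictors)
w_true = np.array([0.5, -1.0, 2.0, 0.3])     # hypothetical coefficients w_1..w_4
w0_true = 1.5                                # hypothetical intercept w_0
y = w0_true + X @ w_true                     # noise-free target, for clarity

# Prepend a column of ones so that the intercept is estimated together with the weights.
A = np.column_stack([np.ones(len(X)), X])
w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

w0_hat, w_rest = w_hat[0], w_hat[1:]
y_hat = w0_hat + X @ w_rest                  # prediction: intercept plus weighted sum of features

print(np.round(w0_hat, 3), np.round(w_rest, 3))
```

Because the target here contains no noise, the least-squares solution recovers the generating intercept and coefficients up to floating-point error.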

4.1 Define features matrix and the target variable

cols_selected = ['Tonnage', 'passengers', 'length', 'cabins','crew']
df[cols_selected].head()
X = df[cols_selected].iloc[:,0:4].values    # features matrix 
y = df[cols_selected]['crew'].values        # target variable
Table 2. First 5 rows of the selected features and the target variable.

4.2 Model building and evaluation

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize the target using training-set statistics only
sc_y = StandardScaler()
y_train_std = sc_y.fit_transform(y_train[:, np.newaxis]).flatten()

pipe_lr.fit(X_train, y_train_std)

# inverse_transform expects a 2D array, so reshape the 1D predictions first
y_train_pred = sc_y.inverse_transform(pipe_lr.predict(X_train).reshape(-1, 1)).flatten()
y_test_pred = sc_y.inverse_transform(pipe_lr.predict(X_test).reshape(-1, 1)).flatten()

r2_score_train = r2_score(y_train, y_train_pred)
r2_score_test = r2_score(y_test, y_test_pred)
print('R2 train for lr: %.3f' % r2_score_train)
print('R2 test for lr:  %.3f' % r2_score_test)

Output:

R2 train for lr: 0.912
R2 test for lr: 0.958
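The code above standardizes the target before fitting and transforms the predictions back afterwards. The following sketch, written with plain NumPy rather than the Pipeline used in this article, suggests that for ordinary least squares with an intercept this round trip does not change the predictions. The helper fit_predict, the synthetic data, and the 42/18 split are assumptions made for illustration only.

```python
import numpy as np

def fit_predict(X_train, y_train, X_new):
    """Ordinary least squares with an intercept, solved by np.linalg.lstsq."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ w

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))                                   # hypothetical features
y = 2.0 + X @ np.array([1.0, 0.5, -0.7, 0.2]) + rng.normal(0, 0.3, size=60)
X_train, X_test, y_train, y_test = X[:42], X[42:], y[:42], y[42:]

# Route 1: fit directly on the raw target.
pred_raw = fit_predict(X_train, y_train, X_test)

# Route 2: standardize the target with training statistics, fit, then undo the scaling.
mu, sigma = y_train.mean(), y_train.std()
pred_std = fit_predict(X_train, (y_train - mu) / sigma, X_test) * sigma + mu

print(np.allclose(pred_raw, pred_std))
```

Target scaling becomes more relevant for models that are sensitive to the scale of the target, such as regularized or gradient-based regressors; for plain least squares it mainly keeps the workflow uniform.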

4.3 Plot the output

plt.scatter(y_train, y_train_pred, c='steelblue', edgecolor='white', s=70, label='fitted')
plt.plot(y_train, y_train, c='red', lw=2, label='ideal')
plt.xlabel('actual crew')
plt.ylabel('predicted crew')
plt.legend()
plt.show()
Figure 4. Ideal and fitted plots for the crew variable using multiple regression analysis.

5. Summary

In summary, we’ve presented a tutorial on simple and multiple regression analysis using NumPy, Pylab, and Scikit-learn. Linear regression is one of the most widely used machine learning algorithms. A thorough understanding of linear regression serves as a good foundation for understanding other machine learning algorithms such as logistic regression, K-nearest neighbors, and support vector machines.
