Benjamin Obi Tayo Ph.D.

Building a Machine Learning Recommendation Model from Scratch

Build a machine learning model for recommending the crew size for cruise ship buyers in Python

A Carnival cruise ship. Image source: https://www.kaleidoscopeadventures.com/product/student-cruises/.

In this tutorial, we build a regression model using the cruise_ship_info.csv dataset for recommending the crew size for potential cruise ship buyers. This tutorial will highlight important data science and machine learning concepts such as:

a) data preprocessing and variable selection

b) basic regression model building

c) hyperparameter tuning

d) model evaluation

e) techniques for dimensionality reduction

The code for building this recommender system can be found on GitHub.

1. Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Read dataset and display columns

df=pd.read_csv("cruise_ship_info.csv")
df.head()

3. Calculate basic summary statistics for the dataset

df.describe()

4. Generate scatter pair plot

cols = ['Age', 'Tonnage', 'passengers', 'length', 'cabins','passenger_density','crew']
# `height` replaces the `size` argument, which was deprecated in seaborn 0.9
sns.pairplot(df[cols], height=2.0)

Observations from part 4:

1) We observe that the variables are on different scales. For example, the Age variable ranges from about 16 years to 48 years, while the Tonnage variable ranges from 2 to 220. When a regression model is built using these variables, it is therefore important to bring them to the same scale, either by standardizing or normalizing the data.

2) We also observe that the target variable ‘crew’ correlates well with 4 predictor variables, namely, ‘Tonnage’, ‘passengers’, ‘length’, and ‘cabins’.
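
The two rescaling options mentioned in observation 1 can be sketched with NumPy alone. The values below are hypothetical toy numbers chosen to mimic the different scales, not rows from cruise_ship_info.csv, and the variable names are illustrative.

```python
import numpy as np

# Hypothetical toy values on very different scales (not taken from the dataset)
age = np.array([16.0, 20.0, 30.0, 48.0])
tonnage = np.array([2.0, 70.0, 110.0, 220.0])
X_toy = np.column_stack([age, tonnage])

# Standardization: subtract the column mean, divide by the column standard deviation
X_standardized = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Min-max normalization: rescale every column to the interval [0, 1]
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_normalized = (X_toy - col_min) / (col_max - col_min)

print(X_standardized.mean(axis=0))   # approximately 0 for each column
print(X_standardized.std(axis=0))    # approximately 1 for each column
print(X_normalized.min(axis=0), X_normalized.max(axis=0))
```

After either transformation, both columns contribute on a comparable numeric scale, which is what the later sections rely on when they use StandardScaler.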

5. Variable selection for predicting “crew” size

5a. Calculation of covariance matrix

cols = ['Age', 'Tonnage', 'passengers', 'length', 'cabins','passenger_density','crew']
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# Standardize all 7 columns so that each has zero mean and unit variance
X_std = stdsc.fit_transform(df[cols].values)
# For standardized data, the covariance matrix is (up to a factor n/(n-1))
# the matrix of correlation coefficients
cov_mat =np.cov(X_std.T)
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cov_mat,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 12},
                 yticklabels=cols,
                 xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients')
plt.tight_layout()
plt.show()
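
The plot title treats covariances of standardized data as correlation coefficients. A minimal sketch of why that holds, using synthetic data (an assumption for illustration, not the cruise ship dataset): the two matrices differ only by the factor n/(n-1), because np.cov divides by n-1 while the standardization here divides by the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 columns; the third depends on the first
data = rng.normal(size=(200, 3))
data[:, 2] = 2.0 * data[:, 0] + 0.5 * data[:, 2]

# Standardize with the population standard deviation (ddof=0)
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

cov_of_standardized = np.cov(data_std.T)   # np.cov uses ddof=1 by default
corr = np.corrcoef(data.T)                 # correlation matrix of the raw data

n = data.shape[0]
# The two matrices agree once the n/(n-1) factor is accounted for
print(np.allclose(cov_of_standardized, corr * n / (n - 1)))
```

For a few hundred rows the factor is close to 1, so the heatmap values can be read as correlation coefficients to two decimal places.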

5b. Selecting predictor and target variables

From the covariance matrix plot above, we see that the “crew” variable correlates strongly with 4 predictor variables: “Tonnage”, “passengers”, “length”, and “cabins”.

cols_selected = ['Tonnage', 'passengers', 'length', 'cabins','crew']
df[cols_selected].head()
X = df[cols_selected].iloc[:,0:4].values    # features matrix 
y = df[cols_selected]['crew'].values        # target variable

6. Data partitioning into training and testing sets

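from sklearn.model_selection import train_test_split
X = df[cols_selected].iloc[:,0:4].values     # features matrix
y = df[cols_selected]['crew'].values         # target as a NumPy array
# Hold out 40% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.4, random_state=0)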
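
To make the effect of test_size and random_state concrete, here is a small sketch on a hypothetical 10-row array (the names X_demo and y_demo are placeholders, not part of the tutorial's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 4 features each
X_demo = np.arange(40).reshape(10, 4)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0)
print(X_tr.shape, X_te.shape)   # 6 training rows and 4 test rows

# The same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0)
print(np.array_equal(y_te, y_te2))
```

Changing random_state produces a different partition, which is what section 8 exploits to check how sensitive the scores are to the split.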

7. Building a multi-regression model

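Our machine learning regression model for predicting a ship’s “crew” size is a multiple linear regression on the four selected predictors. It can be expressed as:

crew = w0 + w1*Tonnage + w2*passengers + w3*length + w4*cabins

where w0 is the intercept and w1 to w4 are the regression coefficients estimated from the training data.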

from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(X_train, y_train)
y_train_pred = slr.predict(X_train)
y_test_pred = slr.predict(X_test)
# Residual plot: predicted values against (predicted - actual)
plt.scatter(y_train_pred,  y_train_pred - y_train,
            c='steelblue', marker='o', edgecolor='white',
            label='Training data')
plt.scatter(y_test_pred,  y_test_pred - y_test,
            c='limegreen', marker='s', edgecolor='white',
            label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
# Reference line at zero residual
plt.hlines(y=0, xmin=-10, xmax=50, color='black', lw=2)
plt.xlim([-10, 50])
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()

7a. Evaluation of regression model

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))
MSE train: 0.955, test: 0.889
R^2 train: 0.920, test: 0.928
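
For readers who want to see what the two metrics compute, the following sketch evaluates MSE and R^2 directly from their definitions on four made-up values (hypothetical numbers, unrelated to the results above).

```python
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print('MSE: %.3f, R^2: %.3f' % (mse, r2))   # MSE: 0.375, R^2: 0.925
```

An R^2 close to 1 means the residual sum of squares is small relative to the spread of the target around its mean.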

7b. Regression coefficients

slr.intercept_
-0.7525074496158393
slr.coef_
array([ 0.01902703, -0.15001099,  0.37876395,  0.77613801])
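
The intercept and coefficients are all that is needed to reproduce a prediction by hand. The sketch below checks this on synthetic data; the data, the weights in true_w, and the offset 4.0 are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 4 features, known linear relationship plus noise
X_demo = rng.normal(size=(50, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# A prediction is the intercept plus the weighted sum of the features
manual = X_demo @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X_demo)))
print(model.coef_, model.intercept_)
```

Because the features in section 7 are not standardized, the magnitudes of the fitted coefficients there reflect the units of each variable as well as its importance.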

8. Feature Standardization, Cross-Validation, and Hyperparameter Tuning

from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
X = df[cols_selected].iloc[:,0:4].values
y = df[cols_selected]['crew'].values    # NumPy array, so y_train[:, np.newaxis] works
sc_y = StandardScaler()
sc_x = StandardScaler()
train_score = []
test_score = []
# Repeat the train/test split with 10 different random seeds
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.4, random_state=i)
    # Standardize the target using statistics from the training set only
    y_train_std = sc_y.fit_transform(y_train[:, np.newaxis]).flatten()
    pipe_lr = Pipeline([('scl', StandardScaler()),('pca', PCA(n_components=4)),('slr', LinearRegression())])
    pipe_lr.fit(X_train, y_train_std)
    y_train_pred_std=pipe_lr.predict(X_train)
    y_test_pred_std=pipe_lr.predict(X_test)
    # inverse_transform expects a 2D array, so add an axis and flatten the result
    y_train_pred=sc_y.inverse_transform(y_train_pred_std[:, np.newaxis]).flatten()
    y_test_pred=sc_y.inverse_transform(y_test_pred_std[:, np.newaxis]).flatten()
    train_score = np.append(train_score, r2_score(y_train, y_train_pred))
    test_score = np.append(test_score, r2_score(y_test, y_test_pred))
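
The loop above estimates variability by repeating a random 60/40 split. An alternative, sketched here under the assumption that k-fold cross-validation is acceptable for the task, is scikit-learn's cross_val_score. The data is synthetic and the PCA step is left out for brevity; neither choice comes from the original tutorial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 120 samples, 4 features, linear target plus noise
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(scale=0.3, size=120)

# The scaler is refit inside every fold, so no test-fold statistics leak into training
pipe = Pipeline([('scl', StandardScaler()), ('slr', LinearRegression())])
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='r2')

# Summarize with mean and standard deviation across folds
print('R2: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```

The same mean and standard deviation summary can be applied to the train_score and test_score arrays collected in the loop above.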

9. Techniques of Dimensionality Reduction

9a. Principal Component Analysis (PCA)

train_score = []
test_score = []
cum_variance = []
# The split uses a fixed seed, so it only needs to be computed once
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.4, random_state=0)
y_train_std = sc_y.fit_transform(y_train[:, np.newaxis]).flatten()
# Fit the pipeline with 1, 2, 3 and 4 principal components
for i in range(1,5):
    pipe_lr = Pipeline([('scl', StandardScaler()),('pca', PCA(n_components=i)),('slr', LinearRegression())])
    pipe_lr.fit(X_train, y_train_std)
    y_train_pred_std=pipe_lr.predict(X_train)
    y_test_pred_std=pipe_lr.predict(X_test)
    # inverse_transform expects a 2D array, so add an axis and flatten the result
    y_train_pred=sc_y.inverse_transform(y_train_pred_std[:, np.newaxis]).flatten()
    y_test_pred=sc_y.inverse_transform(y_test_pred_std[:, np.newaxis]).flatten()
    train_score = np.append(train_score, r2_score(y_train, y_train_pred))
    test_score = np.append(test_score, r2_score(y_test, y_test_pred))
    # Fraction of the total variance captured by the first i components
    cum_variance = np.append(cum_variance, np.sum(pipe_lr.named_steps['pca'].explained_variance_ratio_))
plt.scatter(cum_variance,train_score, label = 'train_score')
plt.plot(cum_variance, train_score)
plt.scatter(cum_variance,test_score, label = 'test_score')
plt.plot(cum_variance, test_score)
plt.xlabel('cumulative variance')
plt.ylabel('R2_score')
plt.legend()
plt.show()

Observations from part 9a:

We observe that by increasing the number of principal components from 1 to 4, the train and test scores improve. This is because, with fewer components, the model has a high bias error, since it is overly simplified. As we increase the number of principal components, the bias error decreases, but the complexity of the model increases.
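
The cumulative variance on the x-axis of the plot comes from PCA's explained_variance_ratio_. A minimal sketch on synthetic, strongly correlated columns (an assumption loosely mimicking the correlated predictors, not the actual dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: four noisy copies of one shared factor, hence highly correlated
base = rng.normal(size=(200, 1))
X_demo = np.hstack([base + 0.3 * rng.normal(size=(200, 1)) for _ in range(4)])

X_demo_std = StandardScaler().fit_transform(X_demo)
pca = PCA(n_components=4).fit(X_demo_std)

# Running total of the variance fraction explained by the first k components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var)
```

When the predictors are this correlated, the first component already carries most of the variance, and the remaining components add progressively smaller amounts until the total reaches 1.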

9b. Regularized Regression: Lasso

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.4, random_state=0)
y_train_std = sc_y.fit_transform(y_train[:, np.newaxis]).flatten()
X_train_std = sc_x.fit_transform(X_train)
X_test_std = sc_x.transform(X_test)
alpha = np.linspace(0.01,0.4,10) #lasso parameters
r2_train=[]
r2_test=[]
norm = []
for i in range(10):
    lasso = Lasso(alpha=alpha[i])
    lasso.fit(X_train_std,y_train_std)
    # Store predictions under new names so the training target is not overwritten
    y_train_pred_std=lasso.predict(X_train_std)
    y_test_pred_std=lasso.predict(X_test_std)
    # inverse_transform expects a 2D array, so add an axis and flatten the result
    y_train_pred=sc_y.inverse_transform(y_train_pred_std[:, np.newaxis]).flatten()
    y_test_pred=sc_y.inverse_transform(y_test_pred_std[:, np.newaxis]).flatten()
    r2_train=np.append(r2_train,r2_score(y_train,y_train_pred))
    r2_test=np.append(r2_test,r2_score(y_test,y_test_pred))
    # Euclidean (L2) norm of the coefficient vector
    norm= np.append(norm,np.linalg.norm(lasso.coef_))
plt.scatter(alpha,r2_train,label='r2_train')
plt.plot(alpha,r2_train)
plt.scatter(alpha,r2_test,label='r2_test')
plt.plot(alpha,r2_test)
plt.scatter(alpha,norm,label = 'norm')
plt.plot(alpha,norm)
plt.ylim(-0.1,1)
plt.xlim(0,.43)
plt.xlabel('alpha')
plt.ylabel('R2_score / norm')
plt.legend()
plt.show()

Observations from part 9b:

We observe that as the regularization parameter alpha increases, the norm of the regression coefficients becomes smaller and smaller. This means more regression coefficients are forced to zero, which tends to increase bias error (oversimplification). The best balance in the bias-variance tradeoff is obtained when alpha is kept low, say alpha = 0.1 or less.
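
The shrinking effect can be checked numerically by counting how many coefficients Lasso sets exactly to zero as alpha grows. The sketch below uses synthetic standardized data with made-up weights (0.8, 0.3, 0.1, 0.0), which are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical standardized features and target
X_demo = StandardScaler().fit_transform(rng.normal(size=(200, 4)))
y_raw = X_demo @ np.array([0.8, 0.3, 0.1, 0.0]) + rng.normal(scale=0.2, size=200)
y_demo = (y_raw - y_raw.mean()) / y_raw.std()

alphas = np.linspace(0.01, 0.4, 10)
norms = []
n_zero = []
for a in alphas:
    lasso = Lasso(alpha=a).fit(X_demo, y_demo)
    norms.append(np.linalg.norm(lasso.coef_))      # size of the coefficient vector
    n_zero.append(int(np.sum(lasso.coef_ == 0)))   # coefficients removed by the penalty

for a, nrm, nz in zip(alphas, norms, n_zero):
    print('alpha=%.2f  norm=%.3f  zero coefficients=%d' % (a, nrm, nz))
```

On this toy problem the weaker predictors drop out first, so a large alpha leaves a model that depends on a single feature, which is the oversimplification described above.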

Summary

In summary, we’ve shown how a simple regression model can be built using the cruise_ship_info.csv dataset for predicting the crew size for potential ship buyers. The code for this recommendation system can be found on GitHub.

References

  1. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
  2. Benjamin O. Tayo, Machine Learning Model for Predicting a Ships Crew Size, https://github.com/bot13956/ML_Model_for_Predicting_Ships_Crew_Size.