Benjamin Obi Tayo Ph.D.

Summary

The web content discusses the development and evaluation of a simple linear regression model using gradient descent to predict housing prices, emphasizing the importance of hyperparameter tuning for model performance.

Abstract

The article delves into the creation of a linear regression model for predicting continuous variables, specifically housing prices, using the Housing dataset from the UCI machine learning repository. It outlines the implementation of a gradient descent optimizer in Python, detailing the class structure, methods for fitting the model, predicting outcomes, and calculating the R-square value. The author demonstrates feature selection, standardization, and the impact of different learning rates on the model's performance, highlighting the significance of choosing the right hyperparameters to improve the model's predictive accuracy. The article concludes with a discussion on the necessity of understanding the model's intricacies to avoid a black-box approach and ensure a reliable regression analysis.

Opinions

  • The author advocates for a thorough understanding of machine learning models, including knowledge of hyperparameters, to achieve optimal performance.
  • It is suggested that blindly using machine learning models without comprehending their underlying mechanisms can lead to suboptimal and potentially misleading results.
  • The article emphasizes the role of hyperparameter tuning, particularly the learning rate (eta), in the success of a regression model.
  • The author implies that a balance must be struck when selecting the learning rate; values that are too high or too low can negatively affect the model's predictive power.
  • The use of R-square values is presented as a key metric for evaluating the goodness of fit for the regression model.
  • Visual representations, such as scatter plots and heatmaps, are used to illustrate feature correlations and model fits, indicating the author's preference for visual aids in data analysis.
  • The article encourages the use of standardization techniques to prepare data for regression analysis, ensuring that all features contribute equally to the result.

Scrutinizing Machine Learning Regression Models | Towards AI

Bad and Good Regression Analysis

Regression models are among the most widely used machine learning models. They predict target variables on a continuous scale and find applications in almost every field of study. This article discusses good and bad practices in building a regression model.

We will build a simple linear regression model and use it to predict house prices from the Housing dataset. The model makes no distinction between inliers and outliers, which can be handled using more robust regularized regression models such as Lasso regression. We then use the model's output to highlight good and bad practices in regression analysis.

More information about the Housing dataset can be found in the UCI machine learning repository. A Jupyter notebook containing all the code can be found on GitHub.

Linear Regression Estimator Using Gradient Descent

In a previous article, we discussed how a simple linear regression model can be built to predict a continuous outcome variable (y) from a one-dimensional dataset containing a single feature (X): https://readmedium.com/machine-leaning-python-linear-regression-estimator-using-gradient-descent-b0b2c496e463

Implementing a Simple Linear Regression Estimator in Python

import numpy as np  # needed by fit() and Rsquare()


class GradientDescent(object):
    """Gradient descent optimizer.
    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.
        
    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    errors_ : list
        Error in every epoch.
        
    Methods
    -----------
    fit(X,y): fit the linear regression model using the data.
    predict(X): Predict outcome for samples in X.
    Rsquare(X,y): Returns the R^2 value.
    """
    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter
        
    def fit(self, X, y):
        """Fit the data.
        
        Parameters
        ----------
        X : array-like, shape = [n_points]
            Independent variable or predictor.
        y : array-like, shape = [n_points]
            Outcome of prediction.

        Returns
        -------
        self : object
        """
        self.w_ = np.zeros(2)
        self.errors_ = []
        
        for i in range(self.n_iter):
            errors = 0
            for j in range(X.shape[0]):
                self.w_[1:] += self.eta*X[j]*(y[j] - self.w_[0] - self.w_[1]*X[j])
                self.w_[0] += self.eta*(y[j] - self.w_[0] - self.w_[1]*X[j])
                errors += 0.5*(y[j] - self.w_[0] - self.w_[1]*X[j])**2
            self.errors_.append(errors)
        return self
    def predict(self, X):
        """Return predicted y values"""
        return self.w_[0] + self.w_[1]*X 
    
    def Rsquare(self, X,y):
        """Return the Rsquare value"""
        y_hat = self.predict(X)
        return  1-((y_hat - y)**2).sum()/((y-np.mean(y))**2).sum()
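
Written out as equations, the per-sample updates in fit() and the score in Rsquare() look as follows. This is a transcription of the code above rather than an independent derivation; the symbols x_j, y_j, and \bar{y} (the mean of y) are notation introduced here. Note that the w_0 update uses the already-updated w_1, and the error E is accumulated after both updates.

```latex
\begin{aligned}
\hat{y}_j &= w_0 + w_1 x_j \\
w_1 &\leftarrow w_1 + \eta \, x_j \, (y_j - w_0 - w_1 x_j) \\
w_0 &\leftarrow w_0 + \eta \, (y_j - w_0 - w_1 x_j) \\
E &= \tfrac{1}{2} \sum_j (y_j - w_0 - w_1 x_j)^2 \\
R^2 &= 1 - \frac{\sum_j (\hat{y}_j - y_j)^2}{\sum_j (y_j - \bar{y})^2}
\end{aligned}
```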

Application of Python Estimator: Predicting Housing Prices

a) Import Necessary Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
np.set_printoptions(precision=4)

b) Exploring the Housing Dataset

df = pd.read_csv('https://raw.githubusercontent.com/rasbt/'
                 'python-machine-learning-book-2nd-edition'
                 '/master/code/ch10/housing.data.txt',
                 header=None,
                 sep=r'\s+')  # raw string avoids an invalid-escape warning
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 
              'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df.head()

c) Feature Selection and Standardization

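from sklearn.preprocessing import StandardScaler

cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
stdsc = StandardScaler()
# Standardize the five selected columns (zero mean, unit variance)
X_std = stdsc.fit_transform(df[cols].values)
# Covariance matrix of the standardized features
# (for standardized data this is essentially the correlation matrix)
cov_mat = np.cov(X_std.T)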
hm = sns.heatmap(cov_mat,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 15},
                 yticklabels=cols,
                 xticklabels=cols)
plt.tight_layout()
# plt.savefig('images/10_04.png', dpi=300)  # uncomment if an images/ directory exists
plt.show()
Covariance matrix showing correlations between features.
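
Because the features are standardized first, the covariance matrix in the heatmap is essentially a correlation matrix. The following is a minimal sketch on synthetic data, not the Housing data. It assumes standardization with the population standard deviation (the StandardScaler default) and np.cov's default of dividing by n - 1, so the two matrices differ only by a factor of n/(n-1). All variable names and the synthetic features are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 506  # number of synthetic rows (chosen to match the size of the Housing data)

# Three correlated synthetic features on very different scales
raw = rng.normal(size=(n, 3))
raw[:, 1] += 0.8 * raw[:, 0]
raw[:, 2] -= 0.5 * raw[:, 0]
raw = raw * np.array([1.0, 10.0, 100.0]) + np.array([5.0, -3.0, 20.0])

# Standardize by hand: zero mean, unit (population) standard deviation
X_std = (raw - raw.mean(axis=0)) / raw.std(axis=0)

cov_mat = np.cov(X_std.T)      # divides by n - 1 by default
corr_mat = np.corrcoef(raw.T)  # Pearson correlations of the raw features

# The covariance of standardized data equals the correlation times n / (n - 1)
print(np.allclose(cov_mat, corr_mat * n / (n - 1)))
```

With a few hundred rows the factor n/(n-1) is close to 1, which is why the diagonal of such a heatmap displays as 1.00 at two decimals.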

Since we are interested in predicting MEDV (median value of the homes), we look at how the other features correlate with it. The strongest positive correlation is with RM (average number of rooms per dwelling); LSTAT is also strongly correlated with MEDV, but negatively. In our model, we use RM as the predictor variable and MEDV as the target variable:

X=X_std[:,3] # we use RM as our predictor variable
y=X_std[:,4] # we use MEDV as our target variable

d) Calculate R-square Values for Different Learning Rates

[GradientDescent(eta=k, n_iter=100).fit(X,y).Rsquare(X,y) for k in [0.1,0.01,0.001,0.0001,0.00001,0.000001]]

We obtained the following output:

[0.297,0.312,0.482,0.483,0.308,0.047]
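
A useful reference point for these numbers is the closed-form least-squares fit. For a single standardized predictor and a standardized target, the least-squares slope equals the Pearson correlation r, the intercept is zero, and R-square equals r squared, so no learning rate can score higher on the training data. The sketch below checks this on synthetic data; the data, seed, and variable names are assumptions for illustration and are not taken from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=506)
y = 0.7 * x + rng.normal(scale=0.7, size=506)

# Standardize both variables (zero mean, unit population standard deviation)
X_s = (x - x.mean()) / x.std()
y_s = (y - y.mean()) / y.std()

# Closed-form least-squares line; polyfit returns [slope, intercept] for deg=1
slope, intercept = np.polyfit(X_s, y_s, deg=1)

# Pearson correlation between predictor and target
r = np.corrcoef(X_s, y_s)[0, 1]

# R-square of the least-squares fit, same formula as Rsquare() in the class
y_hat = intercept + slope * X_s
r2_ols = 1 - ((y_hat - y_s) ** 2).sum() / ((y_s - y_s.mean()) ** 2).sum()

print(f"slope={slope:.4f}  r={r:.4f}  R2={r2_ols:.4f}  r^2={r ** 2:.4f}")
```

Read this way, the best value reported above, 0.483, would correspond to a correlation of about 0.695 between RM and MEDV, provided gradient descent has reached the least-squares optimum for that learning rate.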

e) Fit, Predict, and Hyperparameter Tuning

np.set_printoptions(precision=1)
# plot with various axes scales
plt.figure(figsize=(10,8))
# fig 1
plt.subplot(231)
plt.scatter(X,y,c='steelblue', edgecolor='white', s=70,label='data')
plt.plot(X, GradientDescent(eta=0.1, n_iter=100).fit(X,y).predict(X),color='black', lw=2,label='fit')
plt.title(r'$\eta = 10^{-1}, R^2 = 0.297$ ',size=14)
plt.grid(False)
plt.xlabel('RM (std)',size=14)
plt.ylabel('MEDV (std)',size=14)
plt.legend()
# fig 2
plt.subplot(232)
plt.scatter(X,y,c='steelblue', edgecolor='white', s=70)
plt.plot(X, GradientDescent(eta=0.01, n_iter=100).fit(X,y).predict(X),color='black', lw=2)
plt.title(r'$\eta = 10^{-2},R^2 = 0.312$',size=14)
plt.grid(False)
plt.xlabel('RM (std)',size=14)
plt.ylabel('MEDV (std)',size=14)
# fig 3
plt.subplot(233)
plt.scatter(X,y,c='steelblue', edgecolor='white', s=70)
plt.plot(X,GradientDescent(eta=0.001, n_iter=100).fit(X,y).predict(X),color='black', lw=2)
plt.title(r'$\eta =10^{-3},R^2 = 0.482$',size=14)
plt.grid(False)
plt.xlabel('RM (std)',size=14)
plt.ylabel('MEDV (std)',size=14)
# fig 4
plt.subplot(234)
plt.scatter(X,y,c='steelblue', edgecolor='white', s=70)
plt.plot(X, GradientDescent(eta=0.0001, n_iter=100).fit(X,y).predict(X),color='black', lw=2)
plt.title(r'$\eta = 10^{-4}, R^2 = 0.483$ ',size=14)
plt.grid(False)
plt.xlabel('RM (std)',size=14)
plt.ylabel('MEDV (std)',size=14)
# fig 5
plt.subplot(235)
plt.scatter(X,y,c='steelblue', edgecolor='white', s=70)
plt.plot(X, GradientDescent(eta=0.00001, n_iter=100).fit(X,y).predict(X),color='black', lw=2)
plt.title(r'$\eta = 10^{-5},R^2 = 0.308$',size=14)
plt.grid(False)
plt.xlabel('RM (std)',size=14)
plt.ylabel('MEDV (std)',size=14)
# fig 6
plt.subplot(236)
plt.scatter(X,y,c='steelblue', edgecolor='white', s=70)
plt.plot(X,GradientDescent(eta=0.000001, n_iter=100).fit(X,y).predict(X),color='black', lw=2)
plt.title(r'$\eta =10^{-6},R^2 = 0.047$',size=14)
plt.grid(False)
plt.xlabel('RM (std)',size=14)
plt.ylabel('MEDV (std)',size=14)
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.4, wspace=0.35)
plt.show()

Here is the output:

Regression analysis using different values of the learning rate parameter.

General Remarks and Conclusion

Using our simple regression model, we can see that the reliability of the model depends on hyperparameter tuning. With n_iter fixed at 100, picking an arbitrary value for the learning rate, such as eta = 0.1, leads to a poor model (R-square = 0.297). Choosing a value that is too small, such as eta = 0.00001, also produces a poor model (R-square = 0.308). Of the values tested, eta = 0.0001 gives the best fit (R-square = 0.483), with eta = 0.001 a close second (R-square = 0.482).
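
One way to investigate why a very small learning rate scores poorly is to inspect the per-epoch error that the class stores in errors_. The sketch below is a compact, hypothetical re-implementation of the same per-sample update, run on synthetic data. The function names sgd_fit and r_square, the seed, and the synthetic slope are assumptions and not part of the original notebook. If the error is still falling in the last epoch, the fit has probably not converged within n_iter passes, which suggests that the learning rate and n_iter are best tuned together.

```python
import numpy as np


def sgd_fit(X, y, eta, n_iter):
    """Per-sample updates mirroring fit(); returns weights and per-epoch errors."""
    w0, w1 = 0.0, 0.0
    errors = []
    for _ in range(n_iter):
        total = 0.0
        for xj, yj in zip(X, y):
            w1 += eta * xj * (yj - w0 - w1 * xj)
            w0 += eta * (yj - w0 - w1 * xj)
            total += 0.5 * (yj - w0 - w1 * xj) ** 2
        errors.append(total)
    return w0, w1, errors


def r_square(X, y, w0, w1):
    """Same R-square formula as the Rsquare() method."""
    y_hat = w0 + w1 * X
    return 1 - ((y_hat - y) ** 2).sum() / ((y - y.mean()) ** 2).sum()


# Synthetic, standardized data (an assumption; not the Housing data)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
t = 0.7 * x + rng.normal(scale=0.7, size=500)
X = (x - x.mean()) / x.std()
y = (t - t.mean()) / t.std()

# Least-squares ceiling for one standardized predictor: R-square = r squared
ceiling = np.corrcoef(X, y)[0, 1] ** 2

results = {}
for eta in [0.1, 0.001, 0.000001]:
    w0, w1, errors = sgd_fit(X, y, eta, n_iter=100)
    results[eta] = (r_square(X, y, w0, w1), errors[0], errors[-1])
    print(f"eta={eta:g}  R2={results[eta][0]:.3f}  "
          f"first-epoch error={errors[0]:.1f}  last-epoch error={errors[-1]:.1f}")
print(f"least-squares ceiling: {ceiling:.3f}")
```

On data like this, the smallest learning rate should show an error that is still decreasing after 100 epochs and an R-square well below the ceiling. In that situation, raising n_iter is an alternative to raising eta.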

The difference between a good and a bad regression analysis lies in how well one understands the details of the model, including its hyperparameters and how they can be tuned to obtain the best performance. Using any machine learning model as a black box, without understanding its intricacies, can lead to an unreliable and misleading model.

References:

1. “Machine Learning: Python Linear Regression Estimator Using Gradient Descent”, Benjamin O. Tayo (https://readmedium.com/machine-leaning-python-linear-regression-estimator-using-gradient-descent-b0b2c496e463).

2. “Python Machine Learning”, 2nd Edition, Sebastian Raschka.

3. UCI machine learning repository: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/.

4. Jupyter notebook containing all the code used in this article: https://github.com/bot13956/python-linear-regression-estimator.
