Renu Khandelwal


Linear Regression

This article is about Linear regression and the different measures that determine the goodness of fit

Code for the article can be found at https://github.com/arshren/MachineLearning/blob/master/Python%20-%20Linear%20Regression%20-%20Predicting%20Temperature.ipynb

Let’s say you are concerned about climate change and want to study the weather to learn which parameters have an impact on temperature.

We can use humidity, air pressure, and wind speed to predict the temperature on a given day.

We will use linear regression here.

Linear regression is a simple yet powerful way to model a linear relationship between a scalar dependent variable and one or more independent variables.

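The linear regression equation is

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where y is the dependent variable, x1 to xn are the independent variables, b0 is the intercept, and b1 to bn are the coefficients.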
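As a small illustration, a linear regression prediction is just the intercept plus a weighted sum of the inputs. The sketch below uses made-up coefficients and inputs (none of these numbers come from the weather data), and the helper name predict_linear is hypothetical.

```python
import numpy as np

def predict_linear(b0, coefficients, x):
    """Return b0 + b1*x1 + ... + bn*xn for one observation."""
    return b0 + float(np.dot(coefficients, x))

# Hypothetical model with two independent variables.
b0 = 1.0
coefficients = np.array([2.0, -0.5])   # b1, b2
x = np.array([3.0, 4.0])               # x1, x2

y_hat = predict_linear(b0, coefficients, x)
print(y_hat)
```

Fitting a regression means choosing b0 and the coefficients so that these predictions are as close as possible to the observed values.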

For our example, we will use the weather dataset at Kaggle: https://www.kaggle.com/budincsevity/szeged-weather

First we will import the required libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Let’s read the data. I have downloaded it into the folder D:\Machine Learning - Full\Blogs dataset and renamed the file WeatherHist.csv.

# Raw string so the backslashes in the Windows path are not treated as escape sequences
weather_data = pd.read_csv(r"D:\Machine Learning - Full\Blogs dataset\weatherHist.csv")

Now we want to know what the different columns are

weather_data.head(3)
weather data: first 3 rows with all the columns in the dataset

Let’s explore the categorical variables in the dataset

weather_data.describe(include=['O'])
categorical data in weather data

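We want to predict the temperature, so let’s find out the correlation between the different variables in the dataset

# numeric_only=True skips text columns (required on pandas 2.0+, available from pandas 1.5)
weather_data.corr(numeric_only=True)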
correlation between different variables in the dataset

The correlation coefficient varies between -1 and +1. A value of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 means that no linear relationship exists between the variables.
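To make the three reference values concrete, here is a minimal sketch using np.corrcoef on toy arrays (the data are invented for illustration and are unrelated to the weather dataset).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises exactly in step with x
perfect_positive = np.corrcoef(x, 2 * x + 1)[0, 1]
# y falls exactly in step with x
perfect_negative = np.corrcoef(x, -3 * x + 10)[0, 1]
# A symmetric U-shape depends on x, but not linearly
no_linear = np.corrcoef(x, (x - 3) ** 2)[0, 1]

print(perfect_positive, perfect_negative, no_linear)
```

The last case is worth noting: a correlation near 0 rules out a linear relationship only, so a scatter plot is still useful before dropping a variable.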

From the table above, we see a strong relationship between temperature and both apparent temperature and humidity, and we may also be able to include visibility.

Let’s take all the relevant attributes into a new dataset and again check the correlation

data_set=weather_data.iloc[:,[0,3,4,5,8]]
# numeric_only=True ignores any non-numeric column in the selection
data_set.corr(numeric_only=True)
correlation of selected attributes

Let’s now visualize the relationship between temperature and the independent variables

Plotting a scatter plot between temperature and humidity

sns.regplot(x=data_set["Temperature (C)"], y=data_set["Humidity"])
scatter plot between Temperature and Humidity

There is a negative correlation between Humidity and Temperature, and we also see a few outliers.

Let’s try and find the outliers so that we can remove them.

We learnt in inferential statistics that outliers greatly impact linear regression.

Below we have written a function that helps identify outliers in the Humidity variable in our data set.

It flags a value as an outlier when its z-score is greater than 3 in absolute value, that is, when the value lies more than 3 standard deviations from the mean.

import numpy as np

def detect_outlier(data_1):
    # Start with a fresh list on every call so results do not accumulate
    outliers = []
    threshold = 3
    mean_1 = np.mean(data_1)
    std_1 = np.std(data_1)

    for y in data_1:
        # z-score: distance from the mean in units of standard deviation
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > threshold:
            outliers.append(y)
    return outliers

outlier_data = detect_outlier(data_set["Humidity"])
print(outlier_data)
Outlier values in the Humidity independent variable, more than 3 standard deviations from the mean

We will remove these rows from the dataset to have a clean dataset for regression.

To do so, we keep only the rows of data_set where Humidity is greater than 0.15 and store them in a new data set, data_set_clean

data_set_clean = data_set[data_set["Humidity"]>0.15]
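The detection and removal steps can also be combined into one vectorized step that avoids a hard-coded cut-off. This is a sketch on toy numbers, not the article's data; the helper name zscore_mask is hypothetical, and it uses np.std with its default population formula, the same as the function above.

```python
import numpy as np

def zscore_mask(values, threshold=3.0):
    """Return a boolean array that is True where |z-score| <= threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) <= threshold

# Toy humidity-like data: 50 typical readings plus one extreme value.
humidity = np.array([0.7] * 25 + [0.8] * 25 + [0.0])

keep = zscore_mask(humidity)
clean = humidity[keep]
print(len(humidity), len(clean))
```

With a pandas DataFrame, a boolean mask of the same kind can be used to select rows, in the same way as the Humidity > 0.15 filter.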

Let’s plot Temperature against Humidity again to check whether any outliers remain

sns.regplot(x=data_set_clean["Temperature (C)"], y=data_set_clean["Humidity"])
Scatter plot between temperature and humidity after removing outliers

Now let’s draw a scatter plot between temperature and apparent temperature

sns.regplot(x=data_set["Temperature (C)"], y=data_set["Apparent Temperature (C)"])
scatter plot between Temperature and Apparent Temperature

There is a strong positive correlation between temperature and apparent temperature, which is what we would expect.

We now draw a scatter plot between temperature and visibility.

sns.regplot(x=data_set["Temperature (C)"], y=data_set["Visibility (km)"])

We don’t see a strong relationship between temperature and visibility, so we can drop visibility.

Here X holds our independent variables, Apparent Temperature and Humidity.

y is our dependent variable, temperature, which the model first learns from and which we then predict.

y= data_set_clean.iloc[:,[1]]
X= data_set_clean.iloc[:,[2,3]]

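Printing one row of X to see our independent variables

X.head(1)

Splitting the dataset into the training set and test set with an 80:20 ratio

# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split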
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
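Conceptually, an 80:20 split shuffles the row indices and cuts them into two parts. The numpy-only sketch below shows that idea on toy arrays. It is not scikit-learn's implementation, the helper name shuffle_split is hypothetical, and it will not reproduce the exact rows that train_test_split selects for a given random_state.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)   # 10 rows, 2 features
y_demo = np.arange(10)                  # row i has target i

X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

The important property is that each feature row stays paired with its own target and that no row appears in both parts.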

We now use the sklearn linear_model module to fit our training data for multiple linear regression.

We import the LinearRegression class from sklearn.linear_model, create a regressor object, and fit it to the training data

from sklearn.linear_model import LinearRegression
regressor =LinearRegression()
regressor.fit(X_train, y_train)
Fitting training data

Let’s print the fitted attributes of the regressor and understand what they mean

regressor.coef_
Coefficients of our independent variables

Remember the linear regression equation, here with two independent variables:

y = b0 + b1*x1 + b2*x2
  • 0.857 is b1, where x1 is apparent temperature
  • -2.648 is b2, where x2 is humidity

Let’s find the intercept b0

regressor.intercept_
Intercept for the Linear Regression

Let’s now write down the linear regression equation for predicting temperature based on the training data

temperature = 4.58 + (apparent temperature * 0.857) + (-2.648 * humidity)
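As a quick check of how the equation is used, the sketch below plugs one observation into it by hand. The coefficients are the rounded values quoted above; the input values (apparent temperature 10, humidity 0.5) are hypothetical, and the function name predict_temperature is only for illustration.

```python
def predict_temperature(apparent_temperature, humidity):
    """Apply the fitted equation with the rounded coefficients quoted above."""
    return 4.58 + 0.857 * apparent_temperature + (-2.648) * humidity

# Hypothetical observation: apparent temperature 10 C, humidity 0.5.
estimate = predict_temperature(10.0, 0.5)
print(round(estimate, 3))
```

Because the coefficients are rounded, results from this function will differ slightly from what regressor.predict returns.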

We can now predict the temperature for our test dataset

y_pred = regressor.predict(X_test)

How do we measure the fitness of our model?

A good fitting model is one where the differences between the actual (observed) values and the values predicted by the model are small and unbiased.

So if a statistic tells us that the differences between the actual and predicted values are small, we know that the model we built is a good one.

A few statistical tools help us here, such as the coefficient of determination, also called R².

What is r-square?

r² measures how well the actual outcomes are replicated by the model or the regression line. It is the proportion of the total variation in the dependent variable that is explained by the model.

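r-square = variation explained by the model / total variation

For a least-squares model with an intercept, R² on the data used for fitting is between 0 and 1, or between 0% and 100%. On new data it can be negative when the model predicts worse than simply using the mean.

A value of 1 means that the model explains all the variation in the dependent variable around its mean.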

Sum of squared errors (SSE), or residual sum of squares: how far each predicted value is from the actual value

SSE = sum over all observations of (actual value - predicted value)²

Sum of squares total (SST): how far each actual value is from the mean value

SST = sum over all observations of (actual value - mean value)²

Sum of squares regression (SSR): how far each predicted value is from the mean value

SSR = sum over all observations of (predicted value - mean value)²

r-square = SSR / SST = 1 - (SSE / SST)

If the error in prediction is low then SSE will be low and r-square will be close to 1.
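The three sums of squares can be computed directly. The sketch below fits a straight line to toy data with np.polyfit (the numbers are illustrative only) and shows that, for a least-squares fit with an intercept, SSE + SSR equals SST, so both ways of writing r-square agree.

```python
import numpy as np

# Toy data: a noisy straight line (values are illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Fit y = b0 + b1*x by least squares; polyfit returns the slope first.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)         # unexplained variation
sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation

r2 = 1 - sse / sst
print(r2, ssr / sst)
```

The equality SSE + SSR = SST relies on the model having an intercept and on the sums being computed on the data used for fitting.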

A note of caution here: when we add more independent variables, R² never decreases. It keeps increasing (or at best stays the same) as more independent variables are added, even when they have no significant impact on the predictions. This does not help us build a good model.

To overcome this issue, we use Adjusted R². Adjusted R² penalizes the model for every independent variable that is added, so it rises only when a new variable improves the fit more than would be expected by chance.
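Adjusted R² is commonly computed as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of independent variables. A minimal sketch with hypothetical numbers (the function name adjusted_r2 and the inputs are illustrative, not results from the weather model):

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers: the same R^2 with more features gives a lower adjusted value.
few_features = adjusted_r2(0.90, n_samples=50, n_features=2)
many_features = adjusted_r2(0.90, n_samples=50, n_features=10)
print(few_features, many_features)
```

For the model in this article, n would be the number of rows used for scoring and p would be 2 (apparent temperature and humidity).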

regressor.score(X,y)
r-square for the dataset

A value close to 1 for R² means a good fit.

We can also calculate the root mean square error, also referred to as RMSE.

Root Mean Square Error

RMSE shows the typical size of the difference between the predicted and the actual values. Since the differences can be positive or negative, we square each one so that they do not cancel out when added.

Step 1: Find the difference between the predicted and actual value for every observation, square it, and add up the squares

Sum over all observations of (predicted value - actual value)²

Step 2: Divide the sum by the number of observations

Sum over all observations of (predicted value - actual value)² / number of observations

Step 3: Take the square root of the value from step 2

RMSE = square root of [ Sum over all observations of (predicted value - actual value)² / number of observations ]
from sklearn import metrics
import math
print(math.sqrt(metrics.mean_squared_error(y_test, y_pred)))
RMSE — root mean square error
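The three steps can also be followed by hand on a tiny example. The values below are made up for illustration and use only the standard library.

```python
import math

actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Step 1: square each difference and add them up.
squared_sum = sum((p - a) ** 2 for p, a in zip(predicted, actual))
# Step 2: divide by the number of observations.
mean_squared_error = squared_sum / len(actual)
# Step 3: take the square root.
rmse = math.sqrt(mean_squared_error)

print(squared_sum, rmse)
```

RMSE is in the same units as the dependent variable, so for the weather model it reads as a typical error in degrees Celsius.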

Another library that we can use is statsmodels

import statsmodels.api as sm

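In our linear regression equation, x0 for the intercept is always 1. statsmodels does not add this term automatically, so we have to create the variable x0 explicitly. sm.add_constant adds a column of ones (named const) to the features.

X_train_sm = sm.add_constant(X_train)

We now use OLS (ordinary least squares) to find the best-fitting regression line. The model is fitted on the actual training values, not on the predictions.

model = sm.OLS(y_train, X_train_sm).fit()

We then print the summary of the different statistics that help us evaluate our model

model.summary()
Statistics Summary for OLS

Regressing the predicted values y_pred on X_test without the x0 column gives an r-square of .997, slightly above our previous value of .987. That figure is an uncentered r-square, which statsmodels reports when the model has no constant, so it is not comparable to the earlier one. Fitted on the same training data with a constant, OLS and LinearRegression solve the same least-squares problem, so their coefficients and r-square should agree.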
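To make the role of the x0 column concrete, here is a numpy-only sketch of ordinary least squares on toy data. It is not what statsmodels runs internally; it only shows that once a column of ones is placed in front of the features, the first fitted coefficient is the intercept b0. The data are generated from a known equation, so the true coefficients are known in advance.

```python
import numpy as np

# Toy data generated from y = 4 + 2*x1 - 3*x2 with no noise.
features = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0], [5.0, 2.0]])
y = 4 + 2 * features[:, 0] - 3 * features[:, 1]

# Prepend the x0 column of ones so the first coefficient is the intercept b0.
design = np.column_stack([np.ones(len(features)), features])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)

print(np.round(coefficients, 6))
```

Leaving out the column of ones forces the fitted line through the origin, which changes both the coefficients and the r-square that is reported.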

Hope this article helped you to get a good understanding of Linear regression.
