Machine Learning, Programming
Fully Explained Linear Regression with Python
How the regression problem is solved with a real-life example.
Simple linear regression is used for predictive analysis and inference. It involves one independent variable and one dependent variable. Whenever we model a cause-and-effect relationship, we turn to regression analysis, which is a fundamental technique in supervised machine learning. Three important things to notice here:
- We need data to analyze; collecting it for the whole population is a tedious task, so we take a sample for analysis.
- After getting the data, we design a model that generalizes to the whole population.
- After modeling, we can make predictions for the population.
Earlier, we noted that this is cause-and-effect modeling. So where does the "linear" in linear regression come from? Linear means that as the cause increases, the effect increases proportionally, so both change together. We need a mathematical form to make linear predictions, so we use the equation of a straight line.
The equation of a straight line is
y = mx + b
Here,
Y is the dependent (outcome or predicted) variable.
X is the independent variable.
M is the slope, or gradient.
B is the intercept on the y-axis.
Y is a function of X, and the regression model is a linear approximation of that function. For a good prediction, we need to find B and M.
Example:
Suppose we have fitness data of Energy and kilometers covered.
We need to find B and M. The formulas, with n the number of samples, are:

M = (n * ΣXY − ΣX * ΣY) / (n * ΣX² − (ΣX)²)

B = (ΣY − M * ΣX) / n
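These formulas translate directly into plain Python. This is a minimal sketch with hypothetical data (the fitness values from the example are not reproduced here):

```python
def fit_line(xs, ys):
    """Least-squares slope (M) and intercept (B) for paired data."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

# Hypothetical points lying exactly on y = 2x + 1
m, b = fit_line([1, 2, 3], [3, 5, 7])
print(m, b)  # 2.0 1.0
```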
The image shows these values.
After calculating, M comes out to 1.89 and B to 0.667. With these values, we can make predictions using the formula:
Y = 1.89*X + 0.667
We can now plug in some X values to predict the kilometers covered.
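As a sketch, using the fitted values above (the X inputs here are illustrative):

```python
def predict_km(energy):
    """Predict kilometers covered from energy using the fitted line."""
    return 1.89 * energy + 0.667

for x_val in [1, 2, 3]:
    print(x_val, predict_km(x_val))
```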
We have built our model with a simple by-hand technique. Let's check with Python whether we get the same values.
#import all the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
Read the Excel file:
df = pd.read_excel("fitness.xlsx")
Use the describe function to see summary statistics.
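For example, with a small hypothetical DataFrame standing in for the Excel data:

```python
import pandas as pd

# Hypothetical stand-in for the fitness data read from fitness.xlsx
df = pd.DataFrame({'X1': [1, 2, 3, 4], 'Y1': [2.5, 4.5, 6.3, 8.2]})

# describe() reports count, mean, std, min, quartiles, and max per column
print(df.describe())
```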
Split the data into x and y:
y = df['Y1']
x = df['X1']
#plot the scatter plot between them
plt.scatter(x,y)
plt.xlabel('Energy', fontsize =20)
plt.ylabel('Kms Covered', fontsize =20)
plt.show()
We almost get a linear correlation.
Now fit the OLS model on our data.
x_new = sm.add_constant(x)
output = sm.OLS(y, x_new).fit()
output.summary()
#output:
coef
---------------------------------
const 0.667
x1 1.89
The summary gives the same values we computed by hand. Here we used statsmodels, which is an excellent library for statistics and inference.
After fitting the OLS model, let's check the scatter plot with the fitted line.
plt.scatter(x, y)
y_pred = 1.89*x + 0.667
fig = plt.plot(x, y_pred, lw=5, c='red', label='regression line')
plt.xlabel('Energy', fontsize =20)
plt.ylabel('Kms Covered', fontsize =20)
plt.show()
Here is the scatter plot after the fit of the best line.
Conclusion:
OLS gives a simple linear regression approximation, and statsmodels gives wonderful insight into the data's statistics.
I hope you like the article. Reach me on my LinkedIn and Twitter.