The article provides a comprehensive guide to understanding and implementing simple linear regression, using height and weight data as a practical example to illustrate the process.
Abstract
The article "Simplifying Linear Regression — A Beginner’s Guide with a real-world Practical Example" demystifies linear regression by using a classic example of predicting weight from height. It begins by establishing linear regression as a foundational machine learning algorithm, akin to finding the best line through scattered data points. The author then delves into exploring the dataset, ensuring data integrity, and examining data distribution before proceeding to model building. Three approaches to finding the best-fit line are discussed: Ordinary Least Squares (OLS), Gradient Descent, and using the scikit-learn library. Each method is explained with mathematical insights and corresponding Python code snippets. The article concludes with the presentation of the final regression formula derived from the data, a discussion of the assumptions underlying linear regression, and encouragement for readers to continue exploring data science.
Opinions
The author views linear regression as the "ABC of analytics," emphasizing its importance as a fundamental concept in machine learning.
There is a clear preference for hands-on learning, as the author encourages building algorithms from scratch to gain a deeper understanding before using pre-built libraries like scikit-learn.
The article promotes the use of Python for machine learning tasks, highlighting its simplicity and the robustness of its libraries.
The author suggests that the beauty of data analysis lies in simplicity and using the right model for the job, not necessarily complex models.
The importance of understanding and validating the assumptions of linear regression is stressed to ensure the model's reliability and accuracy.
The author encourages continuous learning and exploration in the field of data science, inviting readers to follow their curiosity and stay engaged with the community.
Simplifying Linear Regression — A Beginner’s Guide with a real-world Practical Example
Dive into Data Science with a Practical Height-Weight Prediction Model
Image by the author. ML Basics. Simple Linear Regression Example
Linear regression is one of the most fundamental algorithms in the machine learning universe, and it’s like the ABC of analytics.
Think of it like making the best line through a scatter of stars in the sky, where each star represents a data point.
Today, we’re diving into real-world ML with one of the classic examples of linear dependency — height and weight 👇🏻
Understanding the Basics
Simple linear regression is the most basic ML algorithm. It’s where most of us start our journey into data modeling.
Imagine plotting a graph of height against weight: you’ll likely see a roughly straight-line pattern showing that as height increases, so does weight.
That’s linear regression in a nutshell:
Finding the straight line — mathematically speaking — that best fits our data.
Image by the author. Simple Linear Regression.
So let’s see how we can do this…
#1. Exploring the Data - A Peek Before the Leap
Before diving into the analysis, let’s get friendly with our data.
Picture a table with three columns: gender, height, and weight.
Screenshot of the dataset.
A quick glance with df.info() in Python tells us we have thousands of entries and, importantly, no missing values.
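As a rough sketch of that first peek (the CSV file name here is my assumption, not necessarily the exact one used):

```python
import pandas as pd

# Hypothetical file name for the gender/height/weight dataset
df = pd.read_csv("weight-height.csv")

df.info()   # row count, column types, and non-null counts at a glance
df.head()   # first few rows: gender, height, weight
```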
And what about the distribution of our data?
Imagine two bell-shaped curves, one for height and one for weight, both as symmetrical as a butterfly’s wings — that’s our normal distribution right there.
Image by the author. Normal distribution for both weight and height.
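To eyeball those distributions ourselves, a quick sketch (assuming the columns are named Height and Weight) could look like this:

```python
import matplotlib.pyplot as plt

# Histograms for height and weight: both should look roughly bell-shaped
df[["Height", "Weight"]].hist(bins=30, figsize=(10, 4))
plt.show()

# A scatter plot also previews the linear pattern we are about to model
df.plot.scatter(x="Height", y="Weight", alpha=0.3)
plt.show()
```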
#2. How to Find Our Best-Fit Line?
There are different approaches to obtaining our best-fitting line. Today, most of us would take advantage of scikit-learn and apply its pre-built Linear Regression algorithm.
However, for this very first time let’s try three different approaches:
Two DIY approaches — building the algorithms from scratch all by ourselves.
One final approach taking advantage of scikit-learn.
Approach #1: Ordinary Least Squares (OLS)
The objective of Ordinary Least Squares (OLS) is to determine the optimal coefficients A and B by minimizing the sum of squared prediction errors — the mean squared error (MSE), which, remember, is the cost function of linear regression.
Leveraging calculus, we exploit the properties of partial derivatives to locate the minima of the cost function, where these derivatives equal zero.
By solving those equations, we get an exact closed-form formula for both A and B, providing us with a direct route to the best-fitting linear model.
Image by author. Obtaining OLS closed mathematical functions
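In standard notation (x for height, y for weight, and x̄, ȳ for their means), the resulting closed-form solutions are:

A = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
B = ȳ − A·x̄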
This translates into a few lines of code that compute this closed-form solution, so it is quite straightforward:
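A minimal sketch of that computation (the column names and variable names are my assumptions, not necessarily the author’s exact code):

```python
import numpy as np

# Assumed column names; x is height, y is weight
x = df["Height"].to_numpy()
y = df["Weight"].to_numpy()

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS solution: slope A and intercept B
A = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
B = y_mean - A * x_mean

print(f"A (slope): {A:.2f}, B (intercept): {B:.2f}")
```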
Approach #2: Gradient Descent — The Hike Down the Mountain
Gradient descent is a pivotal optimization algorithm used to minimize the cost function, aiding us in our aim to find the most accurate weight values for our predictive model.
Envision standing atop a hill: your objective is the valley below, which represents our cost function’s minimum point.
To reach it, we begin with initial guesses for our weights, A and B, and iteratively refine these guesses.
The process is akin to descending a hill: with each step, we assess our surroundings and adjust our trajectory to ensure each subsequent step brings us closer to the valley floor.
These steps are guided by the learning rate — a vital hyperparameter symbolized as lr in the equations. This learning rate controls the size of our steps or adjustments to the parameters A and B, ensuring that we do not overshoot the minimum.
Weight iterative equations. Every step gets us closer to the optimal solution.
As we take each step, we calculate the partial derivatives of the cost function with respect to A and B, denoted as dA and dB respectively. These derivatives point us in the direction where the cost function decreases the fastest, akin to finding the steepest descent on our metaphorical hill.
The updated equations for A and B in each iteration, factoring in the learning rate, are as follows:
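Written out in their standard form (with n data points and predictions ŷᵢ = A·xᵢ + B), these updates are:

dA = −(2/n) · Σ xᵢ · (yᵢ − ŷᵢ)
dB = −(2/n) · Σ (yᵢ − ŷᵢ)

A = A − lr · dA
B = B − lr · dB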
This meticulous process is repeated until we reach a point where the cost function’s decrease is negligible, suggesting we’ve arrived at or near the global minimum — our destination where the predictive error is minimized, and our model’s accuracy is maximized.
Image by author. Representation of Gradient Descent.
This translates into defining two main functions:
The function to compute MSE
The function to update A and B
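A sketch of what those two functions might look like (the names are illustrative, assuming NumPy arrays x and y for height and weight):

```python
import numpy as np

def compute_mse(x, y, A, B):
    """Mean squared error of the line y_hat = A*x + B."""
    y_hat = A * x + B
    return np.mean((y - y_hat) ** 2)

def update_weights(x, y, A, B, lr):
    """One gradient descent step for A and B."""
    n = len(x)
    y_hat = A * x + B
    dA = -(2 / n) * np.sum(x * (y - y_hat))  # partial derivative w.r.t. A
    dB = -(2 / n) * np.sum(y - y_hat)        # partial derivative w.r.t. B
    return A - lr * dA, B - lr * dB
```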
We initialize our code with:
A = 0
B = 0
A learning rate of 0.0001 (the learning rate allows the algorithm to learn faster or slower).
A max number of iterations
So the final code would be as follows:
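Here is a sketch of that training loop, reusing the helper functions above and the starting values from the list (the convergence tolerance is my own choice, not taken from the article):

```python
A, B = 0.0, 0.0
lr = 0.0001
max_iterations = 10_000
tolerance = 1e-6

prev_mse = compute_mse(x, y, A, B)
for _ in range(max_iterations):
    A, B = update_weights(x, y, A, B, lr)
    mse = compute_mse(x, y, A, B)
    if abs(prev_mse - mse) < tolerance:  # stop once the improvement is negligible
        break
    prev_mse = mse

print(f"A: {A:.2f}, B: {B:.2f}, MSE: {mse:.2f}")
```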
Approach #3: Sci-Kit Learn — The Python Power Move
For those who love Python, Sci-Kit Learn is the Swiss Army knife for machine learning. It’s packed with tools for regression, classification, clustering, and more.
All we need to do is import the LinearRegression class, create an object, and fit it to our data.
This translates into a few lines of code:
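A minimal sketch (again assuming Height and Weight as the column names):

```python
from sklearn.linear_model import LinearRegression

# scikit-learn expects a 2-D array of features, hence the double brackets
X = df[["Height"]]
y = df["Weight"]

model = LinearRegression()
model.fit(X, y)

print(f"A (slope): {model.coef_[0]:.2f}, B (intercept): {model.intercept_:.2f}")
```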
Voila!
We have our model.
#3. Final Results
After applying our techniques, we get our formula:
Image by author. Final Results.
With the data we crunched, A turned out to be around 7.17, and B was approximately -350.73.
What does this mean?
For every additional inch of height, the predicted weight increases by about 7.17 units, while the intercept of roughly -350.73 anchors where the line crosses the axis.
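For instance, plugging a height of 70 into the formula gives roughly 7.17 × 70 − 350.73 ≈ 151 as the predicted weight.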
The Assumptions Game
No model is perfect, and linear regression relies on certain assumptions:
Linearity: our data should follow a roughly straight-line pattern when plotted. Remember, we already checked this linear pattern in our first analysis with a simple scatter plot.
Independence: our observations need to be independent of one another (and with multiple inputs, the predictors should not be strongly correlated with each other).
Normal distribution of residuals: The differences between our observed and predicted values should form a bell curve when plotted.
Residuals plot.
Equal variance of residuals: The spread of our errors should be consistent across all values of our independent variable.
Distribution of residuals.
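A quick way to check the last two assumptions ourselves is to plot the residuals, as in this sketch (assuming the fitted scikit-learn model and data from above):

```python
import matplotlib.pyplot as plt

# Residuals: observed minus predicted weight
residuals = y - model.predict(X)

# Normality check: the residuals should look roughly bell-shaped
plt.hist(residuals, bins=30)
plt.title("Distribution of residuals")
plt.show()

# Equal-variance check: points should scatter evenly around the zero line
plt.scatter(model.predict(X), residuals, alpha=0.3)
plt.axhline(0, color="red")
plt.xlabel("Predicted weight")
plt.ylabel("Residual")
plt.show()
```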
Final Thoughts
Linear regression might seem deceptively simple, but it’s a powerful tool in your data science arsenal.
Remember, the beauty of data analysis lies in its simplicity.
It’s not about using complex models, it’s about using the right model for the right job.
With these concepts in your pocket, you’re well on your way to uncovering the stories hidden within your data.
Stay curious, and keep exploring the data universe! 🤓
You can go check the code in the following GitHub repo.
Don’t forget to follow ForCode’Sake to get more articles like this one! ✨
Did you like this MLBasics issue? Then you can subscribe to my DataBites Newsletter to stay tuned and receive my content right to your mail!
I promise it will be unique!
You can also find me on X, Threads and LinkedIn, where I post daily cheatsheets about ML, SQL, Python and DataViz.