The article provides a comprehensive guide to understanding and implementing simple linear regression, using height and weight data as a practical example to illustrate the process.
Abstract
The article "Simplifying Linear Regression — A Beginner’s Guide with a real-world Practical Example" demystifies linear regression by using a classic example of predicting weight from height. It begins by establishing linear regression as a foundational machine learning algorithm, akin to finding the best line through scattered data points. The author then delves into exploring the dataset, ensuring data integrity, and examining data distribution before proceeding to model building. Three approaches to finding the best-fit line are discussed: Ordinary Least Squares (OLS), Gradient Descent, and using the scikit-learn library. Each method is explained with mathematical insights and corresponding Python code snippets. The article concludes with the presentation of the final regression formula derived from the data, a discussion of the assumptions underlying linear regression, and encouragement for readers to continue exploring data science.
Opinions
The author views linear regression as the "ABC of analytics," emphasizing its importance as a fundamental concept in machine learning.
There is a clear preference for hands-on learning, as the author encourages building algorithms from scratch to gain a deeper understanding before using pre-built libraries like scikit-learn.
The article promotes the use of Python for machine learning tasks, highlighting its simplicity and the robustness of its libraries.
The author suggests that the beauty of data analysis lies in simplicity and using the right model for the job, not necessarily complex models.
The importance of understanding and validating the assumptions of linear regression is stressed to ensure the model's reliability and accuracy.
The author encourages continuous learning and exploration in the field of data science, inviting readers to follow their curiosity and stay engaged with the community.
Simplifying Linear Regression — A Beginner’s Guide with a real-world Practical Example
Dive into Data Science with a Practical Height-Weight Prediction Model
Image by the author. ML Basics. Simple Linear Regression Example
Linear regression is one of the most fundamental algorithms in the machine learning universe, and it’s like the ABC of analytics.
Think of it like making the best line through a scatter of stars in the sky, where each star represents a data point.
Today, we’re diving into real-world ML with one of the classic examples of linear dependency — height and weight 👇🏻
Understanding the Basics
Simple linear regression is the most basic ML algorithm. It’s where most of us start our journey into data modeling.
Imagine plotting a graph of height against weight: you’ll likely see a roughly straight-line pattern showing that as height increases, so does weight.
That’s linear regression in a nutshell:
Finding the straight line — mathematically speaking — that best fits our data.
Image by the author. Simple Linear Regression.
So let’s see how we can do this…
#1. Exploring the Data - A Peek Before the Leap
Before diving into the analysis, let’s get friendly with our data.
Picture a table with three columns: gender, height, and weight.
Screenshot of the dataset.
A quick glance with df.info() in Python tells us we have thousands of entries and, importantly, no missing values.
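As a rough sketch of that first peek (the CSV file name here is my assumption, not necessarily the exact one used):

```python
import pandas as pd

# Hypothetical file name for the gender/height/weight dataset
df = pd.read_csv("weight-height.csv")

df.info()   # row count, column types, and non-null counts at a glance
df.head()   # first few rows: gender, height, weight
```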
And what about the distribution of our data?
Imagine two bell-shaped curves, one for height and one for weight, both as symmetrical as a butterfly’s wings — that’s our normal distribution right there.
Image by the author. Normal distribution for both weight and height.
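To eyeball those distributions ourselves, a quick sketch (assuming the columns are named Height and Weight) could look like this:

```python
import matplotlib.pyplot as plt

# Histograms for height and weight: both should look roughly bell-shaped
df[["Height", "Weight"]].hist(bins=30, figsize=(10, 4))
plt.show()

# A scatter plot also previews the linear pattern we are about to model
df.plot.scatter(x="Height", y="Weight", alpha=0.3)
plt.show()
```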
#2. How to Find Our Best-Fit Line?
There are different approaches to obtaining our best-fitting line. Today, most of us would take advantage of scikit-learn and apply its pre-built Linear Regression algorithm.
However, for this very first time let’s try three different approaches:
Two DIY approaches — building the algorithms from scratch all by ourselves.
One final approach taking advantage of scikit-learn.
Approach #1: Ordinary Least Squares (OLS)
The objective of Ordinary Least Squares (OLS) is to determine the optimal coefficients A and B by minimizing the sum of squared prediction errors — the mean squared error (MSE), which, remember, is the cost function of linear regression.
Leveraging calculus, we exploit the properties of partial derivatives to locate the minima of the cost function, where these derivatives equal zero.
By solving those equations, we get an exact closed-form formula for both A and B, providing us with a direct route to the best-fitting linear model.
Image by author. Obtaining OLS closed mathematical functions
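In standard notation (x for height, y for weight, and x̄, ȳ for their means), the resulting closed-form solutions are:

A = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
B = ȳ − A·x̄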
This translates into a few lines of code that compute this closed-form solution, so it is quite straightforward:
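A minimal sketch of that computation (the column names and variable names are my assumptions, not necessarily the author’s exact code):

```python
import numpy as np

# Assumed column names; x is height, y is weight
x = df["Height"].to_numpy()
y = df["Weight"].to_numpy()

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS solution: slope A and intercept B
A = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
B = y_mean - A * x_mean

print(f"A (slope): {A:.2f}, B (intercept): {B:.2f}")
```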
Approach #2: Gradient Descent — The Hike Down the Mountain
Gradient descent is a pivotal optimization algorithm used to minimize the cost function, aiding us in our aim to find the most accurate weight values for our predictive model.
Envision standing atop a hill: your objective is the valley below, which represents our cost function’s minimum point.
To reach it, we begin with initial guesses for our weights, A and B, and iteratively refine these guesses.
The process is akin to descending a hill: with each step, we assess our surroundings and adjust our trajectory to ensure each subsequent step brings us closer to the valley floor.
These steps are guided by the learning rate — a vital hyperparameter symbolized as lr in the equations. This learning rate controls the size of our steps or adjustments to the parameters A and B, ensuring that we do not overshoot the minimum.
Weight iterative equations. Every step gets us closer to the optimal solution.
As we take each step, we calculate the partial derivatives of the cost function with respect to A and B, denoted as dA and dB respectively. These derivatives point us in the direction where the cost function decreases the fastest, akin to finding the steepest descent on our metaphorical hill.
The updated equations for A and B in each iteration, factoring in the learning rate, are as follows:
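Written out in their standard form (with n data points and predictions ŷᵢ = A·xᵢ + B), these updates are:

dA = −(2/n) · Σ xᵢ · (yᵢ − ŷᵢ)
dB = −(2/n) · Σ (yᵢ − ŷᵢ)

A = A − lr · dA
B = B − lr · dB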
This meticulous process is repeated until we reach a point where the cost function’s decrease is negligible, suggesting we’ve arrived at or near the global minimum — our destination where the predictive error is minimized, and our model’s accuracy is maximized.
Image by author. Representation of Gradient Descent.
This translates into defining two main functions:
The function to compute MSE
The function to update A and B
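A sketch of what those two functions might look like (the names are illustrative, assuming NumPy arrays x and y for height and weight):

```python
import numpy as np

def compute_mse(x, y, A, B):
    """Mean squared error of the line y_hat = A*x + B."""
    y_hat = A * x + B
    return np.mean((y - y_hat) ** 2)

def update_weights(x, y, A, B, lr):
    """One gradient descent step for A and B."""
    n = len(x)
    y_hat = A * x + B
    dA = -(2 / n) * np.sum(x * (y - y_hat))  # partial derivative w.r.t. A
    dB = -(2 / n) * np.sum(y - y_hat)        # partial derivative w.r.t. B
    return A - lr * dA, B - lr * dB
```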
We initialize our code with:
A = 0
B = 0
A learning rate of 0.0001 (the learning rate allows the algorithm to learn faster or slower).
A max number of iterations
So the final code would be as follows:
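Here is a sketch of that training loop, reusing the helper functions above and the starting values from the list (the convergence tolerance is my own choice, not taken from the article):

```python
A, B = 0.0, 0.0
lr = 0.0001
max_iterations = 10_000
tolerance = 1e-6

prev_mse = compute_mse(x, y, A, B)
for _ in range(max_iterations):
    A, B = update_weights(x, y, A, B, lr)
    mse = compute_mse(x, y, A, B)
    if abs(prev_mse - mse) < tolerance:  # stop once the improvement is negligible
        break
    prev_mse = mse

print(f"A: {A:.2f}, B: {B:.2f}, MSE: {mse:.2f}")
```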
Approach #3: Sci-Kit Learn — The Python Power Move
For those who love Python, Sci-Kit Learn is the Swiss Army knife for machine learning. It’s packed with tools for regression, classification, clustering, and more.
All we need to do is import the LinearRegression class, create an object, and fit it to our data.
This translates into a few lines of code:
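A minimal sketch (again assuming Height and Weight as the column names):

```python
from sklearn.linear_model import LinearRegression

# scikit-learn expects a 2-D array of features, hence the double brackets
X = df[["Height"]]
y = df["Weight"]

model = LinearRegression()
model.fit(X, y)

print(f"A (slope): {model.coef_[0]:.2f}, B (intercept): {model.intercept_:.2f}")
```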
Voila!
We have our model.
#3. Final Results
After applying our techniques, we get our formula:
Image by author. Final Results.
With the data we crunched, A turned out to be around 7.17, and B was approximately -350.73.
What does this mean?
For every additional inch of height, the predicted weight increases by about 7.17 units, while the intercept of roughly -350.73 anchors where the line crosses the axis.
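For instance, plugging a height of 70 into the formula gives roughly 7.17 × 70 − 350.73 ≈ 151 as the predicted weight.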
The Assumptions Game
No model is perfect, and linear regression relies on certain assumptions:
Linearity: our data should follow a roughly straight-line pattern when plotted. Remember, we already checked this linear pattern in our first analysis with a simple scatter plot.
Independence: our observations need to be independent of one another (and with multiple inputs, the predictors should not be strongly correlated with each other).
Normal distribution of residuals: The differences between our observed and predicted values should form a bell curve when plotted.
Residuals plot.
Equal variance of residuals: The spread of our errors should be consistent across all values of our independent variable.
Distribution of residuals.
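A quick way to check the last two assumptions ourselves is to plot the residuals, as in this sketch (assuming the fitted scikit-learn model and data from above):

```python
import matplotlib.pyplot as plt

# Residuals: observed minus predicted weight
residuals = y - model.predict(X)

# Normality check: the residuals should look roughly bell-shaped
plt.hist(residuals, bins=30)
plt.title("Distribution of residuals")
plt.show()

# Equal-variance check: points should scatter evenly around the zero line
plt.scatter(model.predict(X), residuals, alpha=0.3)
plt.axhline(0, color="red")
plt.xlabel("Predicted weight")
plt.ylabel("Residual")
plt.show()
```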
Final Thoughts
Linear regression might seem deceptively simple, but it’s a powerful tool in your data science arsenal.
Remember, the beauty of data analysis lies in its simplicity.
It’s not about using complex models, it’s about using the right model for the right job.
With these concepts in your pocket, you’re well on your way to uncovering the stories hidden within your data.
Stay curious, and keep exploring the data universe! 🤓
You can go check the code in the following GitHub repo.
Don’t forget to follow ForCode’Sake to get more articles like this one! ✨
Did you like this MLBasics issue? Then you can subscribe to my DataBites Newsletter to stay tuned and receive my content right to your mail!
I promise it will be unique!
You can also find me on X, Threads and LinkedIn, where I post daily cheatsheets about ML, SQL, Python and DataViz.