Data Science with Python — Regression

This article is part of the “Data Science with Python” series.
Regression is a fundamental technique in data science that enables us to analyze and model the relationship between variables. Whether you are trying to predict future sales, understand the impact of marketing campaigns, or identify the key factors that influence customer behavior, regression is an essential tool in your data science toolbox.
Today, we will explore the basics of regression analysis in Python.
Types of Regression
In regression analysis, the main goal is to predict a continuous output variable based on one or more input variables. There are several types of regression, each suited to different types of data and research questions. Here are some of the most common types of regression:
- Linear regression: This is the simplest form of regression, in which we try to find a straight line that best fits the data. It is used when the relationship between the input and output variables is expected to be linear.
- Logistic regression: This is used when the output variable is binary (e.g., yes or no, 0 or 1) and we want to predict the probability of the output being one of the two values.
- Polynomial regression: This is used when the relationship between the input and output variables is expected to be nonlinear. It involves fitting a polynomial equation to the data (see the sketch after this list).
- Ridge and Lasso regression: These are two types of regularized regression that are used to prevent overfitting. They add a penalty term to the regression equation to reduce the impact of the input variables that are less important.
- Time series regression: This is used when the data is collected over time, and the goal is to predict future values based on historical data.
- Bayesian regression: This is a type of regression that uses Bayesian inference to estimate the parameters of the model and make predictions.
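To give a feel for how these variants look in practice, here is a minimal sketch of polynomial regression in scikit-learn, fitting made-up data with the PolynomialFeatures helper to expand the inputs before an ordinary linear fit:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Made-up data roughly following y = x² plus noise (for illustration only)
X = np.arange(10).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.randn(10)

poly = PolynomialFeatures(degree=2)  # adds x and x² columns to the inputs
model = LinearRegression().fit(poly.fit_transform(X), y)
print(model.predict(poly.transform([[12]])))  # predict for a new x value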
Which type you choose depends on the nature of your data and the question you are trying to answer.
Linear Regression
I won’t explain each type of regression here, but linear regression is probably the most important one to know, so I’ll explain it in more detail.
The basic idea of linear regression is to fit a straight line to the data that best represents the linear relationship between the variables. Mathematically, linear regression can be represented as follows:
Y = β0 + β1X + ε
where Y is the dependent variable, X is the independent variable, β0 is the intercept, β1 is the coefficient that represents the slope of the line, and ε is the error term that captures the randomness or variability in the data. The goal of linear regression is to estimate the values of β0 and β1 that minimize the sum of the squared errors (SSE) between the predicted values and the actual values of the dependent variable. In other words, we want to find the line that is closest to the data points.
The equation for the least squares estimate of β1 is:
β1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²
where Xi and Yi are the values of the independent and dependent variables for observation i, X̄ and Ȳ are the mean values of the independent and dependent variables, and Σ represents the sum over all observations. The equation for the least squares estimate of β0 is:
β0 = Ȳ − β1X̄
Once we have estimated the values of β0 and β1, we can use them to make predictions of the dependent variable for new values of the independent variable. Linear regression can be extended to multiple independent variables by using multiple regression, which involves fitting a hyperplane to the data.
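As a quick sanity check, here is a minimal sketch that applies these two formulas directly with NumPy; the data values are hypothetical, chosen only for illustration:
import numpy as np

# Made-up data roughly following Y = 2X (for illustration only)
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares estimates from the formulas above
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()
print(beta0, beta1)  # should come out close to 0 and 2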

Building Regression Models in Python
Building regression models in Python is “easy” thanks to several libraries such as Scikit-learn, TensorFlow, and Keras. We’ll focus on Scikit-learn, as it’s the easiest library to get started with.
Scikit-learn is a popular machine learning library in Python. It provides a simple and efficient way to build regression models. Here’s a step-by-step guide to building a regression model using Scikit-learn:
- Step 1: Import the libraries: Once Scikit-learn is installed (with pip install scikit-learn, for example), you can import the necessary libraries. We’ll also need Pandas.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
- Step 2: Load the data: We need to split the dataset into the features (X) and the target (y).
data = pd.read_csv('data.csv')
X = data.iloc[:, :-1].values  # every column except the last one is a feature
y = data.iloc[:, -1].values   # the last column is the target
- Step 3: Split the data into training and testing sets: Thanks to Scikit-learn, we can just use train_test_split for this step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # hold out 20% of the data for testing
- Step 4: Fit the model: We’ll use Scikit-learn’s LinearRegression model for this step. If you want to use another type of regression, you can simply import the corresponding model instead.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
- Step 5: Evaluate the model: The mean squared error (MSE) is one of the most common metrics for evaluating a regression. Alternatives include the mean absolute error (MAE) and the R² score (accuracy and precision, by contrast, are metrics for classification, not regression).
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)
- Step 6: Make predictions: Once our model is trained, we can use it to make predictions for any new input values.
new_data = [[5.1, 3.5, 1.4, 0.2]]  # one sample with the same number of features as the training data
new_pred = regressor.predict(new_data)
print('Predicted Value:', new_pred[0])
Fine-Tuning a Regression
Fine-tuning regression models is an important step in building accurate models. Here are some techniques you can use to fine-tune your regression models:
Cross-validation: Cross-validation is a technique that helps you to evaluate the performance of your model and fine-tune its hyperparameters. It involves dividing your data into several folds, and using one fold for testing and the other folds for training. This process is repeated several times to ensure that all the data is used for both training and testing. Cross-validation is useful in preventing overfitting and improving the generalization of your model.
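For example, here is a minimal sketch of 5-fold cross-validation with Scikit-learn’s cross_val_score, reusing the regressor, X, and y from the earlier steps:
from sklearn.model_selection import cross_val_score

# Evaluate the model on 5 different train/test folds
scores = cross_val_score(regressor, X, y, cv=5, scoring='neg_mean_squared_error')
print('Mean MSE across folds:', -scores.mean())  # scores are negated MSE values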
Regularization: Regularization is a technique used to prevent overfitting by adding a penalty term to the regression model’s objective function. This penalty term discourages large weights for the model’s coefficients. Regularization can be achieved using two common methods: L1 regularization (Lasso) and L2 regularization (Ridge):
- L1 regularization (Lasso): L1 regularization adds the absolute values of the coefficients to the objective function. It forces the model to use only the most important features and sets the coefficients of the unimportant features to zero.
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
- L2 regularization (Ridge): L2 regularization adds the square of the coefficients to the objective function. It keeps all the features but shrinks the coefficients of the unimportant ones toward very small values, without setting them exactly to zero.
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
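In both cases, the alpha parameter controls the strength of the penalty and is itself a hyperparameter. A common approach, sketched below assuming X_train and y_train are already defined as in the earlier steps, is to combine the cross-validation idea above with a grid search over candidate alpha values:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Try several penalty strengths and keep the one with the best CV score
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)
print('Best alpha:', search.best_params_['alpha'])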
Feature selection: Feature selection is the process of keeping only the features that contribute most to your model’s performance. This reduces the complexity of the model and can improve its performance. Feature selection can be achieved using several techniques, such as:
- Univariate feature selection
- Recursive feature elimination
- Principal component analysis
Here is an example of univariate feature selection with SelectKBest, which scores each feature individually and keeps the k best:
from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(score_func=f_regression, k=2)  # keep the 2 highest-scoring features
X_train_new = selector.fit_transform(X_train, y_train)
X_test_new = selector.transform(X_test)
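You can then refit the regressor on X_train_new and evaluate it on X_test_new, exactly as in the earlier steps.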
Case Study
Regression analysis can be applied to a wide range of real-world problems, such as predicting house prices or customer churn. In this section, we will provide an example of how to use regression to predict house prices.
Our task is to predict the prices of houses in a certain area based on several factors such as the number of bedrooms, bathrooms, square footage, and location. We have a dataset of historical house prices and their features. Our goal is to build a regression model that can accurately predict house prices based on these features.
We will use the Boston Housing Dataset from the Scikit-learn library. This dataset contains information about houses in Boston, including their prices and features such as the number of rooms, the crime rate, and accessibility to highways.
We can load it this way:
from sklearn.datasets import load_boston
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['PRICE'] = boston_dataset.target
X = boston.drop('PRICE', axis=1)
y = boston['PRICE']
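One caveat: load_boston was deprecated in Scikit-learn 1.0 and removed in version 1.2, so the snippet above only works on older versions. On a recent installation, you can load a similar regression dataset instead, for example:
from sklearn.datasets import fetch_california_housing

# A comparable tabular regression dataset available in recent Scikit-learn versions
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target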
Now, with the X and y arrays in hand, you can complete the remaining steps from the previous section (splitting, fitting, and evaluating) to predict the house prices. I’ll let you try this yourself!
Final Note
As you can see, solving a linear regression problem is not that hard thanks to some Python libraries. And it’s very useful, as many real-world relationships are approximately linear.
In the next articles, we’ll cover some other approaches to data science. Be sure to follow me if you don’t want to miss them!