Data Science with Python — Regression

This article is part of the “Data Science with Python” series.
Regression is a fundamental technique in data science that enables us to analyze and model the relationship between variables. Whether you are trying to predict future sales, understand the impact of marketing campaigns, or identify the key factors that influence customer behavior, regression is an essential tool in your data science toolbox.
Today, we will explore the basics of regression analysis in Python.
Types of Regression
In regression analysis, the main goal is to predict a continuous output variable based on one or more input variables. There are several types of regression, each suited to different types of data and research questions. Here are some of the most common types of regression:
- Linear regression: This is the simplest form of regression, in which we try to find a straight line that best fits the data. It is used when the relationship between the input and output variables is expected to be linear.
- Logistic regression: This is used when the output variable is binary (e.g., yes or no, 0 or 1) and we want to predict the probability of the output being one of the two values.
- Polynomial regression: This is used when the relationship between the input and output variables is expected to be nonlinear. It involves fitting a polynomial equation to the data (see the sketch after this list).
- Ridge and Lasso regression: These are two types of regularized regression that are used to prevent overfitting. They add a penalty term to the regression equation to reduce the impact of the input variables that are less important.
- Time series regression: This is used when the data is collected over time, and the goal is to predict future values based on historical data.
- Bayesian regression: This is a type of regression that uses Bayesian inference to estimate the parameters of the model and make predictions.
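To give a feel for how these variants look in practice, here is a minimal sketch of polynomial regression in scikit-learn, fitting made-up data with the PolynomialFeatures helper to expand the inputs before an ordinary linear fit:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Made-up data roughly following y = x² plus noise (for illustration only)
X = np.arange(10).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.randn(10)

poly = PolynomialFeatures(degree=2)  # adds x and x² columns to the inputs
model = LinearRegression().fit(poly.fit_transform(X), y)
print(model.predict(poly.transform([[12]])))  # predict for a new x value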
Which type you choose depends on the nature of your data and the question you are trying to answer.
Linear Regression
I won’t explain each type of regression here, but linear regression is probably the most important one to know, so I’ll explain it in more detail.
The basic idea of linear regression is to fit a straight line to the data that best represents the linear relationship between the variables. Mathematically, linear regression can be represented as follows:
Y = β0 + β1X + ε
where Y is the dependent variable, X is the independent variable, β0 is the intercept, β1 is the coefficient that represents the slope of the line, and ε is the error term that captures the randomness or variability in the data. The goal of linear regression is to estimate the values of β0 and β1 that minimize the sum of the squared errors (SSE) between the predicted values and the actual values of the dependent variable. In other words, we want to find the line that is closest to the data points.
The equation for the least squares estimate of β1 is:
β1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²
where Xi and Yi are the values of the independent and dependent variables for observation i, X̄ and Ȳ are the mean values of the independent and dependent variables, and Σ represents the sum over all observations. The equation for the least squares estimate of β0 is:
β0 = Ȳ − β1X̄
Once we have estimated the values of β0 and β1, we can use them to make predictions of the dependent variable for new values of the independent variable. Linear regression can be extended to multiple independent variables by using multiple regression, which involves fitting a hyperplane to the data.
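As a quick sanity check, here is a minimal sketch that applies these two formulas directly with NumPy; the data values are hypothetical, chosen only for illustration:
import numpy as np

# Made-up data roughly following Y = 2X (for illustration only)
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares estimates from the formulas above
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()
print(beta0, beta1)  # should come out close to 0 and 2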

Building Regression Models in Python
Building regression models in Python is “easy” thanks to several libraries such as Scikit-learn, TensorFlow, and Keras. We’ll focus on Scikit-learn, as it’s the easiest library to get started with.
Scikit-learn is a popular machine learning library in Python. It provides a simple and efficient way to build regression models. Here’s a step-by-step guide to building a regression model using Scikit-learn:
- Step 1: Import the libraries: Once Scikit-learn is installed (with pip install scikit-learn, for example), you can import the necessary libraries. We’ll also need Pandas.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
- Step 2: Load the data: We need to split the dataset into the features (X) and the target (y).
data = pd.read_csv('data.csv')
X = data.iloc[:, :-1].values  # every column except the last one is a feature
y = data.iloc[:, -1].values   # the last column is the target
- Step 3: Split the data into training and testing sets: Thanks to Scikit-learn, we can just use train_test_split for this step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # hold out 20% of the data for testing
- Step 4: Fit the model: We’ll use Scikit-learn’s LinearRegression model for this step. If you want to use another type of regression, you can simply import the corresponding model instead.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
- Step 5: Evaluate the model: The mean squared error (MSE) is one of the most common metrics for evaluating a regression. Alternatives include the mean absolute error (MAE) and the R² score (accuracy and precision, by contrast, are metrics for classification, not regression).
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)
- Step 6: Make predictions: Once our model is trained, we can use it to make predictions for any new input values.
new_data = [[5.1, 3.5, 1.4, 0.2]]  # one sample with the same number of features as the training data
new_pred = regressor.predict(new_data)
print('Predicted Value:', new_pred[0])
Fine-Tuning a Regression
Fine-tuning regression models is an important step in building accurate models. Here are some techniques you can use to fine-tune your regression models:
Cross-validation: Cross-validation is a technique that helps you to evaluate the performance of your model and fine-tune its hyperparameters. It involves dividing your data into several folds, and using one fold for testing and the other folds for training. This process is repeated several times to ensure that all the data is used for both training and testing. Cross-validation is useful in preventing overfitting and improving the generalization of your model.
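For example, here is a minimal sketch of 5-fold cross-validation with Scikit-learn’s cross_val_score, reusing the regressor, X, and y from the earlier steps:
from sklearn.model_selection import cross_val_score

# Evaluate the model on 5 different train/test folds
scores = cross_val_score(regressor, X, y, cv=5, scoring='neg_mean_squared_error')
print('Mean MSE across folds:', -scores.mean())  # scores are negated MSE values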
Regularization: Regularization is a technique used to prevent overfitting by adding a penalty term to the regression model’s objective function. This penalty term discourages large weights for the model’s coefficients. Regularization can be achieved using two common methods: L1 regularization (Lasso) and L2 regularization (Ridge):
- L1 regularization (Lasso): L1 regularization adds the absolute values of the coefficients to the objective function. It forces the model to use only the most important features and sets the coefficients of the unimportant features to zero.
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
- L2 regularization (Ridge): L2 regularization adds the square of the coefficients to the objective function. It keeps all the features but shrinks the coefficients of the unimportant ones toward very small values, without setting them exactly to zero.
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
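In both cases, the alpha parameter controls the strength of the penalty and is itself a hyperparameter. A common approach, sketched below assuming X_train and y_train are already defined as in the earlier steps, is to combine the cross-validation idea above with a grid search over candidate alpha values:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Try several penalty strengths and keep the one with the best CV score
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)
print('Best alpha:', search.best_params_['alpha'])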
Feature selection: Feature selection is the process of keeping only the features that contribute most to your model’s performance. This reduces the complexity of the model and can improve its performance. Feature selection can be achieved using several techniques, such as:
- Univariate feature selection
- Recursive feature elimination
- Principal component analysis
Here is an example of univariate feature selection with SelectKBest, which scores each feature individually and keeps the k best:
from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(score_func=f_regression, k=2)  # keep the 2 highest-scoring features
X_train_new = selector.fit_transform(X_train, y_train)
X_test_new = selector.transform(X_test)
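You can then refit the regressor on X_train_new and evaluate it on X_test_new, exactly as in the earlier steps.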
Case Study
Regression analysis can be applied to a wide range of real-world problems, such as predicting house prices or customer churn. In this section, we will provide an example of how to use regression to predict house prices.
Our task is to predict the prices of houses in a certain area based on several factors such as the number of bedrooms, bathrooms, square footage, and location. We have a dataset of historical house prices and their features. Our goal is to build a regression model that can accurately predict house prices based on these features.
We will use the Boston Housing Dataset from the Scikit-learn library. This dataset contains information about houses in Boston, including their prices and features such as the number of rooms, the crime rate, and accessibility to highways.
We can load it this way:
from sklearn.datasets import load_boston
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['PRICE'] = boston_dataset.target
X = boston.drop('PRICE', axis=1)
y = boston['PRICE']
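One caveat: load_boston was deprecated in Scikit-learn 1.0 and removed in version 1.2, so the snippet above only works on older versions. On a recent installation, you can load a similar regression dataset instead, for example:
from sklearn.datasets import fetch_california_housing

# A comparable tabular regression dataset available in recent Scikit-learn versions
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target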
Now, with the X and y arrays in hand, you can complete the remaining steps from the previous section (splitting, fitting, and evaluating) to predict the house prices. I’ll let you try this yourself!
Final Note
As you can see, solving a linear regression problem is not that hard thanks to some Python libraries. And it’s very useful, as many real-world relationships are approximately linear.
In the next articles, we’ll cover some other approaches to data science. Be sure to follow me if you don’t want to miss them!