Data Science with Python — Predicting House Prices using Regression Analysis
This article is part of the “Data Science with Python” series; you can find the other stories of the series linked at the end of this article.
In the last article, I left you with a little exercise to practice data science. The objective was to predict house prices based on some features.
Today, I’ll make a detailed correction of it (we’ll just use another dataset since this one has been removed recently), just to make a little summary of what we have seen so far through this series.
Dataset Overview
The California Housing dataset contains information on the median house value for different neighborhoods in California, as well as various features that can be used to predict the median house value. The dataset was collected from the 1990 U.S. Census, and includes 20,640 instances with eight input features and a target variable.
The input features for each instance are:
- MedInc: Median income of households in the block.
- HouseAge: Median house age in the block.
- AveRooms: Average number of rooms per dwelling.
- AveBedrms: Average number of bedrooms per dwelling.
- Population: Total population in the block.
- AveOccup: Average number of occupants per dwelling.
- Latitude: Latitude of the block in decimal degrees.
- Longitude: Longitude of the block in decimal degrees.
The target variable is the median house value for each neighborhood in units of 100,000 dollars.
To get a better sense of the distribution of values for each feature in the dataset, we can compute some summary statistics. Here’s some code to do that:
import pandas as pd
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)
df['MedHouseVal'] = california.target
summary = pd.DataFrame({
    'Mean': df.mean(),
    'Std Dev': df.std(),
    'Min': df.min(),
    'Max': df.max()
})
print(summary)
This code loads the California Housing dataset using the fetch_california_housing() function and creates a Pandas DataFrame from it. It then computes the mean, standard deviation, minimum, and maximum of each feature using the DataFrame’s mean(), std(), min(), and max() methods, and stores the results in a new DataFrame named summary.
Mean Std Dev Min Max
MedInc 3.870671 1.899822 0.499900 15.000100
HouseAge 28.639486 12.585558 1.000000 52.000000
AveRooms 5.429000 2.474173 0.846154 141.909091
AveBedrms 1.096675 0.473911 0.333333 34.066667
Population 1425.476744 1132.462122 3.000000 35682.000000
AveOccup 3.070655 10.386050 0.692308 1243.333333
Latitude 35.631861 2.135952 32.540000 41.950000
Longitude -119.569704 2.003532 -124.350000 -114.310000
MedHouseVal 2.068558 1.153956 0.149990 5.000010
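As a side note, Pandas can produce an equivalent summary (plus quartiles) in a single call with describe(); a minimal sketch reusing the df built above:
# describe() returns count, mean, std, min, quartiles and max for each column
print(df.describe().T)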
We can also use matplotlib to visualize the data differently:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
# Plot the distribution of the target variable (median house value)
plt.hist(california.target, bins=50)
plt.title('Distribution of Median House Values')
plt.xlabel('Median House Value (in units of 100,000 dollars)')
plt.ylabel('Frequency')
plt.show()
# Plot longitude (last column) vs. latitude (second-to-last column), colored by median house value
plt.scatter(california.data[:, -1], california.data[:, -2], c=california.target)
plt.title('Geographical Distribution of Median House Values')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.colorbar()
plt.show()


Data Preparation
Before we can build a regression model to predict median house values in California, we need to prepare the dataset for training and evaluation. Here are the steps we’ll take:
- Split the dataset into a training set and a test set.
- Standardize the input features so that they have zero mean and unit variance.
We’ll split the dataset into a training set and a test set using the train_test_split() function from Scikit-Learn, with an 80/20 split: 80% of the data will be used for training and 20% for testing. Here’s the code to do that:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(california.data, california.target, test_size=0.2, random_state=42)
Here, california.data contains the input features (i.e., the X matrix) and california.target contains the target values (i.e., the y vector). The test_size parameter specifies that we want to use 20% of the data for testing, and the random_state parameter is set to ensure that we get the same split every time we run the code.
We’ll use the training set to train our regression model, and the test set to evaluate its performance on new data that it hasn’t seen before. This will give us a more realistic estimate of how well our model will generalize to new data.
To standardize the input features, we’ll use the StandardScaler class from Scikit-Learn.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
This code creates a StandardScaler object and fits it to the training data using the fit() method. Then it uses the scaler’s transform() method to standardize the training and test sets. Standardization subtracts the mean of each feature and divides by its standard deviation, so that each feature has zero mean and unit variance. Note that the scaler is fitted on the training set only, so no information from the test set leaks into the preprocessing.
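As a quick sanity check (a minimal sketch reusing X_train_scaled from above), we can verify that the scaled training features do have mean 0 and standard deviation 1:
import numpy as np
# Per-feature mean should be ~0 and standard deviation ~1 after scaling
print(np.round(X_train_scaled.mean(axis=0), 3))
print(np.round(X_train_scaled.std(axis=0), 3))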
Now that our data is preprocessed, we’re ready to start building our regression model!
Modeling
We’ll use the LinearRegression class from Scikit-Learn to fit a linear regression model to the training data.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)
This code creates a LinearRegression object and fits it to the scaled training data using the fit() method. Once the model is trained, we can use it to make predictions on new data with the predict() method.
To evaluate the performance of our model, we’ll use the mean squared error (MSE) and the coefficient of determination (R²) on the test set. The MSE measures the average squared difference between the predicted and actual values, while the R² measures the proportion of the variance in the target variable that is explained by the model.
from sklearn.metrics import mean_squared_error, r2_score
y_pred = lin_reg.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error: {:.2f}".format(mse))
print("R^2: {:.2f}".format(r2))
This code uses the predict() method to make predictions on the test set, and then calculates the MSE and R² using the mean_squared_error() and r2_score() functions from Scikit-Learn. The results tell us how well our model is able to generalize to new data.
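To make the metrics more concrete, here is a small sketch (assuming the y_test and y_pred arrays from above) that computes the same two quantities by hand with NumPy:
import numpy as np
# MSE: average squared difference between predictions and actual values
mse_manual = np.mean((y_test - y_pred) ** 2)
# R²: 1 minus the ratio of residual variance to total variance of the target
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot
print("Manual MSE: {:.2f}".format(mse_manual))
print("Manual R^2: {:.2f}".format(r2_manual))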
Results and Analysis
Our linear regression model has been trained and evaluated, and now we can analyze the results. Let’s start by looking at the coefficients of the model, which tell us how much each input feature contributes to the predicted output.
coef = lin_reg.coef_
intercept = lin_reg.intercept_
for i, c in enumerate(coef):
    print("Coefficient {}: {:.2f}".format(i + 1, c))
print("Intercept: {:.2f}".format(intercept))
This code uses the coef_ and intercept_ attributes of the LinearRegression object to get the coefficients and intercept of the model, and then prints them to the console. The coefficients tell us how much each feature contributes to the predicted target, while the intercept is the value of the target when all the input features are zero (which, since the features are standardized, corresponds to each feature being at its mean).
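Since numbered coefficients are hard to read, here is a small sketch (reusing california.feature_names) that pairs each coefficient with its feature name and sorts them by magnitude:
import pandas as pd
# Match each coefficient to its feature name and sort by absolute value
coef_series = pd.Series(coef, index=california.feature_names)
print(coef_series.reindex(coef_series.abs().sort_values(ascending=False).index))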
Next, let’s visualize the predicted and actual target values on the test set.
import matplotlib.pyplot as plt
# Create a scatter plot of predicted vs. actual values
plt.scatter(y_test, y_pred)
# Draw the diagonal (perfect-prediction) line for reference
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Predicted vs. Actual Values")
plt.show()
If the model is accurate, the points should fall close to the diagonal line.

In this particular example, we can see that the points are mostly clustered around the diagonal line, but there are some instances where the prediction is far off from the actual value. This suggests that the model is reasonably accurate overall, but could benefit from further refinement.
Finally, let’s visualize the distribution of the residuals, which are the differences between the predicted and actual values.
residuals = y_test - y_pred
plt.hist(residuals, bins=50)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Residual Distribution")
plt.show()

In this example, we can see that the residuals are mostly clustered around zero, which is a good sign. However, the distribution has tails on both sides, with a somewhat heavier tail on the right, which suggests that there are instances where the model under-predicts the target value (and, less often, over-predicts it). This could be an area for improvement in future iterations of the model.
By analyzing the coefficients, the scatter plot, and the histogram of residuals, we can gain insights into how well our model is performing and where it may need improvement.
Optimization
After evaluating the performance of the initial model, there are several steps that can be taken to try to improve its accuracy. One approach is to experiment with different regression algorithms and see if they perform better than the linear regression model. Some possible algorithms to try include decision trees, random forests, and neural networks.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
# Initialize different models
dt_model = DecisionTreeRegressor()
rf_model = RandomForestRegressor()
nn_model = MLPRegressor(max_iter=1000)  # give the network enough iterations to converge
# Train and evaluate each model
dt_model.fit(X_train, y_train)
dt_preds = dt_model.predict(X_test)
dt_rmse = mean_squared_error(y_test, dt_preds, squared=False)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)
rf_rmse = mean_squared_error(y_test, rf_preds, squared=False)
# The neural network is sensitive to feature scale, so train it on the standardized inputs
nn_model.fit(X_train_scaled, y_train)
nn_preds = nn_model.predict(X_test_scaled)
nn_rmse = mean_squared_error(y_test, nn_preds, squared=False)
print("Decision Tree RMSE: ", dt_rmse)
print("Random Forest RMSE: ", rf_rmse)
print("Neural Network RMSE: ", nn_rmse)
In my case, the Random Forest works better than the other algorithms on this problem. Indeed, if we update our plots with the predictions of the Random Forest model, we get these charts:


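For reference, those updated charts can be reproduced with a few lines (a sketch reusing y_test and rf_preds from above):
# Predicted vs. actual values for the Random Forest model
plt.scatter(y_test, rf_preds)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Random Forest: Predicted vs. Actual Values")
plt.show()
# Residual distribution for the Random Forest model
plt.hist(y_test - rf_preds, bins=50)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Random Forest: Residual Distribution")
plt.show()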
Another approach is to experiment with different hyperparameters for the linear model. Plain LinearRegression has a closed-form solution and little to tune, but its gradient-descent counterpart, SGDRegressor, lets us adjust the learning rate and the number of iterations. We can also experiment with adding regularization terms to the loss function to help prevent overfitting.
from sklearn.linear_model import SGDRegressor
# Initialize linear regression model with different hyperparameters
lr_model_1 = SGDRegressor(eta0=0.001, max_iter=1000)
lr_model_2 = SGDRegressor(eta0=0.01, max_iter=10000, penalty='l2')
# Train and evaluate each model
lr_model_1.fit(X_train, y_train)
lr_preds_1 = lr_model_1.predict(X_test)
lr_rmse_1 = mean_squared_error(y_test, lr_preds_1, squared=False)
lr_model_2.fit(X_train, y_train)
lr_preds_2 = lr_model_2.predict(X_test)
lr_rmse_2 = mean_squared_error(y_test, lr_preds_2, squared=False)
print("Linear Regression 1 RMSE: ", lr_rmse_1)
print("Linear Regression 2 RMSE: ", lr_rmse_2)
In this case, we get very large prediction errors. The reason is not overfitting but the optimizer itself: stochastic gradient descent is very sensitive to feature scaling and to the learning rate, and on the raw (unscaled) features the gradient updates blow up, so the model diverges instead of converging to a good solution.
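A quick way to confirm this (a sketch reusing the standardized matrices from earlier) is to refit the same two models on the scaled features, which should bring the errors back to a sensible range:
# Refit the SGD models on the standardized features
lr_model_1.fit(X_train_scaled, y_train)
lr_rmse_1_scaled = mean_squared_error(y_test, lr_model_1.predict(X_test_scaled), squared=False)
lr_model_2.fit(X_train_scaled, y_train)
lr_rmse_2_scaled = mean_squared_error(y_test, lr_model_2.predict(X_test_scaled), squared=False)
print("Linear Regression 1 RMSE (scaled): ", lr_rmse_1_scaled)
print("Linear Regression 2 RMSE (scaled): ", lr_rmse_2_scaled)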
In addition to algorithm and hyperparameter tuning, we could also try to engineer new features from the existing data. For example, we could create new features by combining or transforming existing features, or we could try to gather new data that might be more informative for the target variable.
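As an illustration (a sketch on the df built earlier, with hypothetical feature names), here is how a couple of ratio features could be derived from the existing columns:
# Hypothetical engineered features: bedrooms per room and rooms per occupant
df['BedrmsPerRoom'] = df['AveBedrms'] / df['AveRooms']
df['RoomsPerOccup'] = df['AveRooms'] / df['AveOccup']
print(df[['BedrmsPerRoom', 'RoomsPerOccup']].describe())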
To ensure that we are making progress in our optimization efforts, we should also set aside a validation set in addition to the training and test sets. We can use the validation set to evaluate the performance of different models and hyperparameters and select the ones that perform best before testing them on the final test set.
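One minimal way to set this up (a sketch using a 60/20/20 split, with new variable names so the earlier split is left untouched) is to carve a validation set out of the training portion with two calls to train_test_split():
# First split off a 20% test set, then split the remainder 75/25 into train and validation
X_tmp, X_test_f, y_tmp, y_test_f = train_test_split(california.data, california.target, test_size=0.2, random_state=42)
X_train_f, X_val, y_train_f, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2
print(len(X_train_f), len(X_val), len(X_test_f))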
Final Note
I think this article is a good recap of many concepts we’ve seen so far. You should be able to solve a lot of problems using this approach.
If you don’t want to miss the other articles of this series, be sure to follow me!
To explore the other stories of this series, click below!
To explore more of my Python stories, click here! You can also access all my content by checking this page.
If you want to be notified every time I publish a new story, subscribe to me via email by clicking here!
If you’re not subscribed to medium yet and wish to support me or get access to all my stories, you can use my link: