Data Science with Python — Predicting House Prices using Regression Analysis
This article is part of the “Data Science with Python” series; you can find the other stories of the series linked at the end of this article.
In the last article, I left you with a little exercise to practice data science. The objective was to predict house prices based on some features.
Today, I’ll make a detailed correction of it (we’ll just use another dataset since this one has been removed recently), just to make a little summary of what we have seen so far through this series.
Dataset Overview
The California Housing dataset contains information on the median house value for different neighborhoods in California, as well as various features that can be used to predict the median house value. The dataset was collected from the 1990 U.S. Census, and includes 20,640 instances with eight input features and a target variable.
The input features for each instance are:
- MedInc: Median income of households in the block.
- HouseAge: Median house age in the block.
- AveRooms: Average number of rooms per dwelling.
- AveBedrms: Average number of bedrooms per dwelling.
- Population: Total population in the block.
- AveOccup: Average number of occupants per dwelling.
- Latitude: Latitude of the block in decimal degrees.
- Longitude: Longitude of the block in decimal degrees.
The target variable is the median house value for each neighborhood in units of 100,000 dollars.
To get a better sense of the distribution of values for each feature in the dataset, we can compute some summary statistics. Here’s some code to do that:
import pandas as pd
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)
df['MedHouseVal'] = california.target
summary = pd.DataFrame({
    'Mean': df.mean(),
    'Std Dev': df.std(),
    'Min': df.min(),
    'Max': df.max()
})
print(summary)
This code loads the California Housing dataset using the fetch_california_housing() function and creates a Pandas DataFrame from it. It then computes the mean, standard deviation, minimum, and maximum of each feature using the DataFrame’s mean(), std(), min(), and max() methods, and stores the results in a new DataFrame named summary.
Mean Std Dev Min Max
MedInc 3.870671 1.899822 0.499900 15.000100
HouseAge 28.639486 12.585558 1.000000 52.000000
AveRooms 5.429000 2.474173 0.846154 141.909091
AveBedrms 1.096675 0.473911 0.333333 34.066667
Population 1425.476744 1132.462122 3.000000 35682.000000
AveOccup 3.070655 10.386050 0.692308 1243.333333
Latitude 35.631861 2.135952 32.540000 41.950000
Longitude -119.569704 2.003532 -124.350000 -114.310000
MedHouseVal 2.068558 1.153956 0.149990 5.000010
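As a side note, Pandas can produce an equivalent summary (plus quartiles) in a single call with describe(); a minimal sketch reusing the df built above:
# describe() returns count, mean, std, min, quartiles and max for each column
print(df.describe().T)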
We can also use matplotlib to visualize the data differently:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
# Plot the distribution of the target variable (median house value)
plt.hist(california.target, bins=50)
plt.title('Distribution of Median House Values')
plt.xlabel('Median House Value (in units of 100,000 dollars)')
plt.ylabel('Frequency')
plt.show()
# Plot longitude (last column) vs. latitude (second-to-last column), colored by median house value
plt.scatter(california.data[:, -1], california.data[:, -2], c=california.target)
plt.title('Geographical Distribution of Median House Values')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.colorbar()
plt.show()


Data Preparation
Before we can build a regression model to predict median house values in California, we need to prepare the dataset for training and evaluation. Here are the steps we’ll take:
- Split the dataset into a training set and a test set.
- Standardize the input features so that they have zero mean and unit variance.
We’ll split the dataset into a training set and a test set using the train_test_split() function from Scikit-Learn, with an 80/20 split: 80% of the data will be used for training and 20% for testing. Here’s the code to do that:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(california.data, california.target, test_size=0.2, random_state=42)
Here, california.data contains the input features (i.e., the X matrix) and california.target contains the target values (i.e., the y vector). The test_size parameter specifies that we want to use 20% of the data for testing, and the random_state parameter is set to ensure that we get the same split every time we run the code.
We’ll use the training set to train our regression model, and the test set to evaluate its performance on new data that it hasn’t seen before. This will give us a more realistic estimate of how well our model will generalize to new data.
To standardize the input features, we’ll use the StandardScaler class from Scikit-Learn.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
This code creates a StandardScaler object and fits it to the training data using the fit() method. Then it uses the scaler’s transform() method to standardize the training and test sets. Standardization subtracts the mean of each feature and divides by its standard deviation, so that each feature has zero mean and unit variance. Note that the scaler is fitted on the training set only, so no information from the test set leaks into the preprocessing.
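As a quick sanity check (a minimal sketch reusing X_train_scaled from above), we can verify that the scaled training features do have mean 0 and standard deviation 1:
import numpy as np
# Per-feature mean should be ~0 and standard deviation ~1 after scaling
print(np.round(X_train_scaled.mean(axis=0), 3))
print(np.round(X_train_scaled.std(axis=0), 3))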
Now that our data is preprocessed, we’re ready to start building our regression model!
Modeling
We’ll use the LinearRegression class from Scikit-Learn to fit a linear regression model to the training data.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)
This code creates a LinearRegression object and fits it to the scaled training data using the fit() method. Once the model is trained, we can use it to make predictions on new data with the predict() method.
To evaluate the performance of our model, we’ll use the mean squared error (MSE) and the coefficient of determination (R²) on the test set. The MSE measures the average squared difference between the predicted and actual values, while the R² measures the proportion of the variance in the target variable that is explained by the model.
from sklearn.metrics import mean_squared_error, r2_score
y_pred = lin_reg.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error: {:.2f}".format(mse))
print("R^2: {:.2f}".format(r2))
This code uses the predict() method to make predictions on the test set, and then calculates the MSE and R² using the mean_squared_error() and r2_score() functions from Scikit-Learn. The results tell us how well our model is able to generalize to new data.
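To make the metrics more concrete, here is a small sketch (assuming the y_test and y_pred arrays from above) that computes the same two quantities by hand with NumPy:
import numpy as np
# MSE: average squared difference between predictions and actual values
mse_manual = np.mean((y_test - y_pred) ** 2)
# R²: 1 minus the ratio of residual variance to total variance of the target
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot
print("Manual MSE: {:.2f}".format(mse_manual))
print("Manual R^2: {:.2f}".format(r2_manual))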
Results and Analysis
Our linear regression model has been trained and evaluated, and now we can analyze the results. Let’s start by looking at the coefficients of the model, which tell us how much each input feature contributes to the predicted output.
coef = lin_reg.coef_
intercept = lin_reg.intercept_
for i, c in enumerate(coef):
    print("Coefficient {}: {:.2f}".format(i + 1, c))
print("Intercept: {:.2f}".format(intercept))
This code uses the coef_ and intercept_ attributes of the LinearRegression object to get the coefficients and intercept of the model, and then prints them to the console. The coefficients tell us how much each feature contributes to the predicted target, while the intercept is the value of the target when all the input features are zero (which, since the features are standardized, corresponds to each feature being at its mean).
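Since numbered coefficients are hard to read, here is a small sketch (reusing california.feature_names) that pairs each coefficient with its feature name and sorts them by magnitude:
import pandas as pd
# Match each coefficient to its feature name and sort by absolute value
coef_series = pd.Series(coef, index=california.feature_names)
print(coef_series.reindex(coef_series.abs().sort_values(ascending=False).index))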
Next, let’s visualize the predicted and actual target values on the test set.
import matplotlib.pyplot as plt
# Create a scatter plot of predicted vs. actual values
plt.scatter(y_test, y_pred)
# Draw the diagonal (perfect-prediction) line for reference
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Predicted vs. Actual Values")
plt.show()
If the model is accurate, the points should fall close to the diagonal line.

In this particular example, we can see that the points are mostly clustered around the diagonal line, but there are some instances where the prediction is far off from the actual value. This suggests that the model is reasonably accurate overall, but could benefit from further refinement.
Finally, let’s visualize the distribution of the residuals, which are the differences between the predicted and actual values.
residuals = y_test - y_pred
plt.hist(residuals, bins=50)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Residual Distribution")
plt.show()

In this example, we can see that the residuals are mostly clustered around zero, which is a good sign. However, the distribution has tails on both sides, with a somewhat heavier tail on the right, which suggests that there are instances where the model under-predicts the target value (and, less often, over-predicts it). This could be an area for improvement in future iterations of the model.
By analyzing the coefficients, the scatter plot, and the histogram of residuals, we can gain insights into how well our model is performing and where it may need improvement.
Optimization
After evaluating the performance of the initial model, there are several steps that can be taken to try to improve its accuracy. One approach is to experiment with different regression algorithms and see if they perform better than the linear regression model. Some possible algorithms to try include decision trees, random forests, and neural networks.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
# Initialize different models
dt_model = DecisionTreeRegressor()
rf_model = RandomForestRegressor()
nn_model = MLPRegressor(max_iter=1000)  # give the network enough iterations to converge
# Train and evaluate each model
dt_model.fit(X_train, y_train)
dt_preds = dt_model.predict(X_test)
dt_rmse = mean_squared_error(y_test, dt_preds, squared=False)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)
rf_rmse = mean_squared_error(y_test, rf_preds, squared=False)
# The neural network is sensitive to feature scale, so train it on the standardized inputs
nn_model.fit(X_train_scaled, y_train)
nn_preds = nn_model.predict(X_test_scaled)
nn_rmse = mean_squared_error(y_test, nn_preds, squared=False)
print("Decision Tree RMSE: ", dt_rmse)
print("Random Forest RMSE: ", rf_rmse)
print("Neural Network RMSE: ", nn_rmse)
In my case, the Random Forest works better than the other algorithms on this problem. Indeed, if we update our plots with the predictions of the Random Forest model, we get these charts:


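For reference, those updated charts can be reproduced with a few lines (a sketch reusing y_test and rf_preds from above):
# Predicted vs. actual values for the Random Forest model
plt.scatter(y_test, rf_preds)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Random Forest: Predicted vs. Actual Values")
plt.show()
# Residual distribution for the Random Forest model
plt.hist(y_test - rf_preds, bins=50)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Random Forest: Residual Distribution")
plt.show()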
Another approach is to experiment with different hyperparameters for the linear model. Plain LinearRegression has a closed-form solution and little to tune, but its gradient-descent counterpart, SGDRegressor, lets us adjust the learning rate and the number of iterations. We can also experiment with adding regularization terms to the loss function to help prevent overfitting.
from sklearn.linear_model import SGDRegressor
# Initialize linear regression model with different hyperparameters
lr_model_1 = SGDRegressor(eta0=0.001, max_iter=1000)
lr_model_2 = SGDRegressor(eta0=0.01, max_iter=10000, penalty='l2')
# Train and evaluate each model
lr_model_1.fit(X_train, y_train)
lr_preds_1 = lr_model_1.predict(X_test)
lr_rmse_1 = mean_squared_error(y_test, lr_preds_1, squared=False)
lr_model_2.fit(X_train, y_train)
lr_preds_2 = lr_model_2.predict(X_test)
lr_rmse_2 = mean_squared_error(y_test, lr_preds_2, squared=False)
print("Linear Regression 1 RMSE: ", lr_rmse_1)
print("Linear Regression 2 RMSE: ", lr_rmse_2)
In this case, we get very large prediction errors. The reason is not overfitting but the optimizer itself: stochastic gradient descent is very sensitive to feature scaling and to the learning rate, and on the raw (unscaled) features the gradient updates blow up, so the model diverges instead of converging to a good solution.
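A quick way to confirm this (a sketch reusing the standardized matrices from earlier) is to refit the same two models on the scaled features, which should bring the errors back to a sensible range:
# Refit the SGD models on the standardized features
lr_model_1.fit(X_train_scaled, y_train)
lr_rmse_1_scaled = mean_squared_error(y_test, lr_model_1.predict(X_test_scaled), squared=False)
lr_model_2.fit(X_train_scaled, y_train)
lr_rmse_2_scaled = mean_squared_error(y_test, lr_model_2.predict(X_test_scaled), squared=False)
print("Linear Regression 1 RMSE (scaled): ", lr_rmse_1_scaled)
print("Linear Regression 2 RMSE (scaled): ", lr_rmse_2_scaled)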
In addition to algorithm and hyperparameter tuning, we could also try to engineer new features from the existing data. For example, we could create new features by combining or transforming existing features, or we could try to gather new data that might be more informative for the target variable.
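As an illustration (a sketch on the df built earlier, with hypothetical feature names), here is how a couple of ratio features could be derived from the existing columns:
# Hypothetical engineered features: bedrooms per room and rooms per occupant
df['BedrmsPerRoom'] = df['AveBedrms'] / df['AveRooms']
df['RoomsPerOccup'] = df['AveRooms'] / df['AveOccup']
print(df[['BedrmsPerRoom', 'RoomsPerOccup']].describe())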
To ensure that we are making progress in our optimization efforts, we should also set aside a validation set in addition to the training and test sets. We can use the validation set to evaluate the performance of different models and hyperparameters and select the ones that perform best before testing them on the final test set.
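One minimal way to set this up (a sketch using a 60/20/20 split, with new variable names so the earlier split is left untouched) is to carve a validation set out of the training portion with two calls to train_test_split():
# First split off a 20% test set, then split the remainder 75/25 into train and validation
X_tmp, X_test_f, y_tmp, y_test_f = train_test_split(california.data, california.target, test_size=0.2, random_state=42)
X_train_f, X_val, y_train_f, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2
print(len(X_train_f), len(X_val), len(X_test_f))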
Final Note
I think this article is a good recap of many concepts we’ve seen so far. You should be able to solve a lot of problems using this approach.
If you don’t want to miss the other articles of this series, be sure to follow me!
To explore the other stories of this series, click below!
To explore more of my Python stories, click here! You can also access all my content by checking this page.
If you want to be notified every time I publish a new story, subscribe to me via email by clicking here!
If you’re not subscribed to medium yet and wish to support me or get access to all my stories, you can use my link: