Summary

The author improved their score in Kaggle's House Price competition by implementing advanced regression techniques, including imputation, scaling, outlier removal, feature selection, model selection, parameter tuning, and ensemble methods.

Abstract

In the blog post, the author details their journey to enhance their performance in Kaggle's House Price competition. They outline a series of sophisticated regression strategies, starting with data preprocessing techniques such as imputation of missing values, encoding of categorical features, and standardization of data. The author also emphasizes the importance of outlier removal, feature selection using the SelectKBest method, and data splitting for validation. They experiment with multiple models, including LinearRegression, KNearestNeighbor, and ExtraTrees, and employ GridSearchCV for hyperparameter tuning. The final approach involves using a voting regressor ensemble method to average predictions from the individual models, which results in a significant improvement in the competition score. The author provides visual evidence of their results and shares the code repository for the implemented techniques.

Opinions

The author believes in the importance of employing a variety of advanced regression techniques to improve model performance.
They suggest that iterative imputation and careful handling of missing data are crucial for regression tasks.
The use of a voting regressor ensemble is presented as an effective method for achieving better predictions in regression competitions.
The author values the experimental functions provided by libraries like sklearn, indicating a preference for cutting-edge tools in data science.
They highlight the significance of visualizing data and results, as shown by the inclusion of multiple graphs and heatmaps in the post.
The author's approach to model selection and parameter tuning reflects a systematic and thorough methodology in machine learning.
They express confidence in the impact of their techniques by showcasing the improved score on Kaggle's leaderboard.

How I improved my score on Kaggle’s House Price competition using advanced regression techniques

In my last post I discussed how to improve the score on Kaggle’s Titanic competition by utilising advanced classification techniques. The link to this post can be found here:-https://readmedium.com/how-i-improved-my-score-on-kaggles-titanic-competition-by-using-advanced-classification-techniques-a2f5f63f7194

In this post I intend to improve the score on Kaggle’s House Price competition by using advanced regression techniques. The techniques I tend to employ are a little different from those I used in my last post, so it is therefore important to read the post to the very end to find out what those differences are.

The link to the Kaggle House Price competition can be found here:- https://www.kaggle.com/c/house-prices-advanced-regression-techniques

As can be seen from the screenshot below, I was able to improve the score on Kaggle’s House Prices competition over 1 point by employing a few advanced regression techniques, to include imputation, scaling, removing outliers, feature selection, model selection, parameter tuning, and ensembles:-

I created the program in Kaggle’s free online Jupyter Notebook and stored it in the House Price repository. The problem statement for the House Price competition can be seen below:-

Once I created the program in the Jupyter Notebook, I imported the libraries that I would need to execute the program. I generally only import libraries as I need them, so on this occasion I imported pandas, numpy, sklearn, matplotlib and seaborn. Pandas creates and manipulates dataframes, numpy creates arrays and carries out algebraic computations, sklearn houses the copious machine learning functions, and matplotlib and seaborn are graphics libraries that plot the various data points.

I then used the os library to go into the operating system and retrieve the csv files that would be used in the program:-

I defined the variable, target, which is composed of the sale price of the properties.

I then used seaborn to plot the sale price of the properties:-

I then created a heatmap, which reveals the correlation that all of the numerical features have to each other:-

I dropped the column, ‘SalePrice’, from a copy of the train file and then appended the test file to that file to create the file, combi, which will be used as the basis to make predictions on:-

I then dropped the column, ‘Id’, from combi because it is not necessary to make predictions on the data:-

The first advanced regression technique I employed was to use a for loop to impute all of the missing values in the dataset, being objects, integers, or floats. The floats were imputed using sklearn’s IterativeImputer, which as an experimental function:-

Once the missing values had been imputed, I used sklearn’s OrdinalEncoder to encode all of the features that were data type object:-

The second advanced regression technique that I employed was to standardise the data, to ensure all of the data points were in close proximity to each other so it will be easier for the model to make predictions on it:-

The third advanced regression technique I used was to remove the outliers from the training set. I initially accomplished this by creating the variables of long_train and long_target to keep the train and test datasets separate from each other:-

I then set up the lower and higher limits and filtered them out of the training dataset:-

Once the outliers had been filtered out, I defined the y variable and then dropped it from the train dataset:-

I then appended the filtered test file to the filtered train file to come up with a filtered combi file:-

Once the combi_filtered file had been established, I defined the X and X_test variables:-

The fourth advanced regression technique I used was to use sklearn’s SelectKBest method to choose the 20 best features for the model to make predictions on:-

I then used sklearn to split the X and y variables into training and testing datasets:-

The fifth advanced regression technique I employed was model selection. For this purpose I chose three models to make predictions on. In addith, the sixth advanced regression technique I employed was to use sklearn’s GridSearchCV to select the correct parameters to tune the model to, but I have not included the code in this post.

The first model I selected was sklearn’s LinearRegression model, for which I scored 80% accuracy when I trained and fitted the model:-

When I predicted on the validation set, I achieved an accuracy of 81% and an error of 27879:-

The second model I selected was sklearn’s KNearesrNeighbor, and I scored 99.99% accuracy when I trained and fitted the model:-

When I predicted on the validation set, I achieved an accuracy of 83.36% and an error of 26364:-

The third model I selected was ExtraTrees. I achieved an accuracy of 99.99% when I trained and fitted the model:-

When I predicted on the validation set, I achieved an accuracy of 84.5% and an error of 25423:-

The seventh advanced regression technique I employed was to choose an ensemble method. In this instance I chose a voting regressor, which combined the three models I had previously used and averages those predictions to form one final prediction. When I employed this technique I scored 97.85% accuracy when I trained and fitted the model:-

When I predicted on the validation set, I achieved an accuracy of 85.42% and an error of 24673. Because this is a regression problem, the error is what is looked at. Therefore, the lower the error is, the better the score is:-

I then used matplotlib to plot a graph of how the predicted values compare to the actual values:-

Once the validation set had been successfully predicted on, I predicted on the X_test set and prepared the predictions to be submitted to Kaggle for scoring:-

In this instance I scored about one point better than my previous submissions. This reveals that by employing advanced regression techniques, it is possible to improve one’s score when entering Kaggle competitions.

The code for this post can be found in my personal Kaggle account, the link being here:- https://www.kaggle.com/tracyporter/r9-house-prices-udemy-improve/notebook?scriptVersionId=77678320