Talk data to me — Modeling of Singapore Grand Prix 2018


The Marina Bay Street Circuit, also known as the “Singapore Street Circuit”, is a street circuit around Singapore’s Marina Bay that has hosted the Singapore Grand Prix every year since 2008. Due to Singapore’s year-round heat and humidity, the uneven surface of the street track and its heavy braking zones, the circuit is a physical challenge for both drivers and cars, and it is probably the most demanding race on the F1 calendar. Moreover, it is usually the longest race of the Formula One season in terms of duration, taking up to two hours to complete.
Background
Having read the 2018 World Cup prediction reports published by various banks, I decided to do my own data analysis and predictive modeling of the upcoming F1 race in Singapore. This short article covers the challenges faced, the lessons learnt and, of course, the results of the models and predictions. It is split into 5 sections: data mining, data processing and exploration, feature engineering, model building and the final results. This side project was mainly done after office hours, so I did not have much time for it. The plan was to come up with a detailed report by the start of qualifying, but this is what I have currently, and I plan to improve it even after the Singapore Grand Prix is over.
Data Mining
Mainly, I found two amazing free sites that gave me the data I needed: the Ergast F1 data repository and, of course, the official Formula 1 website. Both sources have essentially the same data, but having more than one source is excellent because I was able to cross-check them and ensure data accuracy, which is very important in any data science project. However, I was initially despondent when I saw the official site, which I was planning to rely on heavily. It had no good API endpoints or download links that would let me easily dump the data locally. Therefore, I spent some time coding a spider to crawl the official F1 site automatically, format the data nicely, and dump it into a CSV file that I could work on.
It was a pain in the arse to code, as there were many caveats to consider for the site. Each website is different, so I needed to go through the front-end code to really pry it open and extract the data within. Looking back, building a custom scraper was too time-consuming and not worth the effort when I am not coming back for data often, which is the case here, as each Grand Prix is held only once a year. Web scrapers are most worth the effort when the data is updated at a high frequency and you want the freshest data mined periodically. I could go on about the many challenges I faced in this data mining process, but it would take too long. I plan to attach/upload the spider so you can take a look at the source code and run it yourself.
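To give a rough idea of what the parsing half of such a spider involves, here is a minimal sketch using only the Python standard library. The HTML fragment, the column names and the `ResultsTableParser` class are all made up for illustration. They are not the official site's markup and not my actual spider, and the download step is left out entirely.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a downloaded results page.
SAMPLE_HTML = """
<table class="results">
  <tr><th>Pos</th><th>Driver</th><th>Time</th></tr>
  <tr><td>1</td><td>Driver A</td><td>1:36.015</td></tr>
  <tr><td>2</td><td>Driver B</td><td>1:36.334</td></tr>
</table>
"""


class ResultsTableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Parse the table in `html` and return its contents as CSV text."""
    parser = ResultsTableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer).writerows(parser.rows)
    return buffer.getvalue()


csv_text = table_to_csv(SAMPLE_HTML)
print(csv_text)
```

A real page needs far more care than this (nested tags, pagination, layout changes between seasons), which is where most of the time goes.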

To prevent data leakage and keep the model usable in practice, I only used features that are publicly available before the final race, such as practice timings, qualifying timings, current standings, constructor/driver stats and weather data. The final features considered are:
'driver', 'constructor', 'year', 'p1_time', 'p2_time', 'p3_time', 'qualifying_q1', 'qualifying_q2', 'qualifying_q3', 'start_position', 'driver_pos', 'driver_wins', 'driver_points', 'constructor_pos', 'constructor_wins', 'constructor_points', 'T-2 Daily Rainfall Total (mm)', 'T-2 Mean Temperature (°C)', 'T-2 Maximum Temperature (°C)', 'T-2 Minimum Temperature (°C)', 'T-2 Mean Wind Speed (km/h)', 'T-2 Max Wind Speed (km/h)', 'T-1 Daily Rainfall Total (mm)', 'T-1 Mean Temperature (°C)', 'T-1 Maximum Temperature (°C)', 'T-1 Minimum Temperature (°C)', 'T-1 Mean Wind Speed (km/h)', 'T-1 Max Wind Speed (km/h)', 'T-0 Daily Rainfall Total (mm)', 'T-0 Mean Temperature (°C)', 'T-0 Maximum Temperature (°C)', 'T-0 Minimum Temperature (°C)', 'T-0 Mean Wind Speed (km/h)', 'T-0 Max Wind Speed (km/h)', 'T-0 Humidity', 'factor_q1', 'factor_q2', 'factor_q3'
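The 'T-2', 'T-1' and 'T-0' weather columns are just daily readings flattened into one row per race. A minimal pandas sketch of that reshaping is below. The readings are made up, and the `weather_features` helper and the `date` column name are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Made-up daily weather readings around a race day (T-0); not real measurements.
weather = pd.DataFrame(
    {
        "date": pd.to_datetime(["2018-09-14", "2018-09-15", "2018-09-16"]),
        "Daily Rainfall Total (mm)": [0.0, 2.4, 0.2],
        "Mean Temperature (°C)": [28.1, 27.6, 28.4],
    }
)


def weather_features(weather, race_day, days_back=2):
    """Flatten the readings for race day and the days before it into one row.

    Columns are prefixed 'T-0' (race day), 'T-1', ... up to `days_back`.
    """
    race_day = pd.Timestamp(race_day)
    indexed = weather.set_index("date")
    row = {}
    for offset in range(days_back, -1, -1):
        day = race_day - pd.Timedelta(days=offset)
        for column, value in indexed.loc[day].items():
            row[f"T-{offset} {column}"] = value
    return pd.Series(row)


features = weather_features(weather, "2018-09-16")
print(features)
```

The resulting row can then be joined onto every driver's record for that race, since all drivers share the same weather.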

Exploratory Data Analysis
After gathering the data, I decided on the following data points to be included in the machine learning model. My final data set consists of 3 sections. The first section is the primary data, such as the qualifying and practice timings:
'final_time', 'p1_time', 'p2_time', 'p3_time', 'qualifying_q1', 'qualifying_q2', 'qualifying_q3'
The second section is the weather data for all 3 days:
'T-0 Humidity',
'T-0 Max Wind Speed (km/h)', 'T-0 Maximum Temperature (°C)',
'T-0 Mean Temperature (°C)', 'T-0 Mean Wind Speed (km/h)',
'T-0 Minimum Temperature (°C)', 'T-1 Daily Rainfall Total (mm)',
'T-1 Max Wind Speed (km/h)', 'T-1 Maximum Temperature (°C)',
'T-1 Mean Temperature (°C)', 'T-1 Mean Wind Speed (km/h)',
'T-1 Minimum Temperature (°C)', 'T-2 Daily Rainfall Total (mm)',
'T-2 Max Wind Speed (km/h)', 'T-2 Maximum Temperature (°C)',
'T-2 Mean Temperature (°C)', 'T-2 Mean Wind Speed (km/h)',
'T-2 Minimum Temperature (°C)'
The third section is proxy psychological data, which includes the drivers' and teams' current standings. I hope this captures confidence level and current aptitude to a certain degree:
'constructor_points', 'constructor_pos',
'constructor_wins', 'driver_points', 'driver_pos', 'driver_wins',
'factor_q1', 'factor_q2', 'factor_q3', 'start_position'
These are the free data I found within my limited time. Since I have only 10 years of Singapore Grand Prix data and each race has ~20 drivers, I have ~200 rows altogether, which is extremely small for a data science project. Machine learning models generally work better with large data sets, often 100k+ rows/observations. Unless I scope the problem differently, I have to make do with 200 rows. I compensated by gathering as many features as possible and by using simple machine learning algorithms, such as decision trees and K-nearest neighbours, which can cope with small datasets. If I had the time, I would try fitting a Gaussian Mixture Model (GMM) or a Kernel Density Estimation (KDE) model to the data, and maybe generate more sample points from the fitted distribution to supplement the dataset.
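I have not done the KDE augmentation for this project, but as a minimal sketch of the idea, assuming scikit-learn's `KernelDensity` and using synthetic numbers in place of the real rows, it could look like this. The bandwidth of 0.3 and the three columns are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for ~200 rows of (p3_time, qualifying_q1, final_time)
# in seconds; these are NOT real race data.
p3 = rng.normal(loc=105.0, scale=2.0, size=(200, 1))
q1 = p3 - 1.0 + rng.normal(0.0, 0.5, size=(200, 1))
final = 61.0 * p3 + rng.normal(0.0, 30.0, size=(200, 1))
real_rows = np.hstack([p3, q1, final])

# Standardize first so one bandwidth is meaningful across all columns.
scaler = StandardScaler().fit(real_rows)
kde = KernelDensity(kernel="gaussian", bandwidth=0.3)
kde.fit(scaler.transform(real_rows))

# Draw new points from the fitted density and map them back to seconds.
synthetic_scaled = kde.sample(n_samples=500, random_state=0)
synthetic_rows = scaler.inverse_transform(synthetic_scaled)

augmented = np.vstack([real_rows, synthetic_rows])
print(augmented.shape)  # (700, 3)
```

If this were used for real, the synthetic rows should only ever be added to the training data. Evaluating on points sampled from the training distribution would make the scores look better than they are.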
After much data munging and processing (I can attest that 80% of a data scientist’s time is spent collecting and cleaning data to be fed into a model), we have to do some exploratory data analysis to get a feel for the data and a clearer idea of what we are working with.
First we plot the target variable, which is final_time in the regression case. We can see that the drivers' final timings are not normally distributed; they are in fact bimodal, in which case traditional statistical modeling may not perform as well as machine learning algorithms. We need a model that makes as few prior assumptions as possible; remember the Occam’s razor principle from philosophy.


For brevity, I made a few charts to get a ‘feel’ of the data, and these are a few interesting-looking ones that tell a lot about it. Can you find some interesting points and tell a story from them?
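import matplotlib.pyplot as plt
import seaborn as sns
# df is the processed dataset; pairwise scatter plots of the timing columns, colored by race year
g = sns.PairGrid(df, vars=['final_time', 'p1_time', 'p2_time', 'p3_time', 'qualifying_q1', 'qualifying_q2', 'qualifying_q3'],
                 hue='year')
g.map(plt.scatter, alpha=0.8)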
We can see from the above that there are a few outliers; see the light blue dots in the first row. They are data from 2017 and show drivers who performed really well in the practice and qualifying sessions but did really badly in the final race, which does not make sense. After some investigation, I have yet to understand the cause; if any of you have ideas, do let me know. I considered removing the outliers, weighing the benefits of removing them against the cost of an even smaller dataset. I decided to leave them in, as there was sufficient variance in the other independent variables for the models to still pick up the underlying patterns.
The next diagram shows the pairwise features with the drivers as color hues.

Below is the correlation matrix, which is the covariance matrix of the features rescaled by their standard deviations. A covariance matrix $C$ is positive semi-definite and hence satisfies $|C_{ij}|^2 \le C_{ii} C_{jj}$ for all indices $i, j$. Consequently, the absolute values of the entries of the corresponding correlation matrix do not exceed 1. It seems that the practice 3 timings have the highest correlation with our target variable.
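To make the link between covariance and correlation concrete, here is a small NumPy sketch on synthetic numbers (not the real race data). It builds the correlation matrix by hand from the covariance matrix and checks the bound stated above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for three timing columns (seconds); NOT real race data.
p3_time = rng.normal(105.0, 2.0, size=200)
qualifying_q1 = p3_time - 1.0 + rng.normal(0.0, 0.5, size=200)
final_time = 61.0 * p3_time + rng.normal(0.0, 60.0, size=200)

data = np.vstack([final_time, p3_time, qualifying_q1])  # one row per variable

cov = np.cov(data)               # covariance matrix C
std = np.sqrt(np.diag(cov))      # sqrt(C_ii) for each variable
corr = cov / np.outer(std, std)  # correlation: C_ij / sqrt(C_ii * C_jj)

# Check |C_ij|^2 <= C_ii * C_jj, with a tiny tolerance for rounding error.
diag_product = np.outer(np.diag(cov), np.diag(cov))
bound_holds = bool(np.all(cov**2 <= diag_product * (1 + 1e-12)))

print(np.round(corr, 2))
print(bound_holds)
```

With a pandas DataFrame the same matrix comes from `df.corr()`, and the first row or column then shows how strongly each feature moves with the target.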

Feature Engineering
The main consideration for feature engineering here is how to handle missing qualifying 1, qualifying 2 and qualifying 3 timings. The way the qualifying stages in Formula One work is that everyone races for the fastest lap in each stage, and the slowest few are cut off from the next stage. Q3 is then left with 10 of the 20 drivers, who race for pole position. Hence, there will be missing values for the Q2 and Q3 timings. The issue is how to deal with them. I considered imputing either 0 or the mean value of their qualifying timings, and chose the latter since it gives better CV results. I then engineered dummy variables to indicate whether a driver qualified for each round, so as to differentiate the genuinely good timings from the mean-imputed ones, which may give a false sense of caliber.
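A minimal pandas sketch of this imputation is below. It uses made-up times and assumes one reading of "mean value", namely each driver's own mean over the sessions he actually ran. The `reached_q1`/`reached_q2`/`reached_q3` indicator names are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Made-up qualifying times in seconds; NaN means the driver was eliminated earlier.
quali = pd.DataFrame(
    {
        "driver": ["A", "B", "C", "D"],
        "qualifying_q1": [97.8, 98.1, 98.6, 99.4],
        "qualifying_q2": [97.0, 97.5, 98.2, np.nan],
        "qualifying_q3": [96.0, 96.4, np.nan, np.nan],
    }
)

q_cols = ["qualifying_q1", "qualifying_q2", "qualifying_q3"]

# Build the indicators BEFORE imputing, while the NaNs are still visible:
# 1 if the driver actually set a time in that session, 0 if it will be imputed.
for col in q_cols:
    quali[f"reached_{col[-2:]}"] = quali[col].notna().astype(int)

# Each driver's mean over the sessions he ran (NaNs are skipped by default).
driver_mean = quali[q_cols].mean(axis=1)

# Fill every gap with that driver's own mean.
for col in q_cols:
    quali[col] = quali[col].fillna(driver_mean)

print(quali)
```

The indicator columns are what let a model tell a real Q3 lap apart from a filled-in one, since after imputation the two look the same in the timing columns.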
Reliability of CV in our special small dataset case
(Reserved for later)
Model building
Here comes the fun part. My dataset is split into 3 parts: a training set, a validation set and a test set. 80% is kept for training and validation, and is used to tune the hyper-parameters of the few machine learning models that perform best on this dataset. The remaining 20% is held out for the testing phase, which helps me decide on the final model to use. You can imagine how much smaller my dataset becomes after setting data aside. Woes of a small dataset. I thus expect the whole modelling process to either overfit the little data or, if I use too much regularization, underfit it by generalizing too much without capturing enough of the underlying pattern. Below are some snippets of code for model building, with their scores and performance metrics.
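As a minimal sketch of this setup (on synthetic numbers, not the real race data, and not my actual models or scores), the 80/20 split with cross-validated hyper-parameter tuning could be written with scikit-learn as follows. The K-nearest neighbours model, the parameter grid and the mean absolute error metric are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for ~200 rows with 5 features; NOT real race data.
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Hold out 20% as the final test set; the other 80% is used for training
# and, through cross-validation, for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling sits inside the pipeline so each CV fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsregressor__n_neighbors": [3, 5, 10, 20]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

test_mae = float(np.mean(np.abs(search.predict(X_test) - y_test)))
print("best params:", search.best_params_)
print("CV MAE:", -search.best_score_)
print("test MAE:", test_mae)
```

With only ~160 training rows, each of the 5 folds validates on about 32 rows, so the CV scores will be noisy. Comparing the CV error with the test error gives a rough sense of how much the tuning has overfitted.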


