Hasan Basri Akçay



Olympic Medal Numbers Predictions with Time Series, Part 3: Time Series Forecasting

Fbprophet, Darts, AutoTS, Arima, Sarimax, and Monte Carlo Simulation

In Part 1, we worked on data cleaning: imputing missing values, dropping constant columns, and matching misspelled words.

In Part 2, we worked on data analysis, such as finding trends, checking the data distribution, calculating p-values, and checking predictability. After the data analysis, we found important missing values in the 1980 Olympic Games. You can read the details in Part 2.

In this part, we apply different time series forecasting models to this dataset and compare their scores. The models are Fbprophet, Darts, AutoTS, Arima, Sarimax, and Monte Carlo Simulation.

Before starting, we import the libraries used for forecasting and set the size of the test set. They are listed below.

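import numpy as np
import pandas as pd
import darts.models  # makes darts.models.FFT and the other model classes resolvable
from darts import TimeSeries
from prophet import Prophet  # assumed import; the package was published as fbprophet before version 1.0
from pmdarima import auto_arima  # assumed source of auto_arima
from AutoTS.AutoTS import AutoTS
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from scipy.stats import norm

from sklearn.metrics import mean_squared_error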

test_size = 2

1. Fbprophet

Fbprophet is an open-source library developed by Facebook for univariate time series forecasting. It supports seasonality and holidays. It expects fixed column names: if you want to work with fbprophet, you should rename the time column to 'ds' and the value column to 'y'. You can find more information at https://facebook.github.io/prophet/docs/quick_start.html.

Fbprophet is a univariate model. For this reason, we forecast medals for each country separately and score the forecasts with root mean squared error (RMSE), averaged over all countries.
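As a minimal sketch of the column convention and the hold-out split described above, the snippet below builds a small frame in the 'ds'/'y' layout. The medal counts and dates are made up for illustration, and `wide` and `to_prophet_frame` are hypothetical stand-ins, not names from the original notebook.

```python
import pandas as pd

# Hypothetical wide table: one row per time step, one column per team.
wide = pd.DataFrame(
    {"USA": [10, 12, 9, 14, 11], "Norway": [3, 4, 2, 5, 4]},
    index=pd.date_range("2021-04-01", periods=5, freq="D"),
)


def to_prophet_frame(wide_df, team):
    """Return a two-column frame with the 'ds' and 'y' names Prophet expects."""
    return pd.DataFrame({"ds": wide_df.index, "y": wide_df[team].to_numpy()})


test_size = 2
frame = to_prophet_frame(wide, "USA")
# The last `test_size` rows are held out for scoring; the rest is training data.
train, test = frame[:-test_size], frame[-test_size:]
print(train.shape, test.shape)
```

The same split (everything except the last `test_size` rows for training) is what each loop in this article performs per team.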

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    
    model = Prophet(growth='linear')
    model.fit(train)
    future = model.make_future_dataframe(periods=test_size)
    forecast = model.predict(future)
    
    rmse = mean_squared_error(test['y'], forecast['yhat'][-test_size:], squared=False)
    total_rmse += rmse
print('Prophet RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

2. Darts

Darts is an open-source library developed by Unit8 for time series forecasting. It includes many different models for time series. In this work, we used FFT (Fast Fourier Transform), ExponentialSmoothing, RegressionModel, RandomForest, LightGBMModel, and the baseline models. Each model has its own capabilities: for example, FFT does not support multivariate time series forecasting, but RandomForest does. You can find more information at https://unit8co.github.io/darts/.

We forecast medals for each country and score the forecasts with root mean squared error (RMSE).
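Every section in this article repeats the same scoring pattern: hold out the last points of each team's series, forecast them, and average the per-team RMSE. The sketch below shows that pattern in plain Python. The helper names (`rmse`, `average_rmse`, `last_value_forecaster`) and the medal counts are hypothetical, and the placeholder forecaster only stands in for a real model.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def average_rmse(series_by_team, forecaster, test_size):
    """Hold out the last `test_size` points per team and average the RMSE."""
    scores = []
    for values in series_by_team.values():
        train, test = values[:-test_size], values[-test_size:]
        scores.append(rmse(test, forecaster(train, test_size)))
    return sum(scores) / len(scores)


def last_value_forecaster(train, horizon):
    """Placeholder model: repeat the last observed value."""
    return [train[-1]] * horizon


medals = {"A": [1, 2, 3, 4, 6], "B": [5, 5, 5, 5, 5]}  # made-up counts
print(average_rmse(medals, last_value_forecaster, test_size=2))
```

Any function with the signature `(train, horizon) -> list` can be plugged in as `forecaster`, which keeps the comparison between models on equal footing.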

2.1 Darts — FFT

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    
    Series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    
    # 'range' is an integer column, so the split point is an integer, not a Timestamp
    train, val = Series.split_before(len(Series) - test_size)
    
    model = darts.models.FFT()
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts FFT RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
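To give an intuition for FFT-based forecasting, here is a minimal NumPy sketch of the core idea: keep the strongest frequency components of the training series and extend the reconstructed signal periodically. It is an illustration under simplifying assumptions (no detrending, hypothetical function name `fft_forecast`), not a reproduction of the Darts model, which has more options.

```python
import numpy as np


def fft_forecast(values, horizon, n_freq=10):
    """Forecast by keeping the `n_freq` strongest frequencies and repeating the signal."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    spectrum = np.fft.rfft(values)
    # Indices of the coefficients with the largest magnitude.
    keep = np.argsort(np.abs(spectrum))[-n_freq:]
    filtered = np.zeros_like(spectrum)
    filtered[keep] = spectrum[keep]
    reconstructed = np.fft.irfft(filtered, n=n)
    # A Fourier reconstruction is periodic with period n, so future steps wrap around.
    future_idx = (n + np.arange(horizon)) % n
    return reconstructed[future_idx]


# A pure sine wave with period 8 is continued exactly.
t = np.arange(32)
wave = np.sin(2 * np.pi * t / 8)
print(fft_forecast(wave, horizon=4, n_freq=2))
```

Because the extension is purely periodic, this approach only suits series with a stable repeating pattern, which is one reason its score on medal counts may be weak.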

2.2 Darts — ExponentialSmoothing

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = pd.date_range(start='2021-04-01', end='2021-04-29', periods=len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    
    Series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    
    train, val = Series.split_before(pd.Timestamp('2021-04-28'))
    
    model = darts.models.ExponentialSmoothing()
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts ExponentialSmoothing RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
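The simplest member of the exponential smoothing family can be sketched in a few lines. The version below is simple exponential smoothing with a fixed `alpha`, which is an assumption made for illustration; the Darts model used above is a richer variant that can also handle trend and seasonality and fits its parameters itself.

```python
def simple_exponential_smoothing(values, horizon, alpha=0.5):
    """Flat forecast from an exponentially weighted level of the history."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = values[0]
    for observation in values[1:]:
        # New level: blend of the latest observation and the previous level.
        level = alpha * observation + (1 - alpha) * level
    return [level] * horizon


print(simple_exponential_smoothing([1, 2, 3], horizon=2, alpha=0.5))
```

A larger `alpha` makes the forecast follow recent observations more closely, while a smaller one smooths them out.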

2.3 Darts — RegressionModel

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    
    Series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    
    # 'range' is an integer column, so the split point is an integer, not a Timestamp
    train, val = Series.split_before(len(Series) - test_size)
    
    model = darts.models.RegressionModel(lags=1)
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts RegressionModel RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
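With `lags=1`, the model predicts each value from the one before it. The sketch below shows that idea with an ordinary least-squares line fitted in plain Python and a recursive multi-step forecast. It assumes a plain linear regressor, and the function names are hypothetical, so treat it as an illustration of the technique rather than the Darts internals.

```python
def fit_lag1(values):
    """Least-squares fit of y[t] = intercept + slope * y[t-1]."""
    x, y = values[:-1], values[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        raise ValueError("lagged values are constant; slope is undefined")
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_lag1(values, horizon):
    """Forecast recursively: each prediction becomes the next step's input."""
    slope, intercept = fit_lag1(values)
    forecasts, previous = [], values[-1]
    for _ in range(horizon):
        previous = intercept + slope * previous
        forecasts.append(previous)
    return forecasts


print(forecast_lag1([1, 2, 3, 4, 5], horizon=2))
```

RandomForest and LightGBMModel in the next sections use the same lagged input, only with a tree-based regressor in place of the straight line.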

2.4 Darts — RandomForest

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    
    Series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    
    # 'range' is an integer column, so the split point is an integer, not a Timestamp
    train, val = Series.split_before(len(Series) - test_size)
    
    model = darts.models.RandomForest(lags=1)
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts RandomForest RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

2.5 Darts — LightGBMModel

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    
    Series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    
    # 'range' is an integer column, so the split point is an integer, not a Timestamp
    train, val = Series.split_before(len(Series) - test_size)
    
    model = darts.models.LightGBMModel(lags=1)
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts LightGBMModel RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

2.6 Darts — Baseline Models

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['range'] = np.arange(len(df_athletics_timeseries))
    train_test['y'] = df_athletics_timeseries[team].values
    
    Series = TimeSeries.from_dataframe(train_test, 'range', 'y')
    
    # 'range' is an integer column, so the split point is an integer, not a Timestamp
    train, val = Series.split_before(len(Series) - test_size)
    
    model = darts.models.NaiveDrift()
    model.fit(train)
    prediction = model.predict(len(val))
    rmse = mean_squared_error(val.values(), prediction.values(), squared=False)
    total_rmse += rmse
print('Darts NaiveDrift RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
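The naive drift baseline is easy to write by hand: it draws a straight line through the first and last training points and extends it. The sketch below is a plain-Python illustration with a hypothetical function name and made-up numbers.

```python
def naive_drift(values, horizon):
    """Extend the line through the first and last observations."""
    if len(values) < 2:
        raise ValueError("need at least two observations to estimate a drift")
    slope = (values[-1] - values[0]) / (len(values) - 1)
    return [values[-1] + slope * step for step in range(1, horizon + 1)]


print(naive_drift([1, 3, 5, 7], horizon=2))
```

A baseline like this is useful as a sanity check: a more complex model that cannot beat it is probably not learning much from the data.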

3. AutoTS

AutoTS is an open-source library that automates time series forecasting. It supports multivariate time series forecasting. You can find more information at https://winedarksea.github.io/AutoTS/build/html/source/tutorial.html.

df_athletics_timeseries.index = pd.date_range(start='2021-04-01', end='2021-04-29', periods=len(df_athletics_timeseries))
total_rmse = 0
for team in df_athletics['Team'].unique():
    train = df_athletics_timeseries[:-test_size]
    test = df_athletics_timeseries[-test_size:]
    
    model = AutoTS()
    model.fit(train, series_column_name=team)
    preds = model.predict(start=pd.to_datetime('2021-04-28 00:00:00'), end=pd.to_datetime('2021-04-29 00:00:00'))
    rmse = mean_squared_error(test[team], preds, squared=False)
    total_rmse += rmse
print('AutoTS RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

4. Arima

Arima (Autoregressive Integrated Moving Average) is a statistical model for time series. It can be used to better understand the data set or to predict future trends. You can find more information at https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    
    stepwise_fit = auto_arima(train['y'], trace=False, suppress_warnings=True)
    model = ARIMA(train['y'], order=stepwise_fit.order)
    model_fit = model.fit()
    preds = model_fit.forecast(test_size)
    rmse = mean_squared_error(test['y'], preds, squared=False)
    total_rmse += rmse
print('Arima RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

5. Sarimax

Sarimax (Seasonal ARIMA with exogenous regressors) is a statistical analysis model. The difference between arima and sarimax is that sarimax supports seasonality handling. We also used sarimax for data understanding in part 2. You can find more information about sarimax at https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    
    stepwise_fit = auto_arima(train['y'], trace=False, suppress_warnings=True)
    model = SARIMAX(train['y'], order=stepwise_fit.order)
    model_fit = model.fit()
    preds = model_fit.forecast(test_size)
    rmse = mean_squared_error(test['y'], preds, squared=False)
    total_rmse += rmse
print('Sarimax RMSE: ', total_rmse/len(df_athletics['Team'].unique()))

6. Monte Carlo Simulation

Monte Carlo simulation is a forecasting approach used when outcomes cannot easily be predicted because of the influence of random variables. It generates many random paths (the simulation number) according to properties of the data such as standard deviation and variance, and then selects the path that best fits the training data as the forecast.

simulation_num = 500
days_to_test = 27
days_to_predict = 2
total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    
    ########### Monte Carlo
    
    daily_return = np.log(1 + train['y'].pct_change())
    daily_return.replace([np.inf, -np.inf], 0, inplace=True)
    daily_return.replace(np.nan, 0, inplace=True)
    average_daily_return = daily_return.mean()
    variance = daily_return.var()
    drift = average_daily_return - (variance/2)
    standard_deviation = daily_return.std()
    
    predictions = np.zeros(days_to_test+days_to_predict)
    predictions[0] = train['y'][0]
    pred_collection = np.ndarray(shape=(simulation_num, days_to_test+days_to_predict))
    
    for j in range(0, simulation_num):
        for i in range(1,days_to_test+days_to_predict):
            random_value = standard_deviation * norm.ppf(np.random.rand())
            predictions[i] = predictions[i-1] * np.exp(drift + random_value)
        pred_collection[j] = predictions
        
    differences = np.array([])
    for k in range(0, simulation_num):
        # Compare over the same time steps: simulated index i corresponds to training day i.
        difference_arrays = np.subtract(train['y'].values[-days_to_test:], pred_collection[k][:days_to_test])
        difference_values = np.sum(np.abs(difference_arrays))
        differences = np.append(differences,difference_values)
    
    best_fit = np.argmin(differences)
    best_pred = pred_collection[best_fit]
    
    ###########
    
    rmse = mean_squared_error(test['y'], best_pred[-days_to_predict:], squared=False)
    total_rmse += rmse
print('Monte Carlo Simulation RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
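
The simulation step above can also be sketched with only the standard library. This is a minimal illustration, not the article's exact code: the function names `simulate_paths` and `best_fit_forecast` are hypothetical, steps with non-positive values are skipped instead of being replaced by zero, and each simulated path is compared with the history over the same time steps before its last values are taken as the forecast.

```python
import math
import random
import statistics


def simulate_paths(history, horizon, n_sims=500, seed=0):
    """Simulate random-walk paths with drift, each starting from history[0]."""
    rng = random.Random(seed)
    # Log returns between consecutive observations; skip steps where either
    # value is non-positive, because the log is undefined there (assumption:
    # the article replaces these with 0 instead).
    returns = [
        math.log(b / a)
        for a, b in zip(history, history[1:])
        if a > 0 and b > 0
    ]
    variance = statistics.variance(returns)
    drift = statistics.fmean(returns) - variance / 2
    sigma = math.sqrt(variance)

    length = len(history) + horizon
    paths = []
    for _ in range(n_sims):
        path = [history[0]]
        for _ in range(1, length):
            shock = sigma * rng.gauss(0, 1)
            path.append(path[-1] * math.exp(drift + shock))
        paths.append(path)
    return paths


def best_fit_forecast(history, horizon, n_sims=500, seed=0):
    """Return the last `horizon` values of the path closest to `history`."""
    paths = simulate_paths(history, horizon, n_sims, seed)
    n = len(history)
    # Sum of absolute differences over the overlapping time steps.
    errors = [
        sum(abs(h - p) for h, p in zip(history, path[:n]))
        for path in paths
    ]
    best = paths[errors.index(min(errors))]
    return best[n:]


# Hypothetical series, for illustration only.
history = [10, 12, 11, 13, 14, 13, 15]
print(best_fit_forecast(history, horizon=2))
```

Fixing the seed makes the selected path reproducible, which helps when comparing this approach with the other models.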

7. Mean Prediction

Mean prediction always predicts the mean value of the training data. It is useful as a baseline model. If machine learning models score better (lower RMSE) than the mean prediction, we can say that their predictions are not random.

total_rmse = 0
for team in df_athletics['Team'].unique():
    train_test = pd.DataFrame()
    train_test['ds'] = df_athletics_timeseries.index
    train_test['y'] = df_athletics_timeseries[team].values
    
    train = train_test[:-test_size]
    test = train_test[-test_size:]
    
    pred = train['y'].mean()
    
    rmse = mean_squared_error(test['y'], [pred] * len(test), squared=False)
    total_rmse += rmse
print('Mean Prediction RMSE: ', total_rmse/len(df_athletics['Team'].unique()))
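
For readers who want to see what the score measures, the baseline above can be written without scikit-learn. This is a small sketch under stated assumptions: the helper names `rmse` and `mean_baseline_rmse` and the example numbers are hypothetical. It can also serve as a fallback, since newer scikit-learn releases deprecate the `squared` argument of `mean_squared_error`.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_baseline_rmse(series, test_size):
    """Score a constant forecast equal to the mean of the training part."""
    train, test = series[:-test_size], series[-test_size:]
    pred = sum(train) / len(train)
    # Repeat the training mean once per test point.
    return rmse(test, [pred] * len(test))


# Hypothetical medal counts for one team, for illustration only.
example = [2, 0, 3, 1, 4, 2, 4]
print(mean_baseline_rmse(example, test_size=2))
```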

8. Results

You can see the scores below. Except for arima and sarimax, we did no hyperparameter tuning for this work and just used the default parameters of the models. With hyperparameter tuning, the scores could improve.

Fbprophet RMSE: 3.180356233094025
Darts FFT RMSE: 1.7806018079168815
Darts ExponentialSmoothing RMSE: 2.125171341043955
Darts RegressionModel RMSE: 1.530246244439605
Darts RandomForest RMSE: 1.5932785912162637
Darts LightGBMModel RMSE: 1.6763816820500097
AutoTS RMSE: 1.4205065327167645
Arima RMSE: 1.5376332124117644
Sarimax RMSE: 1.8889240559227485
Monte Carlo Simulation RMSE: 1.769909711992457
Mean Prediction RMSE: 1.6757768602609089
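
To read the scores more easily, they can be sorted and compared with the mean-prediction baseline. The snippet below is only a convenience sketch: the values are copied from the results listed above, and the variable names are illustrative.

```python
# RMSE values copied from the results listed above (lower is better).
scores = {
    "Fbprophet": 3.180356233094025,
    "Darts FFT": 1.7806018079168815,
    "Darts ExponentialSmoothing": 2.125171341043955,
    "Darts RegressionModel": 1.530246244439605,
    "Darts RandomForest": 1.5932785912162637,
    "Darts LightGBMModel": 1.6763816820500097,
    "AutoTS": 1.4205065327167645,
    "Arima": 1.5376332124117644,
    "Sarimax": 1.8889240559227485,
    "Monte Carlo Simulation": 1.769909711992457,
    "Mean Prediction": 1.6757768602609089,
}

baseline = scores["Mean Prediction"]
ranked = sorted(scores.items(), key=lambda item: item[1])
better_than_baseline = [name for name, value in ranked if value < baseline]

# Mark every model that beats the baseline with an asterisk.
for name, value in ranked:
    marker = "*" if value < baseline else " "
    print(f"{marker} {name:<28}{value:.4f}")
```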

Discussion

According to the results, four models score better (lower RMSE) than the mean prediction: Darts RegressionModel, Darts RandomForest, Arima, and AutoTS. AutoTS has the best score.

First, when we compare the arima and sarimax scores, arima scores better than sarimax. The reason is that there is no seasonality effect on medal numbers. For the same reason, fbprophet scores badly: the seasonality effect is not turned off in fbprophet's default parameters. After turning off the seasonality effect, the fbprophet score is 1.67.

The number of medals a country wins at the Olympics depends on the number of medals won by other countries. So models that support multiple input variables have an advantage for this forecasting task. That is why AutoTS has the best score for this problem. Keep in mind that if we work with another dataset that has a seasonality effect and just one input variable, the results may change.

👋 Thanks for reading. If you enjoy my work, don’t forget to like and follow me on Medium and on LinkedIn. It will motivate me to offer more content to the Medium community! 😊

References:

[1]: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
[2]: https://www.kaggle.com/hasanbasriakcay/which-country-is-good-at-which-sports-in-olympics
[3]: https://www.kaggle.com/hasanbasriakcay/which-country-is-good-at-which-sports-in-olympics
[4]: https://2001-2009.state.gov/r/pa/ho/time/qfp/104481.htm
