
Predicting stocks: Not a trivial matter!

You have probably seen a lot of tutorials on using Time Series Analysis to predict the stock market. In reality, even field experts often have trouble making accurate predictions. The natural question that should come to your mind is “why is it so hard to predict stocks?”. As a little exercise, I decided to put my knowledge of Time Series Analysis to the test and attempted to build a prediction model for the closing stock prices of a certain company, call it X (for legal reasons), based on three months of historical data. What were the results? Let’s explore together!

Loading the packages

Let’s first load the packages that we will need for this tutorial:
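A minimal sketch of the loading step. The package list is an assumption: xts and forecast are named in this tutorial, while tseries (for the ADF and KPSS tests) and ggplot2 (for the plots) are guesses at what the later steps need. The custom theme package is left out.

```r
# Assumed package list (see the note above); adjust it to your own setup.
pkgs <- c("xts", "forecast", "tseries", "ggplot2")

# Check which packages are installed without attaching anything yet.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (any(!installed)) {
  message("Missing packages: ", paste(pkgs[!installed], collapse = ", "),
          " (install them with install.packages())")
}

# Attach only the packages that are actually available.
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```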

Note that I have used a particular package called ggtheme. This allows personalizing ggplot plots with different backgrounds and the like. Here, I have used the custom theme theme_stonks, but you could use something else!

The data

Let’s first take a look at the data in R (which you can download here); you will have to change the PATH according to your download location:

The picture above shows the output you should see. For this tutorial, we will only work with the Date column and the Close prices column (you could try on your own with the Adj Close one instead!).

Preprocessing the data

We will now extract the columns of interest, as well as convert them to appropriate time series objects; in this case, xts objects from the xts package.

Note that first, we convert the Date column to a POSIXct format, as this is required by the xts constructor.
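As a rough sketch of this conversion, assuming a data frame with the Date and Close columns mentioned above: the values below are made up, the name stock_data is hypothetical, and the UTC time zone is my own choice.

```r
# Hypothetical stand-in for the downloaded CSV: a Date column stored as text
# and a Close column with made-up prices.
stock_data <- data.frame(
  Date  = c("2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05", "2020-03-06"),
  Close = c(41.1, 41.3, 40.2, 39.5, 38.9),
  stringsAsFactors = FALSE
)

# Convert the text dates to POSIXct (UTC keeps the day boundaries unambiguous).
dates <- as.POSIXct(stock_data$Date, format = "%Y-%m-%d", tz = "UTC")

# Build the xts object only if the xts package is available.
if (requireNamespace("xts", quietly = TRUE)) {
  close_xts <- xts::xts(stock_data$Close, order.by = dates)
  print(head(close_xts))
}
```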

Inspecting the data

Let’s now inspect the data by plotting the points using the autoplot function:

The plot above shows the closing price for the stocks of company X from March 03 to May 01, using Monday-to-Friday data. What do we see? The price goes a little crazy, starting at around 41.30$ on March 04 and dropping to a disastrous ~33.0$ on March 14. After this, we notice a recovery, but with volatility almost every day nonetheless. On May 01, we see a drop of roughly 2.5$ in price. So what can we conclude from all of this? Well… not much, really. Knowing all of this gives us some cues, but it doesn’t really help us see into the future!

Let’s now inspect the ACF and PACF to check for stationarity:

From the ACF plot, we see that our raw series is clearly not stationary, as almost half of the lags fall outside the confidence bounds. As for the PACF, the initial spike followed by a roughly exponential decay (in absolute value) suggests that some kind of AR model might be appropriate.
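To make the "lags outside the confidence bounds" check concrete, here is a small sketch in base R. The series is a simulated random walk standing in for the 42 closing prices (an assumption, not the real data), and the bounds use the usual white-noise approximation.

```r
set.seed(1)
# Hypothetical stand-in for 42 daily closing prices: a random walk around 38.
close <- 38 + cumsum(rnorm(42, sd = 0.8))

# Sample ACF and PACF up to lag 15, without drawing the plots.
acf_vals  <- drop(acf(close, lag.max = 15, plot = FALSE)$acf)[-1]  # drop lag 0
pacf_vals <- drop(pacf(close, lag.max = 15, plot = FALSE)$acf)

# Approximate 95% bounds under the white-noise hypothesis: +/- 1.96 / sqrt(n).
bound <- qnorm(0.975) / sqrt(length(close))

# Lags whose (partial) autocorrelation falls outside the bounds.
which(abs(acf_vals) > bound)
which(abs(pacf_vals) > bound)
```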

Estimating the trend

Of course, the most obvious first step is to fit some models for the trend. We will now estimate and plot a bunch of different trends for our data:

Let’s see what happened here: the tslm function allows us to fit linear models to ts objects (which is why we cast from xts!). We fit a linear and an order-5 polynomial trend, along with an order-5 moving average. We then stack all the trends in a data frame and plot them together using the forecast::autoplot() and geom_line() functions. The scale_color_manual() function allows us to create appropriate labels with colours for each fitted trend.
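The three trend estimates can be sketched in base R as follows. Note the swap: this uses lm() and stats::filter() instead of tslm, and a simulated series instead of the real prices, so treat it as an illustration of the idea rather than the code behind the plot.

```r
set.seed(1)
close <- 38 + cumsum(rnorm(42, sd = 0.8))  # hypothetical closing prices
t_idx <- seq_along(close)                  # time index 1, 2, ..., 42

# Linear and order-5 polynomial trends, fitted by ordinary least squares.
fit_linear <- lm(close ~ t_idx)
fit_poly5  <- lm(close ~ poly(t_idx, 5))

# Centred order-5 moving average; the first and last two values are NA.
ma5 <- stats::filter(close, rep(1 / 5, 5), sides = 2)

# Stack the fitted trends next to the data, ready for plotting.
trends <- data.frame(
  t      = t_idx,
  close  = close,
  linear = fitted(fit_linear),
  poly5  = fitted(fit_poly5),
  ma5    = as.numeric(ma5)
)
head(trends)
```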

Note that although all of this looks very pretty, its intention is not to give us predictions, but rather to help us create a process without the trend; otherwise, the analysis becomes much harder than it already is. In this case, I decided to use the order-five polynomial, although you could try something else. We subtract the estimated trend from the original data and inspect the residuals, along with their ACF and PACF plots.
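A minimal sketch of the detrending step, again on a simulated series and with lm() standing in for tslm (both are assumptions for the sake of a self-contained example):

```r
set.seed(1)
close <- 38 + cumsum(rnorm(42, sd = 0.8))  # hypothetical closing prices
t_idx <- seq_along(close)

# Estimate the order-5 polynomial trend and subtract it from the data.
trend_hat <- fitted(lm(close ~ poly(t_idx, 5)))
detrended <- close - trend_hat

# With an intercept in the model, least-squares residuals average to zero.
mean(detrended)

# ACF and PACF of the detrended series (set plot = TRUE to draw them).
acf(detrended, plot = FALSE)
pacf(detrended, plot = FALSE)
```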

The residuals look zero-trended.

All of the ACF lags except one fall within the 0.25 confidence bounds.

The PACF values of the residuals mostly fall within the confidence bounds; however, there seems to be some negative autocorrelation present across lags. Still, from everything above, there doesn’t seem to be a strong seasonal component present.

Train-test split

We will now split the data into 32 training data points and 10 test data points. We will produce predictions and compare them against the test points to assess the fit. We also check the formats of the resulting objects to make sure everything is in order.
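Here is one way the 32/10 split could look on a plain ts object. The series is simulated, and indexing by observation number rather than by date is a simplification.

```r
set.seed(1)
close <- ts(38 + cumsum(rnorm(42, sd = 0.8)))  # hypothetical closing prices

n_train <- 32
train <- window(close, end = n_train)        # observations 1 to 32
test  <- window(close, start = n_train + 1)  # observations 33 to 42

# Check the formats and sizes of the resulting objects.
class(train); length(train)
class(test);  length(test)
```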

Fitting an ARIMA model

The next step is fitting an ARIMA (or SARIMA) model: we use the auto.arima() function from the forecast package, allowing for seasonal search, with a maximum differencing order of d=2, and with selection based on AIC, AICc, and BIC.

We obtain an ARIMA(1,1,0) as our best model, that is, a model to which applying differencing once would yield an AR(1) process. We can inspect this model and check the estimated coefficients, log-likelihood, and information criteria:
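To see what an ARIMA(1,1,0) fit looks like in code, here is a sketch that fixes the order by hand with base R's arima() instead of searching for it automatically. The data are simulated, so the numbers will not match the ones reported in this tutorial.

```r
set.seed(1)
close <- ts(38 + cumsum(rnorm(42, sd = 0.8)))  # hypothetical closing prices
train <- window(close, end = 32)

# ARIMA(1,1,0): difference once (d = 1), then model the differences as AR(1).
fit <- arima(train, order = c(1, 1, 0))

fit          # coefficient with its standard error, sigma^2, log-likelihood, AIC
coef(fit)    # the AR(1) coefficient, named "ar1"
logLik(fit)
AIC(fit); BIC(fit)
```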

Inspecting the residuals

So, how good is our model? We can use the checkresiduals function to obtain a plot of the residuals, the ACF, distribution, and a Ljung-Box test output as well.

We can observe from the residuals that although the process seems to be roughly mean-zero, there is one point at which it goes completely off. You could also argue that all of the lags fall within the confidence bounds, and that the residuals look roughly normal (given that we don’t have much data to start with). Taken together, this suggests that there is one major outlier, which we will ignore. The Ljung-Box test returns a huge value. Keep in mind that this test checks the residuals for leftover autocorrelation rather than the series for stationarity, and that an ARIMA(1,1,0) already implies that the raw series only becomes stationary after differencing once. To test for stationarity directly, the right tools are the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is non-stationarity, and the KPSS test, whose null hypothesis is stationarity.
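For reference, the Ljung-Box part of the residual check can be run on its own with base R's Box.test(). This is a sketch on simulated data; the choice of 10 lags is an assumption.

```r
set.seed(1)
close <- ts(38 + cumsum(rnorm(42, sd = 0.8)))  # hypothetical closing prices
fit   <- arima(window(close, end = 32), order = c(1, 1, 0))

# Ljung-Box test on the model residuals. H0: no autocorrelation up to lag 10.
# fitdf = 1 removes one degree of freedom for the estimated AR coefficient.
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
lb

# A small p-value points to leftover autocorrelation in the residuals;
# a large one means the residuals are compatible with white noise.
lb$p.value
```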

So here, we reject for the ADF test, and fail to reject for the KPSS test. This indicates that the differenced process is indeed stationary. Note that we difference the series first! We can also check the other residual plots (using only the train-data dates!):
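A sketch of the two tests, assuming the tseries package provides them (adf.test and kpss.test) and using a simulated series in place of the real prices:

```r
set.seed(1)
close   <- 38 + cumsum(rnorm(42, sd = 0.8))  # hypothetical closing prices
d_close <- diff(close)                       # difference the series first

if (requireNamespace("tseries", quietly = TRUE)) {
  # ADF: H0 is a unit root (non-stationarity); rejecting supports stationarity.
  print(tseries::adf.test(d_close))
  # KPSS: H0 is (level) stationarity; failing to reject supports stationarity.
  print(tseries::kpss.test(d_close))
}
```

Both functions may warn that the p-value lies outside their lookup tables; that is expected and only means the reported p-value is a bound.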

In particular, the inverse root of the AR(1) polynomial lies inside the unit circle, which confirms that the differenced process is stationary and causal. Since the model has no MA part, it is also trivially invertible.
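The inverse-root check is easy to reproduce by hand for an AR(1): the polynomial is 1 - phi z, its root is 1/phi, and so the inverse root is phi itself. A sketch on simulated data:

```r
set.seed(1)
close <- ts(38 + cumsum(rnorm(42, sd = 0.8)))  # hypothetical closing prices
fit   <- arima(window(close, end = 32), order = c(1, 1, 0))
phi   <- unname(coef(fit)["ar1"])

# AR(1) polynomial: 1 - phi * z. Its root is 1 / phi, so the inverse root is phi.
root     <- polyroot(c(1, -phi))
inv_root <- 1 / root

# Stationary and causal if the inverse root lies inside the unit circle.
Mod(inv_root) < 1
```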

Forecasting

Let’s now produce a table with the point forecast values, along with the errors and confidence intervals for the predictions:
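Such a table can be sketched with base R's predict() on an arima() fit. The data are simulated, the column names are made up, and the intervals use the normal approximation.

```r
set.seed(1)
close <- ts(38 + cumsum(rnorm(42, sd = 0.8)))  # hypothetical closing prices
fit   <- arima(window(close, end = 32), order = c(1, 1, 0))

# Point forecasts and standard errors for the 10 held-out days.
fc <- predict(fit, n.ahead = 10)

# Normal-theory 80% and 95% intervals around each point forecast.
z80 <- qnorm(0.90)
z95 <- qnorm(0.975)
fc_table <- data.frame(
  step     = 1:10,
  forecast = as.numeric(fc$pred),
  se       = as.numeric(fc$se),
  lo80     = as.numeric(fc$pred - z80 * fc$se),
  hi80     = as.numeric(fc$pred + z80 * fc$se),
  lo95     = as.numeric(fc$pred - z95 * fc$se),
  hi95     = as.numeric(fc$pred + z95 * fc$se)
)
round(fc_table, 2)
```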

Next, we extract the values as plain vectors for plotting: we pad them with a bunch of NA values so that everything can be plotted together.
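The padding trick looks roughly like this (simulated data; the data frame and column names are hypothetical):

```r
set.seed(1)
close <- ts(38 + cumsum(rnorm(42, sd = 0.8)))  # hypothetical closing prices
fit   <- arima(window(close, end = 32), order = c(1, 1, 0))
fc    <- as.numeric(predict(fit, n.ahead = 10)$pred)

# Pad the 10 forecasts with 32 leading NAs so they line up with days 33 to 42.
fc_padded <- c(rep(NA, 32), fc)

plot_df <- data.frame(
  day      = 1:42,
  observed = as.numeric(close),
  forecast = fc_padded
)
tail(plot_df, 12)
```

Because the forecast column has the same length as the observed series, both can be drawn on the same axis, and the NA entries are simply skipped when plotting.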

If we plot just the forecasts directly, we obtain the following:

But wait, what happened with the scale? Somehow, the xts and ts objects are not entirely compatible, so this happens. We can, however, correct this manually as follows:

Which produces

And that’s the cool plot you saw at the beginning of the article!

For this exercise, I originally had 42 data points (that is, weekday data for 3 months; I leave the math to you), of which I used 32 for training and 10 for testing. The results after adjusting the trend are shown in the graph above: the squiggly lines represent the original data points, while the almost straight line at the end represents the predictions from the model. They look pretty close to the real ones, eh?

Here’s the catch: notice the blue area and the bigger area around it? These are the 80% and 95% confidence bounds, respectively. They say that 80% and 95% of the time, respectively, the real value (as opposed to our prediction) will fall inside that interval. In this case, these are huge! In particular, the farther we predict into the future, the wider they become. Notice, for instance, the one for May 01: the 95% bounds sit at roughly 34$ and 15$. In a real-world situation, a prediction with this much uncertainty is disastrous; it is about as good as guessing tomorrow’s value by eye (or even worse). The truth is, even domain professionals often have a hard time making these kinds of predictions. This shows just how hard it is to predict the stock market.
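Why do the bounds widen with the horizon? For an ARIMA(1,1,0), the h-step forecast error accumulates the psi-weights psi_j = 1 + phi + ... + phi^j, so its standard error is sigma * sqrt(psi_0^2 + ... + psi_{h-1}^2), which grows with h. A small sketch, where phi and sigma are assumed values chosen for illustration and not estimates from the data:

```r
# Assumed values for illustration only: AR coefficient and innovation sd.
phi   <- 0.2
sigma <- 0.8
h     <- 1:10

# psi-weights of an ARIMA(1,1,0): psi_j = 1 + phi + ... + phi^j, for j = 0..9.
psi <- cumsum(phi^(0:9))

# h-step forecast standard error: sigma * sqrt(psi_0^2 + ... + psi_{h-1}^2).
se_h <- sigma * sqrt(cumsum(psi^2))

# Half-widths of the 80% and 95% intervals at each horizon.
data.frame(h = h, half80 = qnorm(0.90) * se_h, half95 = qnorm(0.975) * se_h)
```

Each extra step adds a positive term under the square root, so the bands can only get wider as we look further ahead.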

Disclaimer

Although this exercise was based on very real data, as recent as May 06, 2020, it is no more than an educational toy exercise and does not represent in any way a professional analysis or opinion. I don’t recommend doing this kind of analysis on your own, and I am not liable in any way: by reading these tutorials, you accept that whatever happens is your own responsibility if you do decide to use them for that purpose. If you wish to invest in the stock market, you should seek advice from a professional in the field!

Last words

Make sure to check out my tutorial series “A Complete Introduction to Time Series Analysis (with R)” (https://readmedium.com/a-complete-introduction-to-time-series-analysis-with-r-9882f2d44c9d), which is based on a full book that I am currently writing. Also, you can find the full code and data for this tutorial at https://github.com/JairParra/Stock_market_prediction. Stay tuned, and happy learning!

Follow me at

1. https://www.linkedin.com/in/hair-parra-526ba19b/
2. https://github.com/JairParra

Copyright