-analysis-eda-of-secs-quarterly-data-summary-455e62ff4817">companies’ fundamental data in order to predict future stock prices</a>, some companies report them quarterly, while others twice a year. Some do so in November, while others in August. If you would wait until everyone provides their data, you might come up with a precise prediction, but at that time all the lucrative stock would be sold out.</p><figure id="3886"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*VSkoXXHY_muJAnQoHpethw.png"><figcaption>Image by Author (<a href="https://about.canva.com/license-agreements/onedesign/">license</a>)</figcaption></figure><blockquote id="b517"><p>Usually 80% of the inputs hold the same pattern and 20% including some imporant scenarios have unusual structure you have to recalculate.</p></blockquote><p id="dcf4">To work around these data leaks, you can upsample to average.</p><h2 id="1d6c">Upsample with an average</h2><p id="8863">I’m talking about the option to smooth the inconsistently reported data, so that you can</p><div id="4afe"><pre>[IN]:
data=<span class="hljs-string">"""
CAT|DATE|VALUE
abc|0101|10
abc|0103|20
abc|0105|15
efg|0102|40
efg|0105|30"""</span></pre></div><div id="9138"><pre>[OUT]:
CAT|<span class="hljs-type">DATE</span>|<span class="hljs-type">VALUE</span>
abc|<span class="hljs-type">0102</span>|<span class="hljs-type">10</span>
abc|<span class="hljs-type">0103</span>|<span class="hljs-type">10</span> # <span class="hljs-number">20</span> on <span class="hljs-number">0103</span> covers interval <span class="hljs-number">0102</span><span class="hljs-number">-0103</span> (<span class="hljs-number">2</span> days)
abc|<span class="hljs-type">0104</span>|<span class="hljs-type">7</span><span class="hljs-number">.5</span>
abc|<span class="hljs-type">0105</span>|<span class="hljs-type">7</span><span class="hljs-number">.5</span> # <span class="hljs-number">15</span> on <span class="hljs-number">0105</span> covers interval <span class="hljs-number">0104</span><span class="hljs-number">-0105</span> (<span class="hljs-number">2</span> days)
efg|<span class="hljs-type">0103</span>|<span class="hljs-type">10</span>
efg|<span class="hljs-type">0104</span>|<span class="hljs-type">10</span>
efg|<span class="hljs-type">0105</span>|<span class="hljs-type">10</span> # <span class="hljs-number">30</span> on <span class="hljs-number">0105</span> covers interval <span class="hljs-number">0103</span><span class="hljs-number">-0105</span> (<span class="hljs-number">3</span> days)</pre></div><p id="733f">The example above is quite simple, but imagine you have thousands of categories. Some reported daily, others bi-daily in odd or even days, the rest every three days or irregularly.</p><h2 id="ebb4">How to do the average magic</h2><p id="4582">In order to calculate the average, you need two things — the <code>sum</code> divided by the <code>count</code>. You know the sum, because it was reported, but the number of days (months, weeks …) since the last value is unknown. Luckily, pandas resampling contains a parameter <code>.group</code> which exposes, how many groups were created when upsampled.</p><p id="cd5a">Unfortunately, it doesn’t work natively with <code>groupby</code> so we’ll look on a single category first.</p><div id="7afc"><pre><span class="hljs-attr">data</span>=<span class="hljs-string">"""
CAT|DATE|VALUE
abc|0101|10
abc|0103|20
abc|0106|15"""</span></pre></div><p id="dafe">The <a href="https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html">resample</a> method has two attributes:</p><ul><li><code>indices</code> — showing where the input data appeared</li><li><code>groups</code> — indexing each DateTime value with its group index.</li></ul><div id="61ff"><pre><span class="hljs-comment"># indices</span>
[<span class="hljs-keyword">In</span>]: df.set_index(<span class="hljs-string">"DATE"</span>).resample(<span class="hljs-string">"D"</span>).indices
[Out]:
defaultdict(list,
{Timestamp(<span class="hljs-string">'1900-01-01 00:00:00'</span>, <span class="hljs-attribute">freq</span>=<span class="hljs-string">'D'</span>): [0],
Timestamp(<span class="hljs-string">'1900-01-03 00:00:00'</span>, <span class="hljs-attribute">freq</span>=<span class="hljs-string">'D'</span>): [1],
Timestamp(<span class="hljs-string">'1900-01-06 00:00:00'</span>, <span class="hljs-attribute">freq</span>=<span class="hljs-string">'D'</span>): [2]})</pre></div><div id="4ea2"><pre><span class="hljs-comment"># groups</span>
[<span class="hljs-keyword">In</span>]: df.set_index(<span class="hljs-string">"DATE"</span>).resample(<span class="hljs-string">"D"</span>).groups
[Out]:
{Timestamp(<span class="hljs-string">'1900-01-01 00:00:00'</span>, <span class="hljs-attribute">freq</span>=<span class="hljs-string">'D'</span>): 1,
Timestamp(<span class="hljs-string">'1900-01-02 00:00:00'</span>, <span class="hljs-attribute">freq</span>=<span class="hljs-string">'D'</span>): 1,
Timestamp(<span class="hljs-string">'1900-01-03 00:00:00'</span>, <span class="hljs-attribute">freq</span>=<span class="hljs-string">'D'</span>): 2,
Timestamp(<span class="hljs-string">'1900-01-04 00:00:00'</span>, <span class="hljs-attribute">freq</span>=<span class="hljs-string">'D'</span>): 2,
Timestamp(<span class="hljs-string">'1900-01-05 00:00:00'</span>, <span class="hljs-attribute">freq</span>=<span class="hljs-string">'D'</span>): 2,
Timestamp(<span class="hljs-string">'1900-01-06 00:00:00'</span>, <span class="hljs-attribute">freq</span>=<span class="hljs-string">'D'</span>): 3}</pre></div><p id="a4c1">January first and second forms the first group, third to fifth are the second and the last data sample falls into the last group. Mostly the data are reported that the value on the reporting date is included — 20 reported on Jan-3 covers Jan-2 and Jan-3. In that case, you use <code>label</code> parameter and set it to <code>right</code>.</p><div id="6ff4"><pre><span class="hljs-meta"># groups</span>
[In]: df.set_index(<span class="hljs-string">"DATE"</span>).resample(<span class="hljs-string">"D"</span>, label=<span class="hljs-string">"right).groups</span>
[Out]:
{
Timestamp('<span class="hljs-number">1900-01-02</span> 00:00:00', freq='D'): <span class="hljs-number">1</span>,
Timestamp('<span class="hljs-number">1900-01-03</span> 00:00:00', freq='D'): <span class="hljs-number">1</span>,
Timestamp('<span class="hljs-number">1900-01-04</span> 00:00:00', freq='D'): <span class="h
Options
ljs-number">2</span>,
Timestamp('<span class="hljs-number">1900-01-05</span> 00:00:00', freq='D'): <span class="hljs-number">2</span>,
Timestamp('<span class="hljs-number">1900-01-06</span> 00:00:00', freq='D'): <span class="hljs-number">2</span>,
Timestamp('<span class="hljs-number">1900-01-07</span> 00:00:00', freq='D'): <span class="hljs-number">3</span>,}</pre></div><p id="81e1">Alternatively you can turn the groups into a dataframe and shift the values</p><div id="de05"><pre>df_groups = pd<span class="hljs-selector-class">.DataFrame</span>(df<span class="hljs-selector-class">.set_index</span>(<span class="hljs-string">"DATE"</span>)<span class="hljs-selector-class">.resample</span>(<span class="hljs-string">"D"</span>)<span class="hljs-selector-class">.groups</span>, index=<span class="hljs-selector-attr">[<span class="hljs-string">"group"</span>]</span>)<span class="hljs-selector-class">.T</span><span class="hljs-selector-class">.shift</span>(<span class="hljs-number">1</span>)</pre></div><p id="f54a">Then you count the occurrences in each group.</p><div id="d491"><pre><span class="hljs-attr">s</span> = df_groups.groupby(<span class="hljs-string">"group"</span>).size()
<span class="hljs-attr">s.name</span> = <span class="hljs-string">"count"</span></pre></div><p id="595b">And join the counts with the dates and backfilled values. It results into a dataframe having a sum for the period and number of occurrences in this period which is all you need to calculate the average.</p>
<figure id="8c2a">
<div>
<div>
<iframe class="gist-iframe" src="/gist/vaclavdekanovsky/4458418101fdeca5df08db468fe887a8.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="baa3">If you shout with excitement now, hold on, we’re not at the finish line yet. We have calculated the average for a single group, but in case we want to <code>groupby</code> multiple categories, e.g. when we have many stores, companies, or sensors to work with, we must calculate the average for each group separately.</p><p id="978e">It’s not complicated, but it can be time-consuming in case you have many inputs. See the code below:</p>
<figure id="3d94">
<div>
<div>
<iframe class="gist-iframe" src="/gist/vaclavdekanovsky/5bd2fd13fd6928a9d8c2aa31dbaf4651.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="94ae">You have smoothed the data to average, but often you don’t need each monthly average. You only needed to cover different reporting periods. For example in financial analysis reviewing the <a href="https://towardsdatascience.com/stock-fundamental-analysis-eda-of-secs-quarterly-data-summary-455e62ff4817">performance of publicly traded companies</a>, most of them report the data at the end of the quarter. But some do it in different months, or even not quarterly. If you want to model the behavior of a whole bunch of companies, you first smooth to average those which doesn’t report quarterly and the use <code>rolling</code> function limit the inputs to quarters only.</p><blockquote id="12ee"><p>More inputs mean more model parameters and longer computing time</p></blockquote><p id="ec2d">Let’s look on the stock market analysis example:</p>
<figure id="0040">
<div>
<div>
<iframe class="gist-iframe" src="/gist/vaclavdekanovsky/7714cb8b36f341947bd716adf0488019.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="49a0">Once again the coding gets us surprised. When you apply <code>.resample("D", label="right")</code> resampling to <b>days </b>the results are the same as <code>.resample("D", label="right", <b>closed="left"</b>)</code>. If you do the same while resampling to <b>months</b>, the <code>label="right"</code> matches <code>label="right", <b>closed="right"</b></code>. For this reason, you must specify <code>label="right", closed="left"</code> for the monthly calculation or use the workaround with <code>shift</code> mentioned above.</p>
<figure id="503d">
<div>
<div>
<iframe class="gist-iframe" src="/gist/vaclavdekanovsky/54db0f00be236590df0d6d764143a6d7.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><h1 id="dae3">Conclusion</h1><p id="9fa0">I hope that you have learned something not only about resampling but how to use alternative approaches to align inconsistent data when feeding your machine learning models.</p><div id="0395"><pre>The images were <span class="hljs-keyword">with</span> canva.com (affiliate link, when you click <span class="hljs-keyword">on</span> <span class="hljs-keyword">it</span> <span class="hljs-keyword">and</span> purchase a product, you won't pay more, <span class="hljs-keyword">but</span> I can receive a small reward; you can always <span class="hljs-built_in">write</span> canva.com <span class="hljs-keyword">to</span> your browser <span class="hljs-keyword">to</span> avoid this). Canva offer <span class="hljs-keyword">some</span> free templates <span class="hljs-keyword">and</span> graphics too.
Many models input cumulative variables like rainfalls in a few hours or company revenue over some months. Unfortunately, the data sources are often not aligned in time. One sensor feeds the data every odd hour, another one in even hours. One company provides information in May and the other in June.
If you don’t want to wait for all your data sources to collect the inputs or when it doesn’t suit to feed the model with data coming from different time frames, you have to spread your measurement to the same periods of time. Pandas can natively do part of the job, but in this article, we’ll explore how to upsample with an average, which requires a little bit of extra coding.
You can run all the example from this tutorial through the notebook shared on Github — Upsample_to_average.ipynb
Resampling
Downsample
Resampling in python’s Pandas allows you to turn more frequent values to less frequent — downsample, e.g. hourly data to daily sum, count and average, or daily to monthly values.
It’s called downsample because the number of data-rows decreases. You can apply sum, count, mean (for average), median, min, max, first or last. Based on daily inputs you can resample to weeks, months, quarters, years, but also to semi-months — see the complete list of resample options in pandas documentation. You can also resample to multiplies, e.g. 5H for groups of 5 hours.
Upsample
Resampling to more frequent timestamps is called upsampling. You can turn days into hours or months into days. When upsampling you must first set the DateTime column as an index of your data frame (.set_index), and only then you can upsample.
# errorincase you try upsample using .resample("D", on="col")
ValueError: Upsampling from level= oron= selection isnot supported, use .set_index(...) to explicitly set index to datetime-like
Besides the functions above, the upsampling supports backfill/bfill, ffill/pad and nearest. Though if you think that applying mean would split the daily value to an average per hour, you would be wrong. All the added up-samples are filled with NaN.
Why do we want to upsample into average
Let’s say you have a portfolio of shops and try to estimate the sales. You input the revenue on Jan-1, Jan-2 and Jan-3 and perform a regression to guess the target value on Jan-4. You train on all your shops and use the model to predict future sales.
But what if for some shops provide data only for Jan-2 and Jan-3. And others only onJan-1 and Jan-3. You can downsample and model using sum or average of these 3 days, but in such a case you lose valuable input information like Friday peeks or low Monday takings.
Contrary to common belief, this scenario is typical. When analyzing companies’ fundamental data in order to predict future stock prices, some companies report them quarterly, while others twice a year. Some do so in November, while others in August. If you would wait until everyone provides their data, you might come up with a precise prediction, but at that time all the lucrative stock would be sold out.
The example above is quite simple, but imagine you have thousands of categories. Some reported daily, others bi-daily in odd or even days, the rest every three days or irregularly.
How to do the average magic
In order to calculate the average, you need two things — the sum divided by the count. You know the sum, because it was reported, but the number of days (months, weeks …) since the last value is unknown. Luckily, pandas resampling contains a parameter .group which exposes, how many groups were created when upsampled.
Unfortunately, it doesn’t work natively with groupby so we’ll look on a single category first.
January first and second forms the first group, third to fifth are the second and the last data sample falls into the last group. Mostly the data are reported that the value on the reporting date is included — 20 reported on Jan-3 covers Jan-2 and Jan-3. In that case, you use label parameter and set it to right.
s = df_groups.groupby("group").size()
s.name = "count"
And join the counts with the dates and backfilled values. It results into a dataframe having a sum for the period and number of occurrences in this period which is all you need to calculate the average.
If you shout with excitement now, hold on, we’re not at the finish line yet. We have calculated the average for a single group, but in case we want to groupby multiple categories, e.g. when we have many stores, companies, or sensors to work with, we must calculate the average for each group separately.
It’s not complicated, but it can be time-consuming in case you have many inputs. See the code below:
You have smoothed the data to average, but often you don’t need each monthly average. You only needed to cover different reporting periods. For example in financial analysis reviewing the performance of publicly traded companies, most of them report the data at the end of the quarter. But some do it in different months, or even not quarterly. If you want to model the behavior of a whole bunch of companies, you first smooth to average those which doesn’t report quarterly and the use rolling function limit the inputs to quarters only.
More inputs mean more model parameters and longer computing time
Let’s look on the stock market analysis example:
Once again the coding gets us surprised. When you apply .resample("D", label="right") resampling to days the results are the same as .resample("D", label="right", closed="left"). If you do the same while resampling to months, the label="right" matches label="right", closed="right". For this reason, you must specify label="right", closed="left" for the monthly calculation or use the workaround with shift mentioned above.
Conclusion
I hope that you have learned something not only about resampling but how to use alternative approaches to align inconsistent data when feeding your machine learning models.
The images were with canva.com (affiliate link, when you click onitand purchase a product, you won't pay more, but I can receive a small reward; you can always write canva.com to your browser to avoid this). Canva offer some free templates and graphics too.
Other articles:
* Everything you wanted to know about Kfold train-test split
* Why it's worth considering another file types than csv
* How to turn a listof addreses into a map