
Immensely Improving Every ‘Walmart Sales’ Forecasting Model

Simplicity is key.

“To better understand the marketplace, it is incumbent for organizations to look beyond their own four walls for data sources.”

Douglas Laney (VP, Gartner Research)

Intro

Many solutions have been published for the popular Walmart Sales Forecast competition, which asks participants to predict the retailer’s weekly sales.

However, all of them seem to attempt to increase accuracy (reduce error) by focusing mainly on two things:

1) Feature engineering (getting the most out of your features)

2) Model/parameter optimization (choosing best model & best parameters)

Both of the above are very necessary indeed, but there is a third thing that adds value in a complementary way, and it’s wildly underused not only in this use case (which understandably was against the rules of the competition) but in most data science projects:

Combining external information.

In this article, we’ll build a simple sales forecast model and then blend in external variables (done properly).

The title of this article refers to improving every model, not by doing something different, but by doing the same thing with more useful data.

So we’ll use the same model throughout and we won’t do any data wrangling or feature engineering, so that we can isolate the benefit of adding useful features.

What we’ll do

Step 1: Define and understand Target
Step 2: Make a Simple Forecast Model
Step 3: Add Financial Indicators and News
Step 4: Test the Models
Step 5: Measure Results

Step 1. Define and understand Target

Walmart released data containing weekly sales for 99 departments (clothing, electronics, food…) in every physical store along with some other added features.

For this, we will create an ML model with ‘Weekly_Sales’ as the target, train it on the first 70% of the observations, and test it on the remaining 30%.

The objective is to minimize the prediction error on future weekly sales.

We’ll add external variables that impact or have a relationship with sales such as dollar index, oil price and news about Walmart.

We won’t use model/parameter optimization nor feature engineering so we can distinguish the benefit from adding the external features.
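
The chronological 70/30 split described above can be sketched with plain Python. This is only an illustration under stated assumptions: the helper name `chronological_split`, the row layout, and the toy sales values are all made up for the example and are not part of the Walmart data or of the model used later.

```python
from datetime import date, timedelta

def chronological_split(rows, train_fraction=0.7):
    """Split rows into (train, test) by date order, without shuffling."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    ordered = sorted(rows, key=lambda row: row["date"])
    cut = int(round(len(ordered) * train_fraction))
    return ordered[:cut], ordered[cut:]

# Ten toy weekly observations (hypothetical values), deliberately out of order.
toy_rows = [
    {"date": (date(2010, 2, 5) + timedelta(weeks=i)).isoformat(),
     "Weekly_Sales": 100.0 + i}
    for i in reversed(range(10))
]

train_rows, test_rows = chronological_split(toy_rows)
print(len(train_rows), len(test_rows))
print(train_rows[-1]["date"], "<", test_rows[0]["date"])
```

Because the rows are ordered by date before cutting, every test observation is later than every training observation, which is what keeps future information out of training.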

Step 2. Make a Simple Forecast Model

First, you need to have Python 2 or 3 installed and the following libraries:

$ pip install pandas OpenBlender scikit-learn

Then, open a Python script (preferably Jupyter notebook) and let’s import the needed libraries.

from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import OpenBlender
import json

Now, let’s define the methodology and model to be used on all experiments.

First, the data covers the date range from Jan 2010 to Dec 2012. Let’s use the first 70% of the data for training and the later 30% for testing (because we don’t want data leakage in our predictions).

Next, let’s define as our standard model a RandomForestRegressor with 50 estimators, which is a reasonably good option.

Finally, to keep things as simple as possible, let’s define the error as the sum of the absolute errors.
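
As a quick illustration of this metric, here is a minimal sketch with made-up numbers. The function name `sum_of_absolute_errors` and the toy values are assumptions for the example only.

```python
def sum_of_absolute_errors(y_true, y_pred):
    """Add up |actual - predicted| over all observations."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(actual - predicted) for actual, predicted in zip(y_true, y_pred))

# Toy values (hypothetical): the errors are 10, 10 and 15.
actual_sales = [100.0, 250.0, 80.0]
predicted_sales = [110.0, 240.0, 95.0]

print(sum_of_absolute_errors(actual_sales, predicted_sales))  # 35.0
```

Since the metric is a plain sum, it grows with the number of test observations, so it is only comparable between models evaluated on the same test set.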

Now, let’s put it in a Python class.

class StandardModel:
    
    # Default split criterion is squared error ('mse' was renamed in newer scikit-learn)
    model = RandomForestRegressor(n_estimators=50)
    

    def train(self, df, target):

        # Drop non numerics
        df = df.dropna(axis=1).select_dtypes(['number'])

        # Create train/test sets
        X = df.loc[:, df.columns != target].values
        y = df.loc[:,[target]].values

        # We take the first bit of the data as test and the 
        # last as train because the data is ordered desc.
        div = int(round(len(X) * 0.29))

        X_train = X[div:]
        y_train = y[div:]

        print('Train Shape:')
        print(X_train.shape)
        print(y_train.shape)

        #Random forest model specification
        self.model = RandomForestRegressor(n_estimators=50)

        # Train on data
        self.model.fit(X_train, y_train.ravel())

    def getMetrics(self, df, target):
        # Function to get the error sum from the trained model

        # Drop non numerics
        df = df.dropna(axis=1).select_dtypes(['number'])

        # Create train/test sets
        X = df.loc[:, df.columns != target].values
        y = df.loc[:,[target]].values

        div = int(round(len(X) * 0.29))

        X_test = X[:div]
        y_test = y[:div]

        print('Test Shape:')
        print(X_test.shape)
        print(y_test.shape)
        
        # Predict on test
        y_pred_random = self.model.predict(X_test)

        # Gather absolute error
        error_sum = sum(abs(y_test.ravel() - y_pred_random))

        return error_sum

Above we have an object with 3 elements:

model (RandomForestRegressor)
train: A function to train that model with a dataframe and a target
getMetrics: A function to test the trained model on the test data and retrieve the error

We will use this configuration for all of the experiments, although you can modify it as you like to test different models, parameters, or configurations. The value added by the external data should remain and could potentially increase.

Now, let’s get the Walmart data. You can get that csv here.

df_walmart = pd.read_csv('walmartData.csv')
print(df_walmart.shape)
df_walmart.head()

There are 421,570 observations. As we said before, each observation is a record of weekly sales for one department in one store.

Let's plug the data into the model without tampering with it at all.

our_model = StandardModel()
our_model.train(df_walmart, 'Weekly_Sales')

total_error_sum = our_model.getMetrics(df_walmart, 'Weekly_Sales')
print("Error sum: " + str(total_error_sum))

> Error sum: 967705992.5034052

The sum of all errors for the complete model is $967,705,992.5 USD across all the predictions vs. the real sales.

This doesn’t mean much by itself; the only reference point is that the sum of all sales in that period is $6,737,218,987.11 USD.
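
To put the two figures side by side, the error can be expressed as a share of total sales. This is just the arithmetic on the two numbers quoted above; the variable names are chosen for the example.

```python
# Figures quoted in the text above.
total_error_sum = 967705992.5034052
total_sales = 6737218987.11

# Error as a fraction of the total sales in the same period.
error_share = total_error_sum / total_sales
print(f"{error_share:.1%}")  # 14.4%
```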

Since there is a whole lot of data, we will only focus on Store #1 for this tutorial, but the methodology is absolutely replicable for all stores.

Let’s take a look at the error generated by Store 1 alone.

# Select store 1
df_walmart_st1 = df_walmart[df_walmart['Store'] == 1]

error_sum_st1 = our_model.getMetrics(df_walmart_st1, 'Weekly_Sales')
print("Error sum error_sum_st1: " + str(error_sum_st1))

# > Error sum error_sum_st1: 24009404.060399983

So, Store 1 is responsible for $24,009,404.06 USD of the error, and this will be our baseline for comparison.

Now let’s break down the error by department to have more visibility later.

error_summary = []

for i in range(1,100):
    try:
        df_dept = df_walmart_st1[df_walmart_st1['Dept'] == i]
        error_sum = our_model.getMetrics(df_dept, 'Weekly_Sales')
        print("Error dept : " + str(i) + ' is: ' + str(error_sum))
        error_summary.append({'dept' : i, 'error_sum_normal_model' : error_sum})
    except: 
        error_sum = 0
        print('No obs for Dept: ' + str(i))

error_summary_df = pd.DataFrame(error_summary)
error_summary_df.head()

Now we have a dataframe with the errors for each department of Store 1 under our baseline model.

Let’s improve these numbers.

Step 3. Add Financial Indicators and News

Let’s select department 1 to make a simple example.

# .copy() avoids pandas' SettingWithCopyWarning when we add a column below
df_st1dept1 = df_walmart_st1[df_walmart_st1['Dept'] == 1].copy()

Now, let’s prep the timestamp variable.

# First we need to add the UNIX timestamp which is the number 
# of seconds since 1970 on UTC, it is a very convenient 
# format because it is the same in every time zone in the world!

df_st1dept1['timestamp'] = OpenBlender.dateToUnix(df_st1dept1['Date'], 
                       date_format = '%Y-%m-%d', 
                       timezone = 'GMT')

df_st1dept1 = df_st1dept1.sort_values('timestamp').reset_index(drop = True)
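
For readers who want to see what such a conversion involves, here is a minimal sketch using only the Python standard library. It is not OpenBlender’s implementation; the helper name `date_to_unix` is an assumption for the example, and it treats every date as midnight UTC.

```python
from datetime import datetime, timezone

def date_to_unix(date_string, date_format="%Y-%m-%d"):
    """Convert a date string to seconds since 1970-01-01 00:00:00 UTC."""
    parsed = datetime.strptime(date_string, date_format)
    # Attach UTC explicitly so the result does not depend on the local time zone.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

print(date_to_unix("1970-01-01"))  # 0
print(date_to_unix("2010-02-05"))  # 1265328000
```

Attaching the UTC time zone before calling `timestamp()` is what makes the value identical on every machine, regardless of its local settings.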

Now, let’s search for intersecting (time-overlapping) datasets about ‘business’ or ‘walmart’ in OpenBlender.

Note: To get a token, you need to create an account on openblender.io (free); you’ll find the token in the ‘Account’ tab under your profile icon.

token = 'YOUR_TOKEN_HERE'

print('From : ' + OpenBlender.unixToDate(min(df_st1dept1.timestamp)))

print('Until: ' + OpenBlender.unixToDate(max(df_st1dept1.timestamp)))

# Now, let's search on OpenBlender
search_keyword = 'business walmart'

# We need to pass our timestamp column and 
# search keywords as parameters.
OpenBlender.searchTimeBlends(token,
                             df_st1dept1.timestamp,
                             search_keyword)

The search found several datasets. For each one we can see the name, description, url, features and, most importantly, the time intersection with ours, so we can blend them into our dataset.

Let’s start by blending this walmart tweets dataset and look for promos.

Note: I picked this one because it makes sense, but you can search for hundreds of other ones.

We can blend new columns into our dataset by searching for terms in texts or news aggregated by time. For instance, we could create a ‘promo’ feature that counts the mentions matching our self-made n-grams:

text_filter = {'name' : 'promo', 
               'match_ngrams': ['promo', 'discount', 'cut', 'markdown','deduction']}

# blend_source needs the id_dataset and the name of the feature.
blend_source = {
                'id_dataset':'5e1deeda9516290a00c5f8f6',
                'feature' : 'text',
                'filter_text' : text_filter
            }

df_blend = OpenBlender.timeBlend( token = token,
                                  anchor_ts = df_st1dept1.timestamp,
                                  blend_source = blend_source,
                                  blend_type = 'agg_in_intervals',
                                  interval_size = 60 * 60 * 24 * 7,
                                  direction = 'time_prior',
                                  interval_output = 'list')

df_anchor = pd.concat([df_st1dept1, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)

The parameters for the timeBlend function (you can find the documentation here):

anchor_ts: We only need to send our timestamp column so that it can be used as an anchor to blend the external data.
blend_source: The information about the feature we want.
blend_type: ‘agg_in_intervals’ because we want 1 week interval aggregation to each of our observations.
interval_size: The size of the interval in seconds (24 * 7 hours in this case).
direction: ‘time_prior’ because we want the interval to gather observations from the prior 7 days and not forward to avoid data leakage.
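The idea behind a ‘time_prior’ interval aggregation can be sketched in plain Python. This only illustrates the concept and is not how OpenBlender implements it: the function name `agg_prior_intervals` is hypothetical, and whether the real service includes or excludes the interval boundaries is an assumption.

```python
from bisect import bisect_left

def agg_prior_intervals(anchor_ts, obs, interval_size):
    """For each anchor t, gather observation values with t - interval_size <= ts < t.

    obs is a list of (timestamp, value) pairs from the external source.
    """
    obs = sorted(obs, key=lambda pair: pair[0])
    times = [ts for ts, _ in obs]
    rows = []
    for t in anchor_ts:
        lo = bisect_left(times, t - interval_size)
        hi = bisect_left(times, t)  # strictly before t, so nothing from the future leaks in
        values = [v for _, v in obs[lo:hi]]
        rows.append({'timestamp': t, 'count': len(values), 'values': values})
    return rows

DAY = 60 * 60 * 24
WEEK = 7 * DAY
anchors = [WEEK, 2 * WEEK]
observations = [(1 * DAY, 'a'), (6 * DAY, 'b'), (8 * DAY, 'c'), (14 * DAY, 'd')]

for row in agg_prior_intervals(anchors, observations, WEEK):
    print(row)
```

In this toy data, observation 'd' falls exactly on the second anchor and is left out, which is the behaviour we want when the goal is to avoid leaking information from the moment being predicted.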

We now have our original dataset but with 2 new columns, the ‘COUNT’ of our ‘promo’ feature and a list of the actual texts in case one wants to iterate through each one.

df_anchor.tail()

So now we have a numerical feature about how many times our ngrams were mentioned. We could probably do better ngrams if we knew which store or department corresponds to ‘1’ (Walmart didn’t share that).
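As a rough illustration of how such a mention count could be computed, here is a small sketch. It assumes a case-insensitive substring match and counts texts rather than individual hits; OpenBlender’s actual matching rules may differ, and the texts below are invented.

```python
def count_ngram_mentions(texts, ngrams):
    """Count the texts that mention at least one ngram (case-insensitive substring match)."""
    lowered = [ng.lower() for ng in ngrams]
    return sum(any(ng in text.lower() for ng in lowered) for text in texts)

ngrams = ['promo', 'discount', 'cut', 'markdown', 'deduction']

# Hypothetical texts for one 7-day interval
week_of_texts = [
    'Huge DISCOUNT on TVs this weekend',
    'New store opening downtown',
    'Promo code inside: extra price cut',
    'Shortcut to the parking lot',   # false positive: "cut" inside "shortcut"
]

print(count_ngram_mentions(week_of_texts, ngrams))
```

The last text shows why the choice of ngrams matters: a short term like ‘cut’ can match unrelated words and add noise to the feature.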

Let’s apply the Standard Model and compare the error vs. the original.

our_model.train(df_anchor, 'Weekly_Sales')
error_sum = our_model.getMetrics(df_anchor, 'Weekly_Sales')
error_sum

#> 253875.30

The current model had a $253,875 error while the previous one had $290,037. That’s a 12% improvement.
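For reference, the improvement is simply the error reduction relative to the baseline. A minimal sketch using the two error sums reported here (the helper name `improvement_pct` is ours):

```python
def improvement_pct(baseline_error, new_error):
    """Relative error reduction, as a percentage of the baseline error."""
    return 100 * (baseline_error - new_error) / baseline_error

print(round(improvement_pct(290037, 253875.30), 1))  # 12.5
```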

But this doesn’t prove much; the RandomForest could simply have gotten lucky. After all, the original model was trained with over 299K observations, while the current one was trained with only 102!

We can also blend numerical features. Let’s try blending the Dollar Index, the Oil price, and the Monthly Consumer Sentiment.

# OIL
blend_source = {
                'id_dataset':'5e91045a9516297827b8f5b1',
                'feature' : 'price'
            }

df_blend = OpenBlender.timeBlend( token = token,
                                  anchor_ts = df_anchor.timestamp,
                                  blend_source = blend_source,
                                  blend_type = 'agg_in_intervals',
                                  interval_size = 60 * 60 * 24 * 7,
                                  direction = 'time_prior',
                                  interval_output = 'avg',
                                  missing_values = 'impute')

df_anchor = pd.concat([df_anchor, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)

# DOLLAR INDEX

blend_source = {
                'id_dataset':'5e91029d9516297827b8f08c',
                'feature' : 'price'
            }

df_blend = OpenBlender.timeBlend( token = token,
                                  anchor_ts = df_anchor.timestamp,
                                  blend_source = blend_source,
                                  blend_type = 'agg_in_intervals',
                                  interval_size = 60 * 60 * 24 * 7,
                                  direction = 'time_prior',
                                  interval_output = 'avg',
                                  missing_values = 'impute')

df_anchor = pd.concat([df_anchor, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)

# CONSUMER SENTIMENT

blend_source = {
                'id_dataset':'5e979cf195162963e9c9853f',
                'feature' : 'umcsent'
            }

df_blend = OpenBlender.timeBlend( token = token,
                                  anchor_ts = df_anchor.timestamp,
                                  blend_source = blend_source,
                                  blend_type = 'agg_in_intervals',
                                  interval_size = 60 * 60 * 24 * 7,
                                  direction = 'time_prior',
                                  interval_output = 'avg',
                                  missing_values = 'impute')

df_anchor = pd.concat([df_anchor, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)

df_anchor

Now we have 6 more features: the average of the oil price, the dollar index and the consumer sentiment over the 7-day intervals, plus the count of each (which in this case is irrelevant).

Let’s run that model again.

our_model.train(df_anchor, 'Weekly_Sales')
error_sum = our_model.getMetrics(df_anchor, 'Weekly_Sales')
error_sum

#> 223831.9414

Now we’re down to a $223,831 error. That’s a 22.8% improvement with respect to the original $290,037!

So let’s now try it with every department separately to measure how consistent the benefit is.

Step 4. Test in All Departments

To get a glimpse, we’re gonna experiment with the first 10 Departments first and compare the benefit of adding each additional source.

import traceback

# Function to filter features from other sources

def excludeColsWithPrefix(df, prefixes):
    cols = df.columns
    for prefix in prefixes:
        cols = [col for col in cols if prefix not in col]
    return df[cols]

error_sum_enhanced = []

# Loop through the first 10 Departments and test them.

for dept in range(1, 11):
    print('---')
    print('Starting department ' + str(dept))
    
    # Get it into a dataframe (copy so we can add columns safely)
    df_dept = df_walmart_st1[df_walmart_st1['Dept'] == dept].copy()
    
    # Unix Timestamp
    df_dept['timestamp'] = OpenBlender.dateToUnix(df_dept['Date'], 
                                           date_format = '%Y-%m-%d', 
                                           timezone = 'GMT')

    df_dept = df_dept.sort_values('timestamp').reset_index(drop = True)
    
    
    # "PROMO" FEATURE OF MENTIONS ON WALMART
    
    text_filter = {'name' : 'promo', 
                   'match_ngrams': ['promo', 'discount', 'cut', 'markdown','deduction']}
    
    blend_source = {
                    'id_dataset':'5e1deeda9516290a00c5f8f6',
                    'feature' : 'text',
                    'filter_text' : text_filter
                }

    # Anchor on this department's timestamps, not on department 1
    df_blend = OpenBlender.timeBlend( token = token,
                                      anchor_ts = df_dept.timestamp,
                                      blend_source = blend_source,
                                      blend_type = 'agg_in_intervals',
                                      interval_size = 60 * 60 * 24 * 7,
                                      direction = 'time_prior',
                                      interval_output = 'list')

    df_anchor = pd.concat([df_dept, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)
    
    # OIL 
    blend_source = {
                    'id_dataset':'5e91045a9516297827b8f5b1',
                    'feature' : 'price'
                }

    df_blend = OpenBlender.timeBlend( token = token,
                                      anchor_ts = df_anchor.timestamp,
                                      blend_source = blend_source,
                                      blend_type = 'agg_in_intervals',
                                      interval_size = 60 * 60 * 24 * 7,
                                      direction = 'time_prior',
                                      interval_output = 'avg',
                                      missing_values = 'impute')

    df_anchor = pd.concat([df_anchor, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)

    # DOLLAR INDEX

    blend_source = {
                    'id_dataset':'5e91029d9516297827b8f08c',
                    'feature' : 'price'
                }

    df_blend = OpenBlender.timeBlend( token = token,
                                      anchor_ts = df_anchor.timestamp,
                                      blend_source = blend_source,
                                      blend_type = 'agg_in_intervals',
                                      interval_size = 60 * 60 * 24 * 7,
                                      direction = 'time_prior',
                                      interval_output = 'avg',
                                      missing_values = 'impute')

    df_anchor = pd.concat([df_anchor, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)

    # CONSUMER SENTIMENT

    blend_source = {
                    'id_dataset':'5e979cf195162963e9c9853f',
                    'feature' : 'umcsent'
                }

    df_blend = OpenBlender.timeBlend( token = token,
                                      anchor_ts = df_anchor.timestamp,
                                      blend_source = blend_source,
                                      blend_type = 'agg_in_intervals',
                                      interval_size = 60 * 60 * 24 * 7,
                                      direction = 'time_prior',
                                      interval_output = 'avg',
                                      missing_values = 'impute')

    df_anchor = pd.concat([df_anchor, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)
    
    
    try:
        
        error_sum = {}
        
        # Gather errors as each source is added on top of the previous ones
        
        # Dollar Index
        df_selection = excludeColsWithPrefix(df_anchor, ['WALMART_TW', 'US_MONTHLY_CONSUMER', 'OIL_INDEX'])
        our_model.train(df_selection, 'Weekly_Sales')
        error_sum['1_feature'] = our_model.getMetrics(df_selection, 'Weekly_Sales')
        
        # Walmart News
        df_selection = excludeColsWithPrefix(df_anchor, [ 'US_MONTHLY_CONSUMER', 'OIL_INDEX'])
        our_model.train(df_selection, 'Weekly_Sales')
        error_sum['2_features'] = our_model.getMetrics(df_selection, 'Weekly_Sales')
        
        # Oil Index
        df_selection = excludeColsWithPrefix(df_anchor,['US_MONTHLY_CONSUMER'])
        our_model.train(df_selection, 'Weekly_Sales')
        error_sum['3_features'] = our_model.getMetrics(df_selection, 'Weekly_Sales')
        
        # Consumer Sentiment (All features)
        df_selection = df_anchor
        our_model.train(df_selection, 'Weekly_Sales')
        error_sum['4_features'] = our_model.getMetrics(df_selection, 'Weekly_Sales')
        
    except Exception:
        
        print(traceback.format_exc())
        print("No observations found for department: " + str(dept))
        # Keep an empty row so the results still line up by department
        error_sum = {}
        
    error_sum_enhanced.append(error_sum)
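The loop above measures each source cumulatively: every step excludes the sources that have not been added yet. The small sketch below shows that selection logic on plain column names. The column names and the ‘DOLLAR_INDEX’ marker are hypothetical, since the real blended column names may differ, and `exclude_cols_containing` is a stand-in for the filter used in the article. Note that the filter matches the marker anywhere in the name, not only at the start.

```python
def exclude_cols_containing(columns, markers):
    """Keep only the column names that contain none of the markers."""
    return [col for col in columns if not any(m in col for m in markers)]

# Hypothetical column names after blending all four sources
columns = ['Weekly_Sales', 'timestamp',
           'DOLLAR_INDEX.price_avg', 'WALMART_TW.promo_COUNT',
           'OIL_INDEX.price_avg', 'US_MONTHLY_CONSUMER.umcsent_avg']

# Sources in the order they are added; step n excludes everything after the nth source
sources = ['DOLLAR_INDEX', 'WALMART_TW', 'OIL_INDEX', 'US_MONTHLY_CONSUMER']

for n in range(1, len(sources) + 1):
    kept = exclude_cols_containing(columns, sources[n:])
    print(n, kept)
```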

Let’s get the results into a DataFrame and visualize.

separated_results = pd.DataFrame(error_sum_enhanced)
separated_results['original_error'] = error_summary_df[0:10]['error_sum_normal_model']
separated_results = separated_results[['original_error', '1_feature', '2_features', '3_features', '4_features']]
separated_results

separated_results.transpose().plot(kind='line')

Departments 4 and 6 are on a higher order of magnitude than the rest. Let’s remove them to take a closer look at the others.

separated_results.drop([6, 4]).transpose().plot(kind='line')

We can see that on almost all departments the error dropped as we added new features. We can also see that the Oil Index (third feature) was not only unhelpful but even harmful for some departments.

I excluded the Oil Index and ran the algorithm with the 3 remaining features on all the departments (which you can do by iterating over all of error_summary_df and not just the first 10).

Let’s see the results.

Step 5. Measure Results

These are the results of the ‘3 feature’ blend and the improvement percentage on all departments:

Not only did the added features improve the error on over 88% of the Departments, but some improvements were significant.

This is the histogram for the improvement percentage.

The original error (calculated at the beginning of the article) was $24,009,404.06 USD, and the final error is $9,596,215.21 USD, meaning it was reduced by over 60%.
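Both summary figures (the share of departments that improved and the overall reduction) come from comparing the per-department errors before and after. A minimal sketch of that calculation follows; the per-department numbers are made up purely to show the arithmetic, and `summarize` is a hypothetical helper.

```python
def summarize(original_errors, enhanced_errors):
    """Return (share of departments improved in %, total error reduction in %)."""
    pairs = list(zip(original_errors, enhanced_errors))
    improved = sum(1 for orig, new in pairs if new < orig)
    share_improved = 100 * improved / len(pairs)
    total_orig = sum(orig for orig, _ in pairs)
    total_new = sum(new for _, new in pairs)
    return share_improved, 100 * (total_orig - total_new) / total_orig

# Made-up errors for four departments, only to illustrate the calculation
share, reduction = summarize([100.0, 200.0, 50.0, 400.0],
                             [80.0, 210.0, 40.0, 120.0])
print(share, reduction)

# The totals reported in the article give the overall reduction directly
print(round(100 * (24009404.06 - 9596215.21) / 24009404.06, 1))
```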

And this is just one store.

Thank you for reading.