Michael Grogan

Pareto Distributions and Monte Carlo Simulations

Modelling web page views with a Pareto Distribution

Pareto distributions are all around us. The pattern they describe is often referred to as the 80/20 rule. Some examples:

  • 20% of all websites get 80% of the traffic.
  • The top 20% of earners globally make 80% of the income.
  • You wear 20% of your clothes 80% of the time.
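The 80/20 split corresponds to one particular Pareto shape. As a rough sketch (the function names here are illustrative and not from the original article): for a classical Pareto distribution with shape alpha > 1, the share of the total held by the top fraction p is p^(1 - 1/alpha), so the shape that reproduces an 80/20 split can be solved for directly.

```python
import math

def top_share(p, alpha):
    """Share of the total held by the top fraction p under a classical
    Pareto distribution with shape alpha (requires alpha > 1)."""
    return p ** (1.0 - 1.0 / alpha)

def alpha_for_split(p, share):
    """Shape alpha for which the top fraction p holds `share` of the total."""
    return 1.0 / (1.0 - math.log(share) / math.log(p))

alpha_80_20 = alpha_for_split(0.2, 0.8)  # roughly 1.16
print(round(alpha_80_20, 3), round(top_share(0.2, alpha_80_20), 3))
```

Under this formula, larger shape values give a much less extreme split: with alpha = 3, the top 20% would hold only about a third of the total.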

Traditionally, we are taught to assume that data follow a normal distribution, i.e. a symmetric distribution where the mean, median and mode coincide.

Source: RStudio

However, many of the phenomena we observe around us often more closely resemble a Pareto distribution.

Source: Jupyter Notebook Output

In this particular example, we can see a distribution with a heavy right tail: most observations have low values and sit towards the left of the graph, while a select few observations with much higher values stretch out to the right.
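One way to see the difference numerically is to compare the mean and median of samples from each distribution. The following is a minimal sketch, assuming a shape of 3 and an arbitrary seed (neither comes from the figures above): for the normal sample the two statistics nearly coincide, while for the Pareto sample the long right tail pulls the mean above the median.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

normal_draws = rng.normal(loc=0.0, scale=1.0, size=100_000)
pareto_draws = rng.pareto(3.0, size=100_000)  # shape 3 is an assumption

# For a symmetric distribution mean and median agree; for a right-skewed
# distribution the mean is dragged upwards by the extreme values.
for name, x in [("normal", normal_draws), ("pareto", pareto_draws)]:
    print(name, "mean:", x.mean(), "median:", np.median(x))
```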

Modelling Web Page Views with a Monte Carlo Simulation

Let’s take the example of web page views over time. Here is a line graph from Wikimedia Toolforge showing daily page views for the term “earthquake” from January 2016 to August 2020:

Source: Wikimedia Toolforge

We can see that there are “spikes” in page views at certain periods — possibly at a time when an earthquake is under way somewhere in the world.

This is what we would expect — this is an example of a search term which sees higher page view interest at certain times. As a matter of fact, many webpages follow this pattern, where traffic more or less follows a stationary pattern — accompanied by sudden “spikes”.

Let’s plot a histogram of this data.

Source: Jupyter Notebook Output

In the above instance, we see that the majority of page views for a given day are below 10,000, while there are a select few instances where this is exceeded.

The maximum number of page views in a given day over the selected time period was 31,520. With most days at low values and a few extreme days, the data closely resembles a Pareto distribution.

Attempting to forecast these spikes with traditional time series tools such as ARIMA is largely futile: it is not possible to know in advance when a particular spike in traffic will occur, as it depends heavily on external circumstances rather than on past data.

A more meaningful exercise would be to run simulations to forecast the range of traffic that one might expect to see given the assumption of a Pareto distribution.

Pareto samples can be drawn in Python with NumPy as follows:

numpy.random.pareto(a, size=None)

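a represents the shape of the distribution, and size is the number of samples to draw. Here size is set to 10,000, i.e. 10,000 random numbers from the distribution are generated for the Monte Carlo simulation. Note that NumPy's function samples from the Pareto II (Lomax) form of the distribution, whose values start at zero.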
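NumPy's pareto function draws from the Pareto II (Lomax) form, and the classical Pareto can be obtained by adding 1 and multiplying by a scale. The sketch below illustrates this, assuming the newer Generator interface (np.random.default_rng) rather than the legacy np.random.pareto call used in this article, and a hypothetical scale m chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an arbitrary choice
a = 3.0      # shape
m = 1000.0   # hypothetical scale (minimum value) for the classical form

lomax = rng.pareto(a, size=10_000)  # Pareto II / Lomax: support starts at 0
classical = (lomax + 1.0) * m       # classical Pareto: support starts at m

print(lomax.min(), classical.min())
```

The simulation in this article multiplies the raw Lomax draws by a scale without adding 1, so simulated values can be arbitrarily close to zero.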

The mean and standard deviation for the original time series are calculated.

import numpy as np
# value: array of daily page views (the original time series)
mu = np.mean(value)
sigma = np.std(value)

The time series has a mean of 5224 and a standard deviation of 2057.

Using these values together with random samples from an assumed Pareto distribution, a Monte Carlo simulation can be generated.

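a = 3  # shape parameter; 4 and 5 are also tried below
# 10,000 Pareto draws, scaled by (mu + sigma)
t = np.random.pareto(a, 10000) * (mu + sigma)
t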

As mentioned, a controls the shape of the distribution: the smaller a is, the heavier the right tail. Let’s set it to 3 in the first instance.

Here are the recorded values for the distribution in percentile terms:

Source: Jupyter Notebook Output
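The notebook output itself is not reproduced here. As a minimal sketch of how such a percentile summary could be produced, the code below uses the mean and standard deviation quoted above; the helper names, the percentile levels and the seed are assumptions, and the exact numbers (the maximum in particular) will differ from run to run and from the figures reported in this article.

```python
import numpy as np

mu, sigma = 5224, 2057          # mean and standard deviation quoted in the article
rng = np.random.default_rng(1)  # seed chosen arbitrarily for repeatability

def simulate(a, n=10_000):
    """Draw n Pareto (Lomax) samples with shape a, scaled by (mu + sigma)."""
    return rng.pareto(a, n) * (mu + sigma)

def summarise(t):
    """Return selected percentiles and the maximum of a simulated sample."""
    levels = [25, 50, 75, 90, 99]
    return dict(zip(levels, np.percentile(t, levels))), t.max()

t = simulate(a=3)
percentiles, largest = summarise(t)
print(percentiles, largest)
```

Calling simulate with a=4 or a=5 gives the corresponding summaries for the other scenarios.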

We can see that the maximum value recorded when a = 3 is in excess of 350,000, which is far higher than the maximum recorded by the time series.

What happens if we set a = 4?

Source: Jupyter Notebook

We see that the maximum recorded value is now in excess of 60,000, which is still a lot higher than the maximum recorded by the time series.

Let’s try a = 5.

Source: Jupyter Notebook Output

Interpretation

With a = 5, the maximum simulated value is just above 35,000, which is more in line with the maximum of 31,520 seen in the original time series.
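The fall in the simulated maximum as a increases can also be read off the distribution itself. The sketch below assumes the Lomax form that NumPy samples from and the same (mu + sigma) scaling. It computes the level that a single draw exceeds with probability 1/n, which is only a rough yardstick for the largest of n draws and not a prediction of any particular run.

```python
mu, sigma = 5224, 2057
scale = mu + sigma
n = 10_000  # number of simulated draws

def lomax_quantile(p, a):
    """Quantile function of the Lomax distribution sampled by numpy's pareto."""
    return (1.0 - p) ** (-1.0 / a) - 1.0

# Level exceeded by one draw in n on average, for each shape considered.
for a in (3, 4, 5):
    typical_max = lomax_quantile(1 - 1 / n, a) * scale
    print(a, round(typical_max))
```

These yardsticks come out on the order of 150,000, 65,000 and 39,000 respectively. The maximum of a single heavy-tailed simulation can land well above them, which is consistent with the value above 350,000 seen for a = 3.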

However, consider that in this case we are only looking at time series data from 2016 onwards. Many of the most serious earthquakes actually happened before 2016.

For instance, suppose that an earthquake as serious as the 2004 Indian Ocean earthquake and tsunami were to happen today. We would reasonably expect page view interest for the term “earthquake” to be much larger than anything we have observed since 2016.

If we assume that the Pareto distribution has a = 3, then the model is indicating that page views for this term could spike to in excess of 350,000.

In this regard, the Monte Carlo simulation is allowing us to examine scenarios that would be beyond the bounds of the time series data that has been recorded.
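One way to quantify such scenarios is the probability that a simulated day exceeds the largest value actually observed. The following is a sketch under the same assumptions as the simulation (Lomax draws scaled by mu + sigma); the function name is illustrative.

```python
mu, sigma = 5224, 2057
scale = mu + sigma
observed_max = 31_520  # largest daily page-view count reported in the article

def exceedance_probability(x, a, scale):
    """P(X > x) for X = scale * Lomax(a), i.e. the scaled simulated draws."""
    return (1.0 + x / scale) ** (-a)

for a in (3, 4, 5):
    p = exceedance_probability(observed_max, a, scale)
    print(f"a={a}: P(day > {observed_max}) = {p:.5f}, "
          f"about {p * 10_000:.0f} of 10,000 draws")
```

The smaller the shape a, the more simulated days land beyond the historical maximum, which is what makes the choice of a the key assumption in this kind of scenario analysis.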

Earthquakes (unfortunately) have been around for a lot longer than the internet has, and therefore we have no way of measuring what page views for this search term would have been when the most powerful earthquakes were recorded.

That said, conducting a Monte Carlo Simulation in conjunction with modelling on the closest theoretical distribution can allow for a strong scenario analysis of what the bounds of a time series could be under particular circumstances.

Conclusion

In this article, you have seen:

  • What a Pareto distribution is
  • How to generate such a distribution in Python
  • How to combine a Pareto distribution with a Monte Carlo simulation

Many thanks for your time. As always, very grateful for any feedback, thoughts, or indeed questions. Please feel free to leave them in the comments section.

Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.

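References

  • Machine Learning Mastery: Monte Carlo Sampling for Probability (https://machinelearningmastery.com/monte-carlo-sampling-for-probability/)
  • Numpy v1.14 manual: numpy.random.pareto (http://pageperso.lif.univ-mrs.fr/~francois.denis/IAAM1/numpy-html-1.14.0/reference/generated/numpy.random.pareto.html)
  • Stack Overflow: Matplotlib, draw lines from x axis to points (https://stackoverflow.com/questions/8441882/matplotlib-draw-lines-from-x-axis-to-points)
  • Towards Data Science: Monte Carlo Simulations in Python: Analysing Web Page Views (https://towardsdatascience.com/monte-carlo-simulations-in-python-analysing-web-page-views-b6dbec2ba683)
  • Wikimedia Toolforge: Pageviews Analysis (https://pageviews.toolforge.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&start=2015-07-01&end=2020-08-02&pages=Earthquake)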