avatarMate Pocs

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5204

Abstract

401">In order to show the convergence in the same magnitude, we can also transform the <i>S</i>(<i>n</i>) by subtracting <i>μ</i> and multiplying by [sqrt(<i>n</i>) / <i>σ</i>], I’m calling this new distribution Standardised <i>S</i>(<i>n</i>). See CLT’s <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">Wikipedia page</a> for more details on the math. With this transformation, <i>N</i>(0,1) will be the approximating normal distribution, regardless of <i>n</i>.</p><p id="e6dc">Standardised CDF charts for <i>n</i> = 10, 20, 50:</p><figure id="9a3e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*pJ1fuVVWhOQusy7SCIKQKw.png"><figcaption></figcaption></figure><figure id="fd0c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*vFOm-WMB0Pz0JEEXKVMByw.png"><figcaption></figcaption></figure><figure id="7f2b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*e9s7TeC6PPWvykcpt_8YqA.png"><figcaption>CDFs of Standardised S(n) and Standard Normal distributions</figcaption></figure><p id="e34b">and for <i>n</i> = 100, 250, 500:</p><figure id="edc8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*sqHoGAZApTmmnqQpea-PVg.png"><figcaption></figcaption></figure><figure id="6fe1"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*d7FU_9n3_QtHbJqA4oyZ7Q.png"><figcaption></figcaption></figure><figure id="0c36"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*K9x25w4jg7TZLj1BGv_QZg.png"><figcaption>CDFs of Standardised S(n) and Standard Normal distributions</figcaption></figure><p id="02ca">The smoothing effect is much more visual in this second, standardised set, because the underlying normal distribution does not get more and more concentrated around 0.3. However, it can be misleading, and it can encourage the common misconception that <i>S(n)</i> converges to one specific normal distribution.</p><h2 id="53d1">Animated CDF’s</h2><p id="0457">Finally! This is what we were building up to! In this section we are going to create animations of the CDF charts. It’s basically the same information as we had in the previous section, just more mesmerising:</p><figure id="beec"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*QxlxNl79Mw8jYXNI8hp3iw.gif"><figcaption>Animation of CDFs of S(n) and the corresponding Normal distribution between n = 1 and n = 100</figcaption></figure><p id="70aa">Once again, the CLT and the LLN can be observed beautifully at the same time. The animation above is my absolute favourite one, if anyone asked me to “show” the Central Limit Theorem, that is the visualisation I would pick.</p><p id="92e8">Throughout this post, I did not include code snippets, but I think creating a <code>gif</code> is not entirely straightforward. I experimented with <code>matplotlib</code>’s <code>animation</code> package, but found it cumbersome to work with. The good news is, you can easily create <code>gif</code>’s just by saving your charts as pictures. Once you have a <code>pyplot</code> chart, you can save it as <code>png</code>:</p><div id="e088"><pre>plt.savefig(./pictures_folder/name_of_pictures_<<span class="hljs-built_in">number</span>>.png)</pre></div><p id="f0a4">Change <code>pictures_folder</code> and <code>name_of_pictures_</code> to whatever you want to use, and make sure you put a number at the end of the file names. The <code>gif</code> will be created by taking the files in alphabetical order, so if you go from 1 to 100, make sure to encode it as 001, 002, …</p><p id="003f">Once you have your pictures saved, you have to navigate there and enter the following command in your terminal:</p><div id="67e0"><pre><span class="hljs-built_in">convert</span> -<span class="hljs-built_in">delay</span> <span class="hljs-number">10</span> name_of_pictures_*.png name_of_gif.gif</pre></div><p id="3a40">(At least that is how it works on MacOS, I’m sure something similar exists for other systems.) Just make sure you replace <code>name_of_pictures_</code> to whatever you used to save the <code>png</code> files.</p><p id="565f">While we are at it, we can do the same for the standardised CDFs. For the non-standard case, I only ran the animation until <i>n</i> = 100, because not much is changing at this scope afterwards, but we can go nuts with the standardised distribution. Below, we are plotting it between <i>n</i> = 1 and <i>n</i> = 500.</p><figure id="4cf6"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*mtm4SuUZ2qhbQIKtQJhkSg.gif"><figcaption>Animation of CDFs of standardised S(n) and the Standard Normal distribution between n = 1 and n = 500</figcaption></figure><p id="f12a">Same outcome as in the previous section, the convergence shows a nicer pattern, but we lose the shrinking aspect, and the LLN is not visualised.</p><h2 id="b742">Difference from the Normal Distribution</h2><p id="2b8f">Another way we can think about <i>S</i>(<i>n</i>) in relation to a normal distribution is by calculating the differences between the two distributions.</p><p id="2675">Once again, we are going to use our “accurate” samples that we created for the PMFs in the previous post. This is our way of gett

Options

ing around the sampling issue. Please note that randomly generating a sample with a higher magnitude would doubtless yield very similar results. Our approach is just computationally less expensive.</p><p id="59d0">So, for each n, we have a list of 1,000 numbers, which are from the “accurate” sample, ensuring that there is no significant sampling error. For each <i>n</i>, we are also generating the corresponding normal distribution with 1,000 samples. We sort both lists in an ascending order, and calculate the difference. On the charts below, <code>xlabel</code> is simply the position of the element in the list. It does not really have an intrinsic meaning, so I omitted them from the charts below. If we had a sample size of 10,000 instead of 1,000, the shapes would be very similar, with different numbers in the <code>xlabel</code>, which seemed confusing to me. Hence the lack of label below on the x axis.</p><p id="bac6">Plotting this difference for <i>n</i> = 10, 50, 100:</p><figure id="f0da"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*GpGxNmLxCtWEIDSuuq35vg.png"><figcaption></figcaption></figure><figure id="ea36"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*UpGil9fWjrG_qH9neHcvnQ.png"><figcaption></figcaption></figure><figure id="bf38"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*o-ENjznrvSjd_vhpjd3OYw.png"><figcaption>Differences between S(n) and corresponding Normal Distribution</figcaption></figure><p id="fb75">Looks a bit ugly compared to the other charts in the series, but I like the irregular nature of these charts, you can really see that they are imperfect representations of a strict underlying pattern.</p><p id="1729">If the line is far from 0, in positive or negative direction, that means there is a large difference between the <i>S</i>(<i>n</i>) and the normal distribution. The pattern is very visible: as we increase <i>n</i>, the line gets closer and closer to 0. The “teeth” are also getting smaller and smaller, meaning there are more and more different values the S(n) distribution potentially takes, meaning it is getting closer and closer to a continuous normal distribution.</p><p id="4bec">One really interesting feature we had not seen before is the large relative differences at the two tails. This brings up a very interesting problem, because as we can see, it is the tails where the approximation is the worst, and, ironically, the tail is what we need the normal approximation for. When we calculate things like confidence intervals with the normal approximation, we are mostly interested in the top and bottom x% of the distribution. In the CDF visualisations in the previous sections, those tail areas were kind of hard to see and the blue and green just blended together. With this new visualisation tool, this phenomena is very much visible. And this is the reason why I decided it’s important to include this visibly less appealing chart as well.</p><p id="d87e">In case you are wondering, we are not going to do the standardised versions for this particular chart type, (plotting Standardised S(n) against a N(0,1)), I do not think it would be significantly different from the charts above.</p><h2 id="1658">Animated Differences from Normal</h2><p id="82fc">Just like for the CDFs, let’s see how the charts look if we put everything in one animated <code>gif</code>.</p><figure id="5375"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Ay0azpmjwshSwrG_h22MwA.gif"><figcaption>Animation of difference of S(n) and the corresponding Normal distribution between n = 1 and n = 200</figcaption></figure><p id="cb0e">We can see how the lines are getting closer and closer to 0, with the exception of the tails, which can show wild, unexpected behaviour.</p><h2 id="96e3">Summary</h2><p id="6c2c">And that was the Central Limit Theorem in a nutshell.</p><p id="7749">In summary, we went through the long process of a Bernoulli sample mean converging to a normal distribution. We proposed two visualisation methods for the Central Limit Theorem: combined cumulative distribution functions and the sorted differences between normal and sample mean.</p><p id="069d">As far as I’m concerned, the most significant achievement of this project were the animated charts, I feel like they are as useful as they are pleasing to the eye. Hopefully, we all learned something, and our understanding of this widely used but scarcely understood concept deepened and solidified.</p><p id="be5f">The story, for now, does not continue, see links below for my other posts on the topic.</p><p id="a844"><a href="https://readmedium.com/central-limit-theorem-misconceptions-f9eb9710e4bb">Part 1 — Misconceptions</a></p><p id="372a"><a href="https://readmedium.com/central-limit-theorem-the-case-against-histograms-7fc80f076229">Part 2 — The Case Against Histograms</a></p><p id="e2d8">Part 3 — Recommended Visualisations</p><p id="8b00">However, in the future, I think it would be worth spending some time on running normality tests on sample mean samples, and see if there is any truth behind the <i>“magic number 30”</i> concept that we discussed in the first post.</p></article></body>

Central Limit Theorem — Recommended Visualisations

Animated charts that properly capture the sample mean’s convergence in accordance with the Central Limit Theorem.

Photo by Anni Roenkae from Pexels

This is the third post in my Central Limit Theorem series, see links to the other posts in at the end. We started our analysis by first having a look at the most common misconceptions surrounding the Law of Large Numbers and the Central Limit Theorem, and then we discussed why using histograms as a tool to demonstrate the theorem can be misleading.

I consider this third post to be the most important achievement of the project: we are finally looking at the charts that I think are good visual representations of the theorem.

The underlying code is mostly straightforward, the whole project can be followed on my GitHub. There is one area where I include code in the post, in the part where we create the animated charts.

Our Test Distribution

Like in the second post, we are going to use a Bernoulli distribution to showcase the Central Limit Theorem, referred to as CLT in the future. This random variable can have two values, with probability p it will be 1, and with probability 1-p it will be 0 (0≤p≤1).

Our Bernoulli distribution will have a constant p = 0.3, which means the expected value μ = 0.3, and the variance σ² = 0.3 * 0.7 = 0.21.

And a quick refreshment on our math notations, X(1), X(2), …, X(n) are going to be the individual Bernoulli variables, n is the sample size, and S(n) is the average of the X’s, so S(n) = [X(1) + … + X(n)] / n. It is S(n) that is going to converge to a normal distribution.

Cumulative Distribution Function

You might have noticed how we avoided putting an actual normal distribution on the charts in the previous post. That is because I think it would have lead to more confusion. But now, we are finally plotting S(n) and the approximating normal distribution on the same chart, using a framework that enables us to visualise discrete and continuous variables: the cumulative distribution function, or CDF.

In the pictures below, green is a normal distribution, and please note again, that it is a different distribution for each n, it is N(μ, σ² / n) to be precise, where μ and σ are from the Bernoulli distribution defined above. The expected value is constant, but as n increases, the variance decreases, resulting in an almost vertical green line for higher n’s. Blue line is the CDF of S(n), plotted using the simulated balanced sample we created in the previous post. And of course the phenomena we are observing is how the blue line is getting closer and closer to the corresponding green line.

Plotting lower n = 10, 20, 50 first:

CDFs of S(n) and Normal distributions

just to be safe, let’s see our usual 100, 250, 500 from the previous post:

CDFs of S(n) and Normal distributions

Great, I think the charts above capture the CLT very well. The Law of Large Numbers, LLN is also hinted at: notice how the distribution is more and more condensed around the expected value of the underlying Bernoulli distribution, μ = 0.3. This visually supports the concept that we o-so-sloppily described in the first post: sample mean converges to a normal distribution in distribution and it also converges to the original distribution’s expected value in probability.

In order to show the convergence in the same magnitude, we can also transform the S(n) by subtracting μ and multiplying by [sqrt(n) / σ], I’m calling this new distribution Standardised S(n). See CLT’s Wikipedia page for more details on the math. With this transformation, N(0,1) will be the approximating normal distribution, regardless of n.

Standardised CDF charts for n = 10, 20, 50:

CDFs of Standardised S(n) and Standard Normal distributions

and for n = 100, 250, 500:

CDFs of Standardised S(n) and Standard Normal distributions

The smoothing effect is much more visual in this second, standardised set, because the underlying normal distribution does not get more and more concentrated around 0.3. However, it can be misleading, and it can encourage the common misconception that S(n) converges to one specific normal distribution.

Animated CDF’s

Finally! This is what we were building up to! In this section we are going to create animations of the CDF charts. It’s basically the same information as we had in the previous section, just more mesmerising:

Animation of CDFs of S(n) and the corresponding Normal distribution between n = 1 and n = 100

Once again, the CLT and the LLN can be observed beautifully at the same time. The animation above is my absolute favourite one, if anyone asked me to “show” the Central Limit Theorem, that is the visualisation I would pick.

Throughout this post, I did not include code snippets, but I think creating a gif is not entirely straightforward. I experimented with matplotlib’s animation package, but found it cumbersome to work with. The good news is, you can easily create gif’s just by saving your charts as pictures. Once you have a pyplot chart, you can save it as png:

plt.savefig(./pictures_folder/name_of_pictures_<number>.png)

Change pictures_folder and name_of_pictures_ to whatever you want to use, and make sure you put a number at the end of the file names. The gif will be created by taking the files in alphabetical order, so if you go from 1 to 100, make sure to encode it as 001, 002, …

Once you have your pictures saved, you have to navigate there and enter the following command in your terminal:

convert -delay 10 name_of_pictures_*.png name_of_gif.gif

(At least that is how it works on MacOS, I’m sure something similar exists for other systems.) Just make sure you replace name_of_pictures_ to whatever you used to save the png files.

While we are at it, we can do the same for the standardised CDFs. For the non-standard case, I only ran the animation until n = 100, because not much is changing at this scope afterwards, but we can go nuts with the standardised distribution. Below, we are plotting it between n = 1 and n = 500.

Animation of CDFs of standardised S(n) and the Standard Normal distribution between n = 1 and n = 500

Same outcome as in the previous section, the convergence shows a nicer pattern, but we lose the shrinking aspect, and the LLN is not visualised.

Difference from the Normal Distribution

Another way we can think about S(n) in relation to a normal distribution is by calculating the differences between the two distributions.

Once again, we are going to use our “accurate” samples that we created for the PMFs in the previous post. This is our way of getting around the sampling issue. Please note that randomly generating a sample with a higher magnitude would doubtless yield very similar results. Our approach is just computationally less expensive.

So, for each n, we have a list of 1,000 numbers, which are from the “accurate” sample, ensuring that there is no significant sampling error. For each n, we are also generating the corresponding normal distribution with 1,000 samples. We sort both lists in an ascending order, and calculate the difference. On the charts below, xlabel is simply the position of the element in the list. It does not really have an intrinsic meaning, so I omitted them from the charts below. If we had a sample size of 10,000 instead of 1,000, the shapes would be very similar, with different numbers in the xlabel, which seemed confusing to me. Hence the lack of label below on the x axis.

Plotting this difference for n = 10, 50, 100:

Differences between S(n) and corresponding Normal Distribution

Looks a bit ugly compared to the other charts in the series, but I like the irregular nature of these charts, you can really see that they are imperfect representations of a strict underlying pattern.

If the line is far from 0, in positive or negative direction, that means there is a large difference between the S(n) and the normal distribution. The pattern is very visible: as we increase n, the line gets closer and closer to 0. The “teeth” are also getting smaller and smaller, meaning there are more and more different values the S(n) distribution potentially takes, meaning it is getting closer and closer to a continuous normal distribution.

One really interesting feature we had not seen before is the large relative differences at the two tails. This brings up a very interesting problem, because as we can see, it is the tails where the approximation is the worst, and, ironically, the tail is what we need the normal approximation for. When we calculate things like confidence intervals with the normal approximation, we are mostly interested in the top and bottom x% of the distribution. In the CDF visualisations in the previous sections, those tail areas were kind of hard to see and the blue and green just blended together. With this new visualisation tool, this phenomena is very much visible. And this is the reason why I decided it’s important to include this visibly less appealing chart as well.

In case you are wondering, we are not going to do the standardised versions for this particular chart type, (plotting Standardised S(n) against a N(0,1)), I do not think it would be significantly different from the charts above.

Animated Differences from Normal

Just like for the CDFs, let’s see how the charts look if we put everything in one animated gif.

Animation of difference of S(n) and the corresponding Normal distribution between n = 1 and n = 200

We can see how the lines are getting closer and closer to 0, with the exception of the tails, which can show wild, unexpected behaviour.

Summary

And that was the Central Limit Theorem in a nutshell.

In summary, we went through the long process of a Bernoulli sample mean converging to a normal distribution. We proposed two visualisation methods for the Central Limit Theorem: combined cumulative distribution functions and the sorted differences between normal and sample mean.

As far as I’m concerned, the most significant achievement of this project were the animated charts, I feel like they are as useful as they are pleasing to the eye. Hopefully, we all learned something, and our understanding of this widely used but scarcely understood concept deepened and solidified.

The story, for now, does not continue, see links below for my other posts on the topic.

Part 1 — Misconceptions

Part 2 — The Case Against Histograms

Part 3 — Recommended Visualisations

However, in the future, I think it would be worth spending some time on running normality tests on sample mean samples, and see if there is any truth behind the “magic number 30” concept that we discussed in the first post.

Central Limit Theorem
Statistics
Python
Recommended from ReadMedium