Kyle Jones

Summary

The website content provides a tutorial on visualizing normal distribution using Python and Matplotlib, demonstrating how to simulate data, create histograms, and apply various plotting techniques to analyze the distribution's properties.

Abstract

The provided text is a guide to using Python with the Matplotlib library to visualize normal distributions. It begins by explaining the significance of histograms in data analysis for understanding the center, spread, and shape of data. The tutorial then offers a Python function, plot_norm_hist, to generate a histogram of a normal sample and overlay it with the corresponding probability density function curve. The text illustrates how increasing the number of data points affects the histogram's smoothness and uses vertical lines at 0.67 standard deviations on either side of the mean, which bracket roughly the middle 50% of a normal distribution. Additionally, it explores the use of colors and shading with axvspan to highlight different data segments and introduces box plots as another tool for visualizing numerical data, emphasizing their ability to show percentiles and outliers.

Opinions

  • The author suggests that histograms are crucial for data analysis, particularly for interpreting the concentration, variability, and pattern of data distribution.
  • The use of Matplotlib's plt.hist function in conjunction with the normal distribution equation is presented as an effective method for visualizing the fit of data to a theoretical distribution.
  • The author implies that the visual representation of data becomes more accurate as the sample size increases, transitioning from 1,000 to 100,000 data points.
  • By using green vertical lines, the author demonstrates a technique to highlight the range within 0.67 standard deviations of the mean, which covers roughly the middle 50% of a normal distribution.
  • The author's inclusion of multiple color layers, while acknowledging that not all would be used simultaneously in real-world applications, serves to illustrate the versatility of Matplotlib for data visualization.

Visualizing the normal distribution with Python and Matplotlib

This is a simple Python project that shows how to simulate a normal distribution and plot it using Matplotlib.

An early step in data analysis is building graphical summaries of the data. These help us focus on different attributes of the data. One of the most important tools for analyzing numerical data is a histogram.

A histogram is a type of bar chart that divides the total range of the data into a number of “bins” of equal width and then sorts the data into the bins based on those ranges. It answers questions about

  1. center (Where do the numbers tend to concentrate?),
  2. spread (How variable is the data?), and
  3. shape (In what pattern do the data tend to fall?).
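
The binning step described above can be sketched with NumPy alone, before any plotting. This is a minimal illustration, not the article's own code: the seed, the choice of 10 bins, and the variable names are assumptions made for the example.

```python
import numpy as np

# Seeded generator so the example is reproducible (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=1000)

# Split the data's range into 10 equal-width bins and count the values in each
counts, edges = np.histogram(data, bins=10)
widths = np.diff(edges)

print("bin edges:", np.round(edges, 2))
print("counts:   ", counts)

# With density=True the bar heights are rescaled so the total bar area is 1,
# which is what allows a probability density curve to be drawn over the bars
density, _ = np.histogram(data, bins=10, density=True)
total_area = np.sum(density * widths)
print("total area:", total_area)
```

The counts answer the "center" and "shape" questions directly: the bins near the mean hold the most values, and the counts fall off roughly symmetrically on both sides.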
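
import numpy as np
import matplotlib.pyplot as plt
# Jupyter-only magic; uncomment when running in a notebook
# %matplotlib inline

def plot_norm_hist(s, mu, sigma, vline=True, title=True):
    """Plot a density histogram of s with the normal(mu, sigma) curve on top."""
    # density=True rescales the bars so their total area is 1
    count, bins, ignored = plt.hist(s, 30, density=True)
    # Normal probability density evaluated at the bin edges
    plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
                   np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
             linewidth=2, color='r')

    if vline:
        # mu ± 0.67*sigma is roughly the 25th and 75th percentiles
        lline = -.67*sigma + mu
        uline = .67*sigma + mu
        plt.axvline(lline, color='g')
        plt.axvline(uline, color='g')

    if title:
        plt.title("Normal distribution with mean: {:.02f} and StDev: {:.02f}".format(mu, sigma))
    plt.show()

mu, sigma = 0, 1 # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)

plot_norm_hist(s, mu, sigma, vline=True, title=True)

A histogram showing 1,000 random values. The mean is 0 and the standard deviation is 1. The green lines mark 0.67 standard deviations on either side of the mean, roughly the 25th and 75th percentiles.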
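
The green lines sit at mu ± 0.67·sigma. For a normal distribution that is approximately where the 25th and 75th percentiles fall, so about half of the values should land between the two lines. The check below is a sketch that uses the standard library's statistics.NormalDist for the theoretical share and a seeded sample for the empirical one; the seed and sample size are assumptions chosen for the example.

```python
from statistics import NormalDist
import numpy as np

mu, sigma = 0, 1
dist = NormalDist(mu, sigma)

# Theoretical share of the distribution between mu - 0.67*sigma and mu + 0.67*sigma
theoretical = dist.cdf(mu + 0.67 * sigma) - dist.cdf(mu - 0.67 * sigma)

# The 75th percentile of the standard normal, in units of sigma
z75 = NormalDist().inv_cdf(0.75)

# Empirical share in a simulated sample (seed chosen only for reproducibility)
rng = np.random.default_rng(1)
s = rng.normal(mu, sigma, 100_000)
empirical = np.mean(np.abs(s - mu) < 0.67 * sigma)

print(f"theoretical share: {theoretical:.3f}")
print(f"75th percentile z: {z75:.4f}")
print(f"empirical share:   {empirical:.3f}")
```

Both shares should come out close to one half, which is why 0.67 is a convenient rounded multiplier for marking the middle 50% of the data.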

We can see how the histogram “smooths” as we increase the number of simulated values from 1,000 to 100,000.

mu, sigma = 50, 10 # mean and standard deviation
s = np.random.normal(mu, sigma, 100000)

# How far the sample mean and sample standard deviation land from the true values
print(abs(mu - np.mean(s)))
print(abs(sigma - np.std(s, ddof=1)))

# Density histogram (semi-transparent) with the normal curve on top
count, bins, ignored = plt.hist(s, 30, density=True, alpha=.3)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
               np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
         linewidth=2, color='r')
# Green lines at mu ± 0.67*sigma, roughly the 25th and 75th percentiles
lline = -.67*sigma + mu
uline = .67*sigma + mu
plt.axvline(lline, color='g')
plt.axvline(uline, color='g')
plt.title("Normal distribution with mean: {:.02f} and StDev: {:.02f}".format(mu, sigma))
plt.show()
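
The "smoothing" can also be put into numbers. The sketch below measures the largest gap between the histogram bar heights and the normal density at the bin centers for 1,000 and for 100,000 simulated values. The helper names normal_pdf and max_hist_error are hypothetical, introduced only for this illustration, and the seed is arbitrary.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def max_hist_error(n, mu=50, sigma=10, bins=30, seed=0):
    """Largest gap between histogram bar heights and the density at bin centers."""
    rng = np.random.default_rng(seed)
    s = rng.normal(mu, sigma, n)
    heights, edges = np.histogram(s, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    return np.max(np.abs(heights - normal_pdf(centers, mu, sigma)))

err_small = max_hist_error(1_000)
err_large = max_hist_error(100_000)
print(f"max error with   1,000 values: {err_small:.5f}")
print(f"max error with 100,000 values: {err_large:.5f}")
```

The larger sample should track the curve far more closely, since random fluctuation in each bin shrinks as the number of values per bin grows.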

Now we can apply some colors to draw attention to different parts of the data. I wouldn’t use all of these in real life, but I’m including them so you can see how shaded regions can be layered using axvspan.

mu, sigma = 0, 1 # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)

# How far the sample mean and sample standard deviation land from the true values
print(abs(mu - np.mean(s)))
print(abs(sigma - np.std(s, ddof=1)))

count, bins, ignored = plt.hist(s, 30, density=True, alpha=.5)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
               np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
         linewidth=2, color='r')
# Four shaded bands split at -0.67, 0, and 0.67; darker shading toward the right
plt.axvspan(-4, -.67, color='g', alpha=0.1)
plt.axvspan(-.67, 0, color='g', alpha=0.2)
plt.axvspan(0, .67, color='g', alpha=0.3)
plt.axvspan(.67, 4, color='g', alpha=.4)
plt.show()
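
Because the bands are split at -0.67, 0, and 0.67, each one should hold roughly a quarter of a standard normal sample. The following sketch counts the share of simulated values in each band; the seed and the larger sample size are assumptions made to keep the numbers stable.

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.normal(0, 1, 100_000)

# Same band edges as the axvspan calls above
edges = [-4, -0.67, 0, 0.67, 4]
counts, _ = np.histogram(s, bins=edges)
shares = counts / s.size

for lo, hi, share in zip(edges[:-1], edges[1:], shares):
    print(f"{lo:>5} to {hi:<5}: {share:.3f}")
```

Values beyond ±4 fall outside every band, but for a standard normal sample they are rare enough that the four shares still add up to almost exactly 1.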

Another graphical tool for numerical data is the box plot. This plot typically shows five numbers: the minimum value, the 25th percentile, the median, the 75th percentile, and the maximum value.

The 25th percentile is the number such that (approximately) 25% of the data falls below it and (approximately) 75% of the data falls above it.

Outliers, data values that are extremely small or large compared to the rest of the data, are typically plotted separately.
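
The five numbers can be computed directly with np.percentile. The sketch below also flags outliers with the 1.5 × IQR rule, which is a common convention and Matplotlib's default whisker setting (whis=1.5); the seed and variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
s = rng.normal(0, 1, 1000)

# Five-number summary: minimum, 25th percentile, median, 75th percentile, maximum
minimum, q1, median, q3, maximum = np.percentile(s, [0, 25, 50, 75, 100])
print(f"min={minimum:.2f}  q1={q1:.2f}  median={median:.2f}  q3={q3:.2f}  max={maximum:.2f}")

# About 25% of the values fall below the 25th percentile
below_q1 = np.mean(s < q1)
print(f"share below q1: {below_q1:.3f}")

# Outliers under the 1.5 * IQR rule: points beyond the fences
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower_fence) | (s > upper_fence)]
print(f"{outliers.size} values lie outside [{lower_fence:.2f}, {upper_fence:.2f}]")
```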

fig1, ax1 = plt.subplots()
ax1.set_title('Basic Plot')
# showfliers=False hides the outlier points; vert=False draws the box horizontally
ax1.boxplot(s, showfliers=False, vert=False)
plt.show()
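
Since showfliers=False hides the outlier points, it can help to inspect the numbers behind the box plot. One way, sketched here under the assumption that a seeded standard normal sample stands in for s, is matplotlib.cbook.boxplot_stats, which returns the quartiles, whisker ends, and outliers ("fliers") without drawing anything.

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(4)
s = rng.normal(0, 1, 1000)

# boxplot_stats returns one dict of statistics per dataset
stats = cbook.boxplot_stats(s)[0]

print(f"q1={stats['q1']:.2f}  median={stats['med']:.2f}  q3={stats['q3']:.2f}")
print(f"whiskers: {stats['whislo']:.2f} to {stats['whishi']:.2f}")
print(f"outliers hidden by showfliers=False: {len(stats['fliers'])}")
```

Setting showfliers=True (the default) in ax1.boxplot would draw those same outlier points as individual markers beyond the whiskers.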