Mackenzie Mitchell

Statistical Distributions

Breaking down discrete and continuous distributions and looking into how data scientists can apply statistics most efficiently.

What is a Probability Distribution?

A probability distribution is a mathematical function that provides the probabilities of the occurrence of various possible outcomes in an experiment. Probability distributions are used to define different types of random variables in order to make decisions based on these models. There are two types of random variables: discrete and continuous. Depending on what category the random variable fits into, a statistician may decide to calculate the mean, median, variance, probability, or other statistical calculations using a different equation associated with that type of random variable. This is important because, as experiments may become more complicated, the standard formulas that are used to calculate these parameters (like the mean) will no longer produce accurate results.

A continuous distribution (Normal Distribution) vs. a discrete distribution (Binomial Distribution)

Discrete Distributions

A discrete distribution displays the probabilities of the outcomes of a random variable with finite values and is used to model a discrete random variable. Discrete distributions can be laid out in tables and the values of the random variable are countable. These distributions are defined by probability mass functions.

The probability mass function (or pmf) calculates the probability that the random variable will assume the one specific value that it is being calculated at: Pr(X=a). An example of some graphic representations of discrete distributions is displayed below. For any value of x in the discrete framework, there is one probability that corresponds to that specific observation.
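As a quick sketch of this idea (using a hypothetical Binomial(n=10, p=0.5) variable, i.e. the number of heads in 10 fair coin flips), scipy.stats can evaluate a pmf at a single point, and the pmf summed over all countable outcomes equals 1:

```python
from scipy.stats import binom

# Hypothetical example: X ~ Binomial(n=10, p=0.5)
n, p = 10, 0.5

# Pr(X = 5): the pmf evaluated at the single value 5.
p_five = binom.pmf(5, n, p)

# Summing the pmf over every possible outcome gives 1.
total = sum(binom.pmf(k, n, p) for k in range(n + 1))

print(p_five)  # ≈ 0.246
print(total)   # ≈ 1.0
```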

Continuous Distributions

A continuous distribution displays the ranges of probabilities for the outcomes of a random variable with infinitely many possible values and is used to model a continuous random variable. Continuous distributions measure something, rather than just count. In fact, the possible values of these random variables are uncountable, and the probability of a continuous random variable taking any one specific value is zero. Continuous distributions are typically described by probability density functions.

The probability density function (or pdf) describes the relative likelihood of a continuous random variable's values; probabilities are obtained from it by integration. To calculate Pr(a≤X≤b) or Pr(X≤b), we integrate the pdf over the range [a, b] or from the lower end of the domain up to b, respectively. Some examples of graphical representations of continuous distributions are displayed below. It may be observed that these graphs have a curved shape, with no distinct probability attached to each x value. This is because, at any given specific x value or observation in a continuous distribution, the probability is zero. We can only calculate the probability that a continuous random variable lies within a range of values. It should also be noted that the area underneath the curve is equal to one, because this represents the probability of all outcomes.
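To make the integration concrete, here is a minimal sketch using the standard normal distribution: Pr(−1≤X≤1) is computed both by numerically integrating the pdf and via the cdf, and the total area under the curve comes out to 1 (finite but wide integration bounds are used for numerical robustness):

```python
from scipy.stats import norm
from scipy.integrate import quad

# Standard normal random variable X ~ N(0, 1).
a, b = -1.0, 1.0

# Pr(a <= X <= b): integrate the pdf over [a, b].
prob, _ = quad(norm.pdf, a, b)

# The same probability via the cdf: Pr(X <= b) - Pr(X <= a).
prob_cdf = norm.cdf(b) - norm.cdf(a)

# The area under the whole curve is 1 (the probability of all outcomes).
total_area, _ = quad(norm.pdf, -10, 10)

print(prob)  # ≈ 0.6827
```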

Expected Value or Mean

Each distribution, whether continuous or discrete, has its own formula for calculating the expected value or mean of the random variable. The expected value of a random variable is a measure of its central tendency; another term for the expected value is the ‘first moment’. Many of these formulas do not behave as one would intuitively expect, due to the context in which the distribution places us. It is important to remember that the expected value is the value that one expects a random variable to take, on average.

In order to calculate the expected value by hand for a discrete random variable, one must multiply each value of the random variable by the probability of that value (or the pmf), and then sum all of those values. For example, if we had a discrete random variable X with (values,probabilities): [(1,0.2),(2,0.5),(3,0.3)], then E[X] (or the expected value of X) is equal to (1 * 0.2) + (2 * 0.5) + (3 * 0.3)= 2.1. This strategy can be thought of as taking a weighted average of all the values that X can assume.
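This weighted-average calculation can be written directly in Python, using the same values and probabilities from the example above:

```python
# Discrete random variable X given as (value, probability) pairs.
values_probs = [(1, 0.2), (2, 0.5), (3, 0.3)]

# E[X]: multiply each value by its probability (the pmf) and sum.
expected = sum(x * p for x, p in values_probs)

print(expected)  # 2.1
```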

In order to calculate the expected value “by hand” for a continuous random variable, one must integrate x multiplied by the random variable’s pdf (probability density function) over the entire domain of X. If you recall, the integral of the pdf over the entire domain results in a value of 1, because that is the probability of the random variable assuming ANY of the values in its domain. This is similar to the fact that if we add up all of the probabilities for each value of a discrete random variable X, without multiplying by each corresponding value of x, the sum equals 1. Multiplying by x in the integral takes the value into account in the same way that multiplying by x in the summation does for a discrete variable. When we integrate x multiplied by the pdf, we obtain a weighted average of all of the possible observations of the random variable X.
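As a sketch of this idea, we can numerically integrate x times the pdf of a normal random variable (with hypothetical parameters μ=5, σ=2) using scipy.integrate.quad; the result should match the known mean, and integrating the pdf alone gives 1 (wide finite bounds stand in for the infinite domain):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Hypothetical parameters for X ~ N(5, 2).
mu, sigma = 5.0, 2.0
rv = norm(loc=mu, scale=sigma)
lo, hi = mu - 12 * sigma, mu + 12 * sigma  # effectively the whole domain

# E[X] = integral of x * f(x) over the domain of X.
mean, _ = quad(lambda x: x * rv.pdf(x), lo, hi)

# Integrating the pdf alone over the domain gives 1.
total, _ = quad(rv.pdf, lo, hi)
```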

Variance

Each defined random variable has a variance associated with it as well. This is a measure of the concentration of the observations of that random variable; it tells us how far the observations are spread from the mean. The variance of a constant is zero, because the mean of a constant is equal to the constant, and every observation is exactly the mean. The standard deviation, equal to the square root of the variance, is also useful. When calculating variance, the idea is to measure how far each observation of the random variable is from its expected value, square that distance, and then take the average of all of these squared distances. The formula for variance is as follows:

Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
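Continuing the discrete example from above, the variance can be computed from the first and second moments:

```python
# Same hypothetical discrete random variable as before.
values_probs = [(1, 0.2), (2, 0.5), (3, 0.3)]

# First moment E[X] and second moment E[X^2].
mean = sum(x * p for x, p in values_probs)
second_moment = sum(x**2 * p for x, p in values_probs)

# Var(X) = E[X^2] - (E[X])^2, and the standard deviation is its square root.
variance = second_moment - mean**2
std_dev = variance ** 0.5

print(variance)  # 0.49
print(std_dev)   # 0.7
```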

When beginning to study statistics and probability, the number of distributions and their respective formulas can become very overwhelming. It is important to note that if we know a random variable follows a defined distribution, we can simply use that distribution’s formulas for mean or variance (or sometimes even just its parameters) to calculate these values. Sometimes it is readily apparent on examination, or we are told, that a random variable follows a certain distribution; other times we are simply given a pmf or pdf that does not match any familiar distribution. In that case, it is best to calculate the mean, variance, or any other values you may need in your analysis using the basic formulas: multiply the random variable x by the pmf or pdf and then sum (for a discrete variable) or integrate (for a continuous one). To find the variance, follow the variance formula and obtain the second moment of the random variable using the same procedure as for the first moment, but replacing x with x².

It becomes easier to recognize random variables’ pmfs and pdfs with practice using these random variables. Being able to quickly recognize the various distributions in practice is advantageous as it can save a lot of time and help statisticians and actuaries become more efficient.

For data scientists, however, the scipy.stats library in Python provides handy classes for every probability distribution you would ever need, which can be used to easily visualize and work with these distributions. We are able to generate random variables that follow specified distributions and visualize those distributions graphically. Some examples of discrete and continuous distribution visualizations, with the Python code to obtain them, are displayed below.

First, matplotlib.pyplot and seaborn should be imported.
From there, we are able to import the distribution we want to examine, define its parameters, and plot the distribution using seaborn’s distplot. Here, we look at a continuous distribution called the Uniform Distribution.
Here we examine one of the most common continuous distributions, the Normal Distribution.
Here we look at a discrete distribution called the Binomial Distribution.
Here we look at another discrete distribution that may be less common, the Log-Series (logser) Distribution.

There are also “masked statistics functions” built into scipy.stats that enable us to quickly calculate different characteristics of random variables (mean, standard deviation, variance, coefficient of variation, kurtosis, etc.). Lastly, we are able to easily apply different transformations to our data such as the BoxCox transformation or z-score transformation by using this Python library.
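A minimal sketch of those transformations, using a small hypothetical data sample (the z-score standardizes the data, while Box-Cox requires strictly positive values and also returns the fitted λ parameter; masked-array versions of many statistics live in scipy.stats.mstats):

```python
import numpy as np
from scipy import stats

# Hypothetical, slightly skewed positive data.
data = np.array([1.0, 2.0, 2.5, 3.0, 10.0])

# z-score transformation: (x - mean) / std, giving mean 0 and std 1.
z = stats.zscore(data)

# Box-Cox transformation: returns the transformed data and the fitted lambda.
transformed, lam = stats.boxcox(data)

# Masked statistics: compute the geometric mean while masking the outlier.
masked = np.ma.masked_array(data, mask=[0, 0, 0, 0, 1])
g = stats.mstats.gmean(masked)
```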

Let’s connect:

https://www.linkedin.com/in/mackenzie-mitchell-635378101/
