Do You Know What Shannon’s Entropy Is?
A Complete Comprehensive Guide

Entropy is a central idea in physics, so why should deep learning be any different? It is used heavily in information theory (the variant used there is Shannon’s entropy) and has also made its way into deep learning through cross-entropy loss and KL divergence. Let’s understand the concept of Shannon’s entropy.
What is Entropy?
In layman’s terms, entropy can be described as:

A measure of the uncertainty (randomness) in a system.
The most basic example is a fair coin: when you toss it, what will you get? Heads (1) or tails (0).
Because both outcomes have the same probability (1/2), there is no way to tell in advance. So, if you are playing this coin-tossing game over the phone, you must tell the other person the outcome, and you need a single bit (0 or 1) to convey it.
Let’s consider the case of a completely biased coin (would always end up on heads):
Would you need to tell the other person the result? The simple answer is no, because both of you already know the answer (as long as nobody is cheating 😄). Hence, you need not say anything. You are not using even a single bit.
Digesting Entropy Intuitively
To digest entropy intuitively, let’s revisit its definition:

Entropy is a measure of the uncertainty (randomness) in a system.

And its mathematical definition:

$$H = -\sum_{i=1}^{c} p_i \log_2 p_i$$
There is one pertinent thing to notice: the word ‘system’ appears in the layman definition, and the probability term pᵢ occurs twice in the formula, each time in a different role.
Let’s revisit the definition of probability. What is probability? It is the likelihood that an event occurs. The definition talks about a single event, not the whole system. Thus, probability gives us only a local, limited picture of the whole system.
Probability gives a local picture of the whole system
In order for us to get a sense of the whole system, we need to come up with a way that tells us a global picture of the whole system. We need to evaluate the parts of the system and see their effect in summation.
How do you evaluate a part of the system for the randomness it will contribute to the system?
The pᵢ part of the entropy formula tells us how important the event is within the whole system. How? The probability is the fraction of all trials in which the event happens, so a more frequent event carries more weight in the sum.
For the next part, let us look at the formula of entropy in a new light:

$$H = \sum_{i=1}^{c} p_i \log_2 \frac{1}{p_i}$$

Using −log₂pᵢ = log₂(1/pᵢ), we can rewrite the second part of the formula as the logarithm of the reciprocal of the probability. What does this give us? If probability tells you the certainty, what would its inverse signify? The uncertainty.
Therefore, the formula of entropy can be interpreted as:

$$H = \sum_{i=1}^{c} (\text{importance of event } i) \times (\text{uncertainty of event } i)$$

That is, entropy can be interpreted as a sum of products: the importance of each event times the uncertainty that event carries. Each event contributes randomness in proportion to how often it occurs and how surprising it is. Therefore, if we compute this contribution for each event and sum them up, we get the randomness of the whole system.
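The importance-times-uncertainty reading can be checked numerically. The following is a minimal Python sketch, assuming the rewritten form H = Σ pᵢ log₂(1/pᵢ); the function name `contributions` and the three-outcome distribution are made up purely for illustration.

```python
import math

def contributions(probs):
    """Return p_i * log2(1 / p_i) for each event (0 for impossible events)."""
    return [p * math.log2(1 / p) if p > 0 else 0.0 for p in probs]

# Hypothetical three-outcome system, chosen only for illustration.
probs = [0.5, 0.25, 0.25]
parts = contributions(probs)

for p, part in zip(probs, parts):
    uncertainty = math.log2(1 / p)  # the "uncertainty" factor, in bits
    print(f"importance = {p:.2f}  uncertainty = {uncertainty:.2f}  contribution = {part:.2f}")

# Summing the per-event contributions gives the entropy of the whole system.
total = sum(parts)
print("entropy =", total)
```

In this example the rarer events are individually more surprising (2 bits each against 1 bit), but they also occur less often, so every event ends up contributing the same 0.5 bits to the total.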
Digesting Entropy Mathematically
The mathematical formula of Shannon’s entropy is:

$$H = -\sum_{i=1}^{c} p_i \log_2 p_i$$
Here, c is the number of different classes you have. In the case of a coin, we have heads (1) or tails (0). Hence, c = 2.
So, the entropy of a fair coin is:

$$H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$

So, the entropy for the fair coin comes out to be 1 bit: utter uncertainty (remember the layman definition of entropy). We are totally uncertain about the result.
Now, let’s consider the case of the completely biased coin. For it, entropy is:

$$H = -\left(1 \cdot \log_2 1 + 0 \cdot \log_2 0\right) = 0$$

By convention, 0·log(0) is taken to be 0 in all entropy calculations, which matches the limit of p·log(p) as p approaches 0. With that settled, the entropy in this case is 0. We are completely certain about the result, no matter what.
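The two coin calculations can be reproduced with a few lines of Python. This is only a sketch: the helper name `shannon_entropy` is an assumption, and the 0·log(0) = 0 convention is implemented by simply skipping zero-probability terms.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits; terms with p = 0 are skipped (0 * log2(0) taken as 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])    # heads and tails equally likely
biased_coin = shannon_entropy([1.0, 0.0])  # always lands on heads

print("fair coin:  ", fair_coin)    # 1.0
print("biased coin:", biased_coin)
```

Skipping the zero terms avoids calling `math.log2(0)`, which would raise a `ValueError` in Python.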
What about when the coin is not completely biased?
Let’s plot the resultant entropy when the probability of getting a head is between 0 and 1.

We see that Entropy becomes maximum at probability 1/2.
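The shape of the curve can also be checked without a plot. Below is a small sketch that evaluates the entropy of a coin on a grid of probabilities and reports where it peaks; the function name `binary_entropy` and the 101-point grid are choices made for this illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # 0 * log2(0) is taken as 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Evaluate on an evenly spaced grid: 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
values = [binary_entropy(p) for p in grid]

# Locate the grid point with the largest entropy.
best_p = grid[values.index(max(values))]
print("entropy peaks at p =", best_p, "with value", max(values))
```

Passing `grid` and `values` to any plotting library should give the familiar symmetric curve that rises from 0 at p = 0 and falls back to 0 at p = 1.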
In the general case, the maximum value of entropy for a system with c outcomes is log₂c. This happens when all outcomes are equally likely, i.e. each has probability 1/c, which for a coin means 1/2. Why?
There are two explanations for it. Let’s see.
Intuitive Explanation:
In the case of a coin, the maximum entropy is therefore log₂2 = 1 bit.
When both outcomes are equally likely (probability 1/2 each), the entropy is highest because you have no basis for predicting what is going to happen.
To convey which of its two states the coin took, i.e. 0 or 1, you need 1 bit.
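The same intuition extends beyond coins. As a rough sketch, the snippet below compares the entropy of equally likely outcomes with log₂c for a few values of c, and with one skewed distribution. The helper `shannon_entropy` and the example distributions are assumptions made for illustration.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits; zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# With c equally likely outcomes, the entropy matches log2(c).
uniform_results = {}
for c in (2, 4, 8):
    uniform = [1 / c] * c
    uniform_results[c] = shannon_entropy(uniform)
    print(c, uniform_results[c], math.log2(c))

# A hypothetical skewed 4-outcome distribution stays below log2(4) = 2 bits.
skewed = [0.7, 0.1, 0.1, 0.1]
skewed_entropy = shannon_entropy(skewed)
print("skewed:", skewed_entropy)
```

Any distribution over the same number of outcomes that favours one of them is easier to predict than the uniform one, which is why its entropy is lower.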
Mathematical Explanation:
Feel free to skip this, if not interested.
Recall how to find the point at which a function reaches its maximum.
If you don’t remember, let’s revise:
Step 1: Take the derivative of the function and set it to zero. The solutions are the candidate points where the function has a minimum or a maximum. How do we know which one it is?
Step 2: Take the second derivative of the function and substitute the values found in the previous step. If the second derivative is less than 0, the point is a maximum of the function; if it is more than 0, the point is a minimum.
Now, let’s apply this procedure for the formula of Shannon’s Entropy.
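For a coin with probability p of heads, Shannon’s entropy is H(p) = −p log₂p − (1 − p) log₂(1 − p). Its derivative with respect to p is:

$$\frac{dH}{dp} = -\log_2 p + \log_2 (1 - p) = \log_2 \frac{1 - p}{p}$$

Equating the derivative to 0,

$$\log_2 \frac{1 - p}{p} = 0 \;\Rightarrow\; \frac{1 - p}{p} = 2^0 = 1 \;\Rightarrow\; p = \frac{1}{2}$$

For the last step, we raise 2 to the power of each side, using the fact that 2 raised to the power log₂x is x.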
Now, let’s verify whether this point is a maximum or a minimum. First, we find the second derivative of the entropy H with respect to p, the probability of heads:

$$\frac{d^2H}{dp^2} = -\frac{1}{p \ln 2} - \frac{1}{(1 - p)\ln 2} = -\frac{1}{p(1 - p)\ln 2}$$

Now, we substitute p = 1/2, the value found by equating the derivative to 0, into the second derivative:

$$\left.\frac{d^2H}{dp^2}\right|_{p = 1/2} = -\frac{4}{\ln 2} \approx -5.77$$

The second derivative is negative, which confirms that the entropy of a coin reaches its maximum at probability 1/2.
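As a sanity check on the calculus, the derivatives can be approximated numerically. The sketch below assumes the coin entropy H(p) = −p log₂p − (1 − p) log₂(1 − p) and compares central finite differences with the closed forms log₂((1 − p)/p) and −1/(p(1 − p) ln 2) that follow from it. The function names and the step size `h` are choices made for this illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin with heads probability p, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def first_derivative(p):
    """Closed form of dH/dp under the assumption above."""
    return math.log2((1 - p) / p)

def second_derivative(p):
    """Closed form of d2H/dp2 under the assumption above."""
    return -1 / (p * (1 - p) * math.log(2))

h = 1e-4  # finite-difference step
p = 0.5   # candidate maximum

# Central differences approximate the first and second derivatives at p.
numeric_first = (binary_entropy(p + h) - binary_entropy(p - h)) / (2 * h)
numeric_second = (
    binary_entropy(p + h) - 2 * binary_entropy(p) + binary_entropy(p - h)
) / h**2

print("first derivative: ", first_derivative(p), "numeric:", numeric_first)
print("second derivative:", second_derivative(p), "numeric:", numeric_second)
```

The first derivative vanishes at p = 1/2 and the second derivative is negative there, in line with the two-step procedure described above.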
Summary
In this post, we looked at Shannon’s entropy both intuitively and mathematically. We saw what its bounds are, and we derived the probability at which the entropy of a coin is maximum.
