Understand the Softmax Function in Minutes

Learning machine learning? Specifically trying out neural networks for deep learning? You have likely run into the Softmax function, an activation function that turns numbers, aka logits, into probabilities that sum to one. The Softmax function outputs a vector that represents the probability distribution over a list of potential outcomes. It’s also a core element of deep learning classification tasks. We will help you understand the Softmax function in a beginner-friendly manner by showing you exactly how it works: by coding your very own Softmax function in Python.
If you are implementing Softmax in PyTorch and you already know PyTorch well, scroll down to the Deep Dive section and grab the code. Prefer watching a YouTube video? Scroll down to the YouTube video.
This article has gotten really popular: 5800+ claps. It is updated constantly. Latest update Jan 2020 added a TL;DR section for busy souls. Dec 2019 (Softmax with Numpy Scipy Pytorch functional. Visuals indicating the location of Softmax function in Neural Network architecture.) and full list of updates below. Your feedback is welcome! You are welcome to translate it and cite it. We would appreciate it if the English version is not reposted elsewhere. A link back is always appreciated. Comment below and share your links so that we can link to you in this article. Clap for us on Medium. Thank you in advance for your support!

Skill prerequisites: the demonstration code is written with Python list comprehensions (scroll down to see an entire section explaining list comprehension). The math operations demonstrated are intuitive and code-agnostic: they come down to taking exponentials, sums, and division, aka the normalization step. This article is for your personal use only, not for production or commercial usage. Please read our disclaimer.

The above Udacity lecture slide shows that the Softmax function turns logits [2.0, 1.0, 0.1] into probabilities [0.7, 0.2, 0.1], and the probabilities sum to 1.
In deep learning, the term logits layer is popularly used for the last neuron layer of neural network for classification task which produces raw prediction values as real numbers ranging from [-infinity, +infinity ]. — Wikipedia
Logits are the raw scores output by the last layer of a neural network, before activation takes place.
TL;DR:
Softmax turns logits (the numeric output of the last linear layer of a multi-class classification neural network) into probabilities by taking the exponential of each output and then normalizing each number by the sum of those exponentials, so that the entire output vector adds up to one, as all probabilities should. Cross entropy loss is usually the loss function for such a multi-class classification problem. Softmax is frequently appended to the last layer of an image classification network, such as the CNNs (VGG16, for example) used in ImageNet competitions.
Here’s the NumPy Python code for the Softmax function.
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
Above is the visual.
Softmax is not a black box. It has two components: the special number e raised to some power, divided by a sum of some sort.
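The figure with the formula is not reproduced here. Assuming the standard definition of Softmax, and using y for the logits vector, it can be written as follows (the index j in the sum is notation introduced for this sketch):

```latex
\mathrm{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}
```

The numerator is the "e to some power" component, and the denominator is the sum taken over every element of the logits vector.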
y_i refers to each element in the logits vector y. Python and NumPy code will be used in this article to demonstrate the math operations. Let’s see it in code:
import numpy as np

logits = [2.0, 1.0, 0.1]
exps = [np.exp(i) for i in logits]
We use numpy.exp(power) to raise the special number e to any power we want. We use a Python list comprehension to iterate through each i of the logits and compute np.exp(i). If you are not familiar with Python list comprehension, read the explanation in the list comprehension section further down first. Logit is another name for a numeric score. The result is stored in a list called exps. The variable name is short for exponentials.
Why not just divide each logit by the sum of the logits? Why do we need exponentials? A logit is the logarithm of the odds (see the graph on the Wikipedia page, https://en.wikipedia.org/wiki/Logit), so it ranges from negative infinity to positive infinity. When some logits are negative, adding them together does not give us a correct normalization. Exponentiating the logits turns them all into positive numbers!
e**(100) = 2.6881171e+43
e**(-100) = 3.720076e-44 # a very small number
3.720076e-44 > 0 # still returns True
By the way, using exponentials of the special number e also makes the math easier later! The logarithm of a product can be turned into a sum, which simplifies summation and derivative calculation: log(a*b) = log(a) + log(b).
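To make the point about negative logits concrete, here is a minimal sketch. The logits are made-up values chosen for illustration, and math.exp from the standard library stands in for np.exp. Dividing by the plain sum yields numbers that add up to one but are not valid probabilities, while exponentiating first keeps every value strictly between 0 and 1.

```python
import math

# Hypothetical logits, chosen so that one of them is negative.
logits = [1.0, -1.0, 0.5]

# Naive normalization: divide each logit by the sum of the logits.
total = sum(logits)                       # 0.5
naive = [x / total for x in logits]       # contains a negative value and a value above 1

# Softmax-style normalization: exponentiate first, then divide by the sum.
exps = [math.exp(x) for x in logits]      # every entry is strictly positive
sum_of_exps = sum(exps)
probs = [e / sum_of_exps for e in exps]   # every entry lies strictly between 0 and 1

print(naive)
print(probs)
```

Note that the naive version would also fail outright with a division by zero whenever the logits happen to sum to zero.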
Replacing i with logit gives a more verbose way to write it out: exps = [np.exp(logit) for logit in logits]. Note the use of plural and singular nouns. It’s intentional.
We just computed the top part of the Softmax function: for each logit, we raised e to the power of that logit. Each transformed logit j needs to be normalized by another number so that all the final outputs, which are probabilities, sum to one.
We compute the sum of all the transformed logits and store the sum in a single variable sum_of_exps, which we will use to normalize each of the transformed logits.
sum_of_exps = sum(exps)
Now we are ready to write the final part of our Softmax function: each transformed logit j needs to be normalized by sum_of_exps, which is the sum of all the transformed logits including itself.
softmax = [j/sum_of_exps for j in exps]
Again, we use a Python list comprehension: we grab each transformed logit with for j in exps and divide each j by sum_of_exps.
List comprehension gives us a list back. When we print the list we get
>>> softmax
[0.6590011388859679, 0.2424329707047139, 0.09856589040931818]
>>> sum(softmax)
1.0
The output rounds to [0.7, 0.2, 0.1], as seen on the slide at the beginning of this article. The values sum nicely to one!
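The three steps above (exponentiate, sum, divide) can be collected into one small function. This is only a sketch: the name softmax_list is made up for illustration, and it uses math.exp from the standard library in place of np.exp so that it runs without NumPy.

```python
import math

def softmax_list(logits):
    """Softmax over a plain Python list, following the steps in this article."""
    # Step 1: raise e to the power of each logit.
    exps = [math.exp(logit) for logit in logits]
    # Step 2: sum the transformed logits.
    sum_of_exps = sum(exps)
    # Step 3: normalize each transformed logit by that sum.
    return [j / sum_of_exps for j in exps]

probs = softmax_list([2.0, 1.0, 0.1])
rounded = [round(p, 1) for p in probs]
print(rounded)  # [0.7, 0.2, 0.1]
```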
Softmax in the Forward Function

Here’s another perspective on where the Softmax function sits in a neural network, as represented by matrix operations. Source: Stanford CS 231n CNN class. Note in the bottom right box: the first column vector A is the result of matmul(W, X) + b; each component is then exponentiated to generate column vector B, which is then normalized by the sum of B to get column vector C, the Softmax vector, which always sums to 1.
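The A, B, C pipeline described above can be sketched in NumPy. The shapes and random values below are assumptions made purely for illustration (3 classes, 4 input features); they are not taken from the CS 231n figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 output classes, 4 input features.
W = rng.normal(size=(3, 4))   # weight matrix
X = rng.normal(size=(4, 1))   # one input as a column vector
b = np.zeros((3, 1))          # bias column vector

A = np.matmul(W, X) + b       # raw scores (logits), shape (3, 1)
B = np.exp(A)                 # exponentiate each component
C = B / np.sum(B)             # normalize by the sum of B

print(C.ravel(), C.sum())
```

Whatever values W, X and b take, every entry of C is positive and the entries of C add up to one.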
Functional Implementation of Softmax Function
Implementing Softmax Using NumPy
Now that you know the Pythonic way to implement Softmax, can you implement it using NumPy?
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
Implementation of Softmax in SciPy
Below is the name of the API and its NumPy equivalent, as specified in the SciPy documentation.
scipy.special.softmax
softmax(x) = np.exp(x)/sum(np.exp(x))
Extra: Understanding List Comprehension
This post uses a lot of Python list comprehension, which is more concise than Python loops. If you need help understanding Python list comprehension, type the following code into your interactive Python console (on Mac, launch Terminal and type python after the dollar sign $ to launch).
sample_list = [1,2,3,4,5]
# the console prints nothing for an assignment

sample_list # console returns [1,2,3,4,5]

# print the sample list using list comprehension
[i for i in sample_list] # console returns [1,2,3,4,5]
# note anything before the keyword 'for' will be evaluated
# in this case we just display 'i' each item in the list as is
# for i in sample_list is a short hand for
# Python for loop used in list comprehension

[i+1 for i in sample_list] # returns [2,3,4,5,6]
# can you guess what the above code does?
# yes, 1) it will iterate through each element of the sample_list
# that is the second half of the list comprehension
# we are reading the second half first
# what do we do with each item in the list?
# 2) we add one to it and then display the value
# 1 becomes 2, 2 becomes 3

# note the entire expression 1st half & 2nd half are wrapped in []
# so the final return type of this expression is also a list
# hence the name list comprehension
# my tip to understand list comprehension is
# read the 2nd half of the expression first
# understand what kind of list we are iterating through
# what is the individual item aka 'each'
# then read the 1st half
# what do we do with each item

# can you guess the list comprehension for
# squaring each item in the list?
[i*i for i in sample_list] # returns [1, 4, 9, 16, 25]
To summarize:
The code first assigns [1,2,3,4,5] to a variable called sample_list. In a list comprehension, whatever comes before the keyword ‘for’ is evaluated for each item, and for i in sample_list is shorthand for a Python for loop. To display each item as is, we can use [i for i in sample_list]. To add 1 to each i, we can use [i+1 for i in sample_list], which returns [2,3,4,5,6]. Our tip for reading a list comprehension is: 1) read the second half of the expression first. Understand what kind of list we are iterating through and what the individual item, aka ‘each’, is. 2) Then read the first half, which states what we do with each item. Notice that the entire expression (first half and second half) is wrapped in square brackets [], so the final return type of the expression is a list. Hence the name list comprehension. Can you guess the list comprehension for squaring each item in the list? It is [i*i for i in sample_list], which returns [1, 4, 9, 16, 25].
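For comparison, here is the squaring example written both as an ordinary for loop and as a list comprehension. The variable names squares_loop and squares_comp are made up for this sketch; both produce the same list.

```python
sample_list = [1, 2, 3, 4, 5]

# The long way: an explicit for loop that appends to a list.
squares_loop = []
for i in sample_list:
    squares_loop.append(i * i)

# The short way: the same logic as a list comprehension.
squares_comp = [i * i for i in sample_list]

print(squares_loop)  # [1, 4, 9, 16, 25]
print(squares_comp)  # [1, 4, 9, 16, 25]
```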
Intuition and Behaviors of Softmax Function
If we hard-code our label data as the vectors below, a format typically used to turn categorical data into numbers (one-hot encoding), the data will look like this:
[[1,0,0], #cat
 [0,1,0], #dog
 [0,0,1]] #bird
Optional reading: FYI, this is an identity matrix in linear algebra. Note that only the diagonal positions have the value 1; the rest are all zero. This format is useful when the data is not numerical but categorical, and each category is independent of the others. For example, Yelp reviews of 1 star, 2 stars, 3 stars, 4 stars, and 5 stars can be one-hot encoded, but note that the five are related. They may be better encoded as 1 2 3 4 5. We can infer that 4 stars is twice as good as 2 stars. Can we say the same about the names of dogs? Ginger, Mochi, Sushi, Bacon, Max: is Bacon 2x better than Mochi? There’s no such relationship. In this particular encoding, the first column represents cat, the second column dog, and the third column bird.
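As a small illustration of this encoding, the sketch below builds the same three vectors with a list comprehension. The helper name one_hot and the classes list are assumptions introduced here, not part of any library.

```python
classes = ["cat", "dog", "bird"]

def one_hot(label, classes):
    """Return a list with 1 at the position of label and 0 everywhere else."""
    return [1 if label == c else 0 for c in classes]

# Encode every class in order; the result is the identity matrix shown above.
encoded = [one_hot(c, classes) for c in classes]

print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```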
The output probabilities are saying the model is 70% sure it is a cat, 20% a dog, and 10% a bird. One can see that the initial differences are adjusted to percentages: the logits [2.0, 1.0, 0.1] do not simply become the ratio 2:1:0.1. Before Softmax, we could not say that it’s 2x more likely to be a cat, because the results were not normalized to sum to one.
The output probability vector is [0.7, 0.2, 0.1]. Can we compare this with the ground truth of cat, [1,0,0] in one-hot encoding? Yes! That’s what is commonly done in cross entropy loss. In fact, cross entropy loss is the “best friend” of Softmax. It is the most commonly used cost function, aka loss function, aka criterion, used with Softmax in classification problems. More on that in a different article.
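As a rough preview of that comparison, here is a minimal sketch of cross entropy between the rounded Softmax output and a one-hot label, assuming the usual definition (the negative sum of label times log-probability, with the natural logarithm). The helper name cross_entropy is made up for illustration.

```python
import math

probs = [0.7, 0.2, 0.1]   # rounded Softmax output from this article

def cross_entropy(target, probs):
    """Negative sum of target * log(prob) over all classes."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

loss_if_cat = cross_entropy([1, 0, 0], probs)   # ground truth is cat
loss_if_dog = cross_entropy([0, 1, 0], probs)   # ground truth is dog

print(loss_if_cat, loss_if_dog)
```

Because the one-hot label is zero everywhere except at the true class, the loss reduces to the negative log of the probability assigned to that class. It is small when the model is confident and right (cat) and larger when the true class received a low probability (dog).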
Why do we still need fancy machine learning libraries with fancy Softmax functions? The nature of machine learning training requires tens of thousands of samples of training data. Even something as concise as the Softmax function needs to be optimized to process each element. Some say that TensorFlow broadcasting is not necessarily faster than NumPy’s matrix broadcasting, though.
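One practical difference is worth sketching. A common trick, not shown in the one-line version earlier in this article, is to subtract the largest logit from each row before exponentiating. This leaves the result unchanged mathematically but avoids overflow for large logits. The function name softmax_batch and the sample batch below are assumptions made for illustration, not any library's actual implementation.

```python
import numpy as np

def softmax_batch(x, axis=-1):
    """Softmax along `axis`, shifted by the max for numerical stability."""
    x = np.asarray(x, dtype=float)
    # Subtracting the per-row max does not change the result, because the
    # common factor e**(-max) cancels between numerator and denominator.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

# Two samples (rows), three classes (columns). The second row has logits so
# large that np.exp on them directly would overflow to inf.
batch = np.array([[2.0, 1.0, 0.1],
                  [1000.0, 1001.0, 1002.0]])
probs = softmax_batch(batch)

print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

The first row reproduces the probabilities computed step by step earlier, and the second row stays finite instead of turning into nan.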
Watch this Softmax tutorial on YouTube
Visual learner? Prefer watching a YouTube video instead? See our tutorial below.





