All You Need to Know about Batch Size, Epochs and Training Steps in a Neural Network

And the connection between them explained in plain English with examples

You have probably searched Google for batch size, epochs and iterations more than once.

If you’re a beginner in deep learning, these technical terms are easy to confuse, mostly because the connection between them is not yet clear.

In the context of deep learning, batch size, epochs and training steps are hyperparameters that we need to configure manually. They are also common to almost every neural network model.

It is very important to know the exact meanings of these hyperparameters because we need to configure their values manually during the training process in the fit() method as follows.
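A minimal sketch of such a call is given here. It assumes that model is an already compiled tf.keras model and that X_train and y_train hold the training data; the helper name train is hypothetical. The values 128, 20 and True are the ones used throughout this article, and steps_per_epoch is assumed to be left at its default of None.

```python
def train(model, X_train, y_train):
    """Fit a compiled tf.keras model with the hyperparameter values
    discussed in this article (assumed values, see the text above)."""
    return model.fit(
        X_train,
        y_train,
        batch_size=128,        # training instances in each batch
        epochs=20,             # passes over the entire dataset
        steps_per_epoch=None,  # None: the value is worked out automatically
        shuffle=True,          # shuffle the training data before each epoch
    )
```

The function only forwards the four arguments to fit(), so any object with a Keras-style fit() method can be passed in.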

In the above code block,

The batch_size refers to batch size.
The epochs refers to the number of epochs.
The steps_per_epoch refers to training steps in one epoch.

In addition, shuffle is also a hyperparameter (more on this shortly).

I trained a neural network model with the above hyperparameter values and its output looks as follows.

You have probably seen this type of output before when calling the fit() method of a neural network model during training.

Let’s understand each term one by one, and, finally, the connection between them.

The connection between batch size, epochs and training steps

Neural networks are trained on large datasets with thousands or millions of samples (observations/instances). When the dataset is large, it would be time-consuming and computationally expensive to use the entire dataset for each gradient update during the training process. Sometimes, very large datasets will not fit in the computer’s memory.

As a solution for this, we use batches: portions of the dataset, rather than the entire dataset, to perform gradient updates during training.

A batch can contain many training instances (samples).

Batch size refers to the number of training instances in the batch.

For example, batch_size=128 means that there are 128 training instances in each batch.

Do not confuse batch size with the number of batches! The number of batches is calculated as follows.

No. of batches = ceil(Size of the entire dataset / batch size)

The number of batches depends on two factors: the size of the entire dataset and the batch size. Here, ceil rounds the result up, so 1 is added to the whole-number part only when the division leaves a fractional part.

Not clear? Let’s take an example.

Imagine that there are 60,000 instances in the training dataset. To calculate the number of batches, we simply divide the size of the dataset by batch size which is 128 here.

60,000 / 128 = 468.75

We got a fractional part! So, we need to add 1 to 468 to get the number of batches. If you do not get a fractional part, you don’t need to add 1.

No. of batches = 468 + 1 = 469

So, there are 469 batches in this example. That number was highlighted by an orange color box in the above output.

In this setting, 468 batches have 128 training instances each, and the remaining batch (the one that comes from the fractional part) has only 96 training instances [60,000 - (128 x 468)].
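The arithmetic above can be checked with a few lines of Python. This is only a sketch; the function name count_batches is hypothetical and not part of any library.

```python
import math


def count_batches(n_samples, batch_size):
    """Return (number of batches, size of the last batch)."""
    n_batches = math.ceil(n_samples / batch_size)
    # Every batch except the last one is full.
    last_batch_size = n_samples - batch_size * (n_batches - 1)
    return n_batches, last_batch_size


print(count_batches(60_000, 128))  # (469, 96)
```

When the dataset size is an exact multiple of the batch size, the last batch is simply a full batch.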

During training, batches of data are passed through the neural network and the error calculated by the loss function is propagated backward through the network to update the parameters of the network so that the network can make better predictions in the next steps.

Let’s discuss how this happens behind the scenes.

Imagine that the data is ready for training the model. It has 60,000 training instances and the batch size is 128.

Before drawing any batches from the dataset, the algorithm randomly shuffles the training data if we set shuffle=True in the fit() method.

Next, the algorithm starts drawing batches from the dataset. It takes the first 128 instances (the first batch) from the dataset, passes them through the model, calculates the average error and updates the parameters once (one gradient update). This completes one training step (also called an iteration).

More precisely, a training step (iteration) is one gradient update.

Then, the algorithm takes the second 128 instances (the second batch) from the dataset, passes them through the model, calculates the average error and updates the parameters once more (another gradient update). This completes another training step.

The algorithm keeps doing this procedure until all batches are drawn from the training dataset. That’s 469 times according to our example! That concludes one epoch in training. In an epoch, the entire dataset is shown to the model.
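To make the procedure concrete, here is a rough sketch of one epoch written in plain Python. It is not how Keras implements fit() internally; the generator name iterate_batches and the seed argument are assumptions made for this illustration, and the actual gradient update is only indicated by a comment.

```python
import random


def iterate_batches(n_samples, batch_size, shuffle=True, seed=None):
    """Yield one list of sample indices per batch, covering one epoch."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


steps = 0
seen = set()
for batch in iterate_batches(60_000, 128, seed=0):
    # A real training loop would pass this batch through the network,
    # compute the average loss and apply one gradient update here.
    steps += 1
    seen.update(batch)

print(steps)      # 469 training steps in the epoch
print(len(seen))  # 60000: every instance was shown to the model once
```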

More specifically, epochs refer to the number of times the model sees the entire dataset.

The number of gradient updates performed in an epoch is equal to the number of training steps (iterations) in that epoch.

It is very clear that epochs and iterations are two different things.

Now, we can write the following equation considering one epoch of training.

No. of training steps = No. of batches = No. of gradient updates

When the number of batches is 469, as in our example, there are 469 training steps or 469 gradient updates in one epoch of training.

In our example, there are 20 epochs. So far, we have completed only one epoch of training.

Next, the algorithm is ready for the second epoch. Again, it randomly shuffles the training data at the start of this epoch. Here too, the algorithm follows the same procedure as it did in the first epoch.

When the algorithm completes all 20 epochs (the entire dataset has been shown to the model 20 times), it concludes the entire training process.

The number of all gradient updates or the number of all training steps during the entire training process can be calculated as follows.

No. of ALL gradient updates = No. of batches x No. of epochs

So, in our example, the algorithm has performed 9380 (469x20) gradient updates or the algorithm has completed 9380 training steps during the entire training process.
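As a quick check, the total can be counted directly by looping over epochs and batches and then compared with the equation above. This is a sketch with the example's numbers hard-coded.

```python
import math

n_samples, batch_size, epochs = 60_000, 128, 20

# Count one gradient update for every batch drawn in every epoch.
total_updates = 0
for epoch in range(epochs):
    for start in range(0, n_samples, batch_size):
        total_updates += 1

# The same number from the equation above.
from_formula = math.ceil(n_samples / batch_size) * epochs

print(total_updates, from_formula)  # 9380 9380
```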

Determining the right batch size

When the batch size increases,

The algorithm performs more stable gradient updates, because each update averages the error over more instances.
The algorithm takes more time to complete each training step (iteration).
Each training step needs more memory, although there are fewer training steps in one epoch.

In tf.keras, batch size is specified by using the batch_size hyperparameter (argument) in the fit() method of the model.

The batch_size argument accepts an integer or None. When None or unspecified, it defaults to 32. Other popular values for batch_size are 16, 64, 128 and 256.

The minimum value for the batch size is 1, which uses a single training instance to perform each gradient update. Here, the number of gradient updates or the number of training steps or the number of batches in one epoch is equal to the size of the full training dataset!

The maximum value for the batch size is the size of the full training dataset. The entire training dataset will be used to perform one gradient update. Here, the number of gradient updates or the number of training steps or the number of batches in one epoch is equal to 1.

We can use any integer between the minimum and maximum. The number of gradient updates or the number of training steps or the number of batches in one epoch depends on the batch size we selected.
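The effect of the batch size on the number of steps in one epoch can be tabulated as follows. The sketch assumes the 60,000-instance dataset from the earlier example and a handful of batch sizes chosen for illustration.

```python
import math

n_samples = 60_000

# Steps (= batches = gradient updates) in one epoch for each batch size.
steps_by_batch_size = {
    batch_size: math.ceil(n_samples / batch_size)
    for batch_size in (1, 16, 32, 64, 128, 256, n_samples)
}

for batch_size, steps in steps_by_batch_size.items():
    print(f"batch_size={batch_size:>6}  steps per epoch={steps:>6}")
```

A batch size of 1 gives 60,000 steps, a batch size of 128 gives 469 steps, and a batch size of 60,000 gives a single step.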

Batch size and variants of gradient descent

Gradient descent is an iterative optimization algorithm that is used to train ML and DL models. It repeatedly updates the model parameters in the direction that reduces the loss function. The batch_size and epochs are among the main hyperparameters of the gradient descent algorithm. We specify them in the fit() method of the model, as I mentioned earlier.

When the batch size is equal to the size of the full training dataset, the gradient descent algorithm is called batch gradient descent. The model parameters are updated only once per epoch, so very few updates are performed, although each update has to process the entire dataset.

When the batch size is equal to 1, the gradient descent algorithm is called stochastic gradient descent. The model parameters are updated after every single training example, so one epoch involves as many updates as there are examples, which makes it slow on large datasets.

When the batch size is between 1 and the size of the full training dataset, the gradient descent algorithm is called mini-batch gradient descent.
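The three variants differ only in the batch size, which the following sketch tries to show. It fits a toy model y = w * x in plain Python; the data, the learning rate and the function name gradient_descent are all made up for this illustration and are not taken from Keras.

```python
import random


def gradient_descent(xs, ys, batch_size, epochs=1, lr=0.01, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error.

    Returns the learned weight and the number of gradient updates performed.
    """
    rng = random.Random(seed)
    w, updates = 0.0, 0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the batch-averaged squared error with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates


xs = [i / 100 for i in range(1, 101)]  # 100 toy inputs
ys = [2.0 * x for x in xs]             # toy targets from the rule y = 2x

results = {}
for name, batch_size in [("batch", len(xs)), ("stochastic", 1), ("mini-batch", 10)]:
    results[name] = gradient_descent(xs, ys, batch_size=batch_size)
    print(name, "updates in one epoch:", results[name][1])
# batch: 1 update, stochastic: 100 updates, mini-batch: 10 updates
```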

Determining the right number of epochs

In tf.keras, the number of epochs is specified by using the epochs hyperparameter (argument) in the fit() method of the model. It accepts an integer.

The algorithm needs an adequate number of epochs to complete the training process properly. Initially, we set the number of epochs to an integer like 5, 10, 20, 50 etc. and then monitor the train and validation losses against the number of epochs.

If you get a plot in which both train and validation losses are still decreasing at the final epoch, it is better to increase the number of epochs and monitor the losses again, because a further decrease may be possible.

At this point, the model may underfit the training data. When this happens, it performs well on neither the training data nor the validation data.

Now consider the following plot.

If you get a plot in which neither the train loss nor the validation loss decreases further after the 20th epoch, you have trained the model for longer than necessary. So, 20 epochs are adequate for training the model on this occasion.

Now consider the following plot.

If you get a plot in which the validation loss begins increasing after the 5th epoch, you should stop the training process at that epoch (early stopping). Otherwise, your model will overfit the training data. If that happens, the model will perform well on the training data but poorly on new, unseen data.

In general, increasing the number of epochs makes the entire training process more computationally expensive and time-consuming.
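The idea behind early stopping can be sketched in a few lines. The validation losses in this snippet are invented numbers that merely mimic the shape described above, and the function name early_stopping and its patience argument are hypothetical.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stopped_epoch), both counted from 1.

    Training stops once the validation loss has not improved on its best
    value for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch, stopped_epoch, wait = 0, 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, stopped_epoch


# Invented validation losses: they fall until epoch 5 and rise afterwards.
val_losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.47, 0.50, 0.54, 0.60, 0.66]
print(early_stopping(val_losses))  # (5, 8)
```

In tf.keras, the EarlyStopping callback (tf.keras.callbacks.EarlyStopping, with arguments such as monitor, patience and restore_best_weights) plays a similar role when it is passed to fit() through the callbacks argument.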

Determining the training steps in one epoch

In tf.keras, the number of training steps in one epoch is specified by the steps_per_epoch hyperparameter (argument) in the fit() method of the model. It accepts an integer or None. The default is None. You do not need to specify a value for this hyperparameter because, when it is None, the algorithm automatically calculates the value using the following equation.

No. of steps = ceil(Size of the entire dataset / batch size)

In fact, the number of training steps in one epoch is equal to the number of batches!

However, if we specify a value for the steps_per_epoch hyperparameter, that value will override the default one. For example, if we set steps_per_epoch=500, the algorithm will use 500 iterations (batches) in each epoch instead of the 469 iterations (batches) that were used in the above example.

This is the end of today’s post.

Please let me know if you’ve any questions or feedback.

Thank you so much for your continuous support! See you in the next article. Happy learning to everyone!

Rukshan Pramoditha 2022–08–26