DL Tutorial 4 — Feedforward Neural Networks and Backpropagation

Learn how feedforward neural networks are trained using the backpropagation algorithm.

Table of Contents
1. Introduction
2. What is a Feedforward Neural Network?
3. How Does a Feedforward Neural Network Work?
4. What is Backpropagation?
5. How Does Backpropagation Work?
6. Why is Backpropagation Important?
7. Challenges and Limitations of Backpropagation
8. Conclusion

Subscribe for FREE to get your 42 pages e-book: Data Science | The Comprehensive Handbook

Get step-by-step e-books on Python, ML, DL, and LLMs.

1. Introduction

In this tutorial, you will learn how feedforward neural networks are trained using the backpropagation algorithm. You will also learn the basic concepts and terminology of neural networks, such as neurons, weights, biases, activation functions, and loss functions. By the end of this tutorial, you will be able to implement a simple feedforward neural network in Python and use it to solve a classification problem.

Feedforward neural networks are one of the most widely used types of artificial neural networks. They are composed of multiple layers of interconnected nodes, or neurons, that perform mathematical operations on the input data and produce an output. By the universal approximation theorem, a feedforward network with a non-linear activation function and enough hidden neurons can approximate any continuous function on a compact (closed and bounded) input domain to arbitrary accuracy, although the theorem does not guarantee that training will actually find such a network. Feedforward networks are often used for tasks such as image recognition, natural language processing, and speech synthesis.

Backpropagation is the algorithm that allows feedforward neural networks to learn from data and adjust their parameters accordingly. It is based on the idea of calculating the error, or loss, of the network’s output and propagating it backwards through the network, updating the weights and biases of each neuron along the way. Backpropagation is an application of the chain rule of calculus, which allows us to compute the derivative of a complex function by multiplying the derivatives of its simpler components.

To follow this tutorial, you will need a basic understanding of Python and data analysis. You will also need to install the following libraries:

- NumPy: A library for scientific computing and working with arrays.
- Matplotlib: A library for plotting and visualizing data.
- Scikit-learn: A library for machine learning and data mining.

You can install these libraries using the pip command:

# Install the libraries
pip install numpy matplotlib scikit-learn

Ready to learn how feedforward neural networks and backpropagation work? Let’s get started!

2. What is a Feedforward Neural Network?

A feedforward neural network is a type of artificial neural network that consists of multiple layers of nodes, or neurons, that process the input data and produce an output. The term feedforward means that the data flows in one direction, from the input layer to the output layer, without any feedback loops or cycles. A feedforward neural network can be represented as a directed acyclic graph, where each node represents a neuron and each edge represents a connection between neurons.

A neuron is a basic unit of computation in a neural network. It receives one or more inputs, performs a weighted sum of them, adds a bias term, and applies a non-linear activation function to produce an output. The weights and biases are the parameters of the neuron that determine how it responds to the inputs. The activation function is a mathematical function that introduces non-linearity into the network, allowing it to learn complex patterns and functions.
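
As a minimal sketch of this computation, a single sigmoid neuron can be written in NumPy as follows. The input values, weights, and bias are made up for illustration, and the helper name `neuron` is hypothetical rather than part of any library.

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum of the inputs plus the bias, followed by the activation
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([1.0, 2.0])    # Two inputs (illustrative values)
w = np.array([0.5, -0.25])  # One weight per input (illustrative values)
b = 0.0                     # Bias term

# z = 0.5 * 1.0 - 0.25 * 2.0 + 0.0 = 0.0, so the output is sigmoid(0) = 0.5
output = neuron(x, w, b)
print(output)
```

Changing the weights or the bias shifts the weighted sum `z`, and therefore moves the output towards 0 or 1.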

A feedforward neural network can have one or more hidden layers between the input and output layers. The hidden layers are not directly connected to the external data, but they perform intermediate computations and transformations on the input data. The number and size of the hidden layers determine the complexity and capacity of the network. A network with more hidden layers and neurons can learn more complex functions, but it also requires more data and computational resources to train.
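
To make the notion of capacity a little more concrete, the sketch below counts the trainable parameters of a fully connected network from its layer sizes. The function name `count_parameters` and the second architecture are assumptions chosen for illustration.

```python
def count_parameters(layer_sizes):
    # Each pair of consecutive layers contributes a weight matrix
    # (n_in * n_out entries) and one bias per neuron in the next layer
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

small = count_parameters([2, 3, 2])        # 2*3 + 3 + 3*2 + 2 = 17
larger = count_parameters([2, 16, 16, 2])  # 2*16 + 16 + 16*16 + 16 + 16*2 + 2 = 354
print(small, larger)
```

Adding a second hidden layer and widening the layers grows the parameter count from 17 to 354, which is one reason larger networks need more data and computation to train.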

The output layer of a feedforward neural network produces the final output of the network, which can be a single value or a vector of values. The output layer can have different activation functions depending on the type of problem the network is trying to solve. For example, for a regression problem, where the network is trying to predict a continuous value, the output layer can have a linear activation function. For a classification problem, where the network is trying to assign a discrete label to the input, the output layer can have a softmax activation function, which produces a probability distribution over the possible classes.
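
The softmax function exponentiates its inputs, so large scores can overflow. A common remedy, sketched here under the assumption that the scores are stored one row per input vector, is to subtract the row-wise maximum before exponentiating. The name `stable_softmax` is hypothetical.

```python
import numpy as np

def stable_softmax(z):
    # Subtracting the row-wise maximum leaves the result unchanged
    # but keeps np.exp from overflowing for large inputs
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])  # np.exp(1000.0) alone would overflow
probs = stable_softmax(scores)
print(probs)
print(probs.sum(axis=1))  # Each row sums to 1
```

The second row has three equal scores, so each class receives a probability of one third.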

As a running example, consider a feedforward neural network with one input layer of two neurons, one hidden layer of three neurons, and one output layer of two neurons. The following code defines the parameters, the activation functions, and a small input data set for this network, and plots the data:

    # Import the libraries
    import numpy as np
    import matplotlib.pyplot as plt

    # Define the network architecture
    input_size = 2 # Number of input neurons
    hidden_size = 3 # Number of hidden neurons
    output_size = 2 # Number of output neurons

    # Define the network parameters
    W1 = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]) # Weights from input to hidden layer
    b1 = np.array([0.7, 0.8, 0.9]) # Biases of hidden layer
    W2 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]) # Weights from hidden to output layer
    b2 = np.array([0.7, 0.8]) # Biases of output layer

    # Define the activation functions
    def sigmoid(x):
        # Sigmoid function
        return 1 / (1 + np.exp(-x))

    def softmax(x):
        # Softmax function
        exp_x = np.exp(x)
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    # Define the input data
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) # Input data matrix
    y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]]) # Output data matrix

    # Plot the input data
    plt.scatter(X[:, 0], X[:, 1], c=np.argmax(y, axis=1), cmap=plt.cm.coolwarm)
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.title('Input data')
    plt.show()

The plot should show the input data, where each point represents an input vector with two features, x1 and x2. The color of the point indicates the output class, 0 or 1.

To compute the output of the network, we need to perform a forward pass, which involves the following steps:

- Multiply the input data matrix X by the weight matrix W1 and add the bias vector b1 to get the hidden layer input Z1.
- Apply the sigmoid activation function to Z1 to get the hidden layer output A1.
- Multiply the hidden layer output A1 by the weight matrix W2 and add the bias vector b2 to get the output layer input Z2.
- Apply the softmax activation function to Z2 to get the output layer output A2, which is also the final output of the network.

The following code shows how to perform a forward pass in Python:

    # Perform a forward pass
    Z1 = X.dot(W1) + b1 # Hidden layer input
    A1 = sigmoid(Z1) # Hidden layer output
    Z2 = A1.dot(W2) + b2 # Output layer input
    A2 = softmax(Z2) # Output layer output

    # Print the output
    print(A2)

The output is a matrix of shape (4, 2), where each row represents the probability distribution over the two classes for each input vector. For example, the first row [0.424, 0.576] means that the network assigns a 42.4% probability to the first class and a 57.6% probability to the second class for the input vector [0, 0]. The values below are rounded to three decimal places:

    [[0.424 0.576]
     [0.417 0.583]
     [0.421 0.579]
     [0.415 0.585]]

As you can see, the network produces nearly the same distribution for every input, so its predictions carry little information about the data. This is because the network has not been trained yet, and its parameters are arbitrary hand-picked values. To train the network, we need to use the backpropagation algorithm, which we will cover in Sections 4 and 5.

3. How Does a Feedforward Neural Network Work?

In this section, you will learn how a feedforward neural network works by following an example of a simple network with one input layer, one hidden layer, and one output layer. You will also learn the mathematical notation and formulas that are used to describe the network’s architecture and computations.

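Let’s start by defining the symbols and variables that we will use throughout this tutorial. They match the variable names in the code:

- X: the input data matrix, with one row per input vector and one column per feature.
- y: the true output matrix, with one one-hot encoded row per input vector.
- W1, b1: the weight matrix and bias vector of the hidden layer.
- W2, b2: the weight matrix and bias vector of the output layer.
- Z1, Z2: the linear combinations (pre-activations) of the hidden layer and the output layer.
- A1, A2: the activations of the hidden layer and the output layer. A2 is also the output of the network.

Using these symbols and variables, we can write the general formulas for the forward pass of the network, which consists of two steps for each layer:

- A linear combination of the previous layer's output: Z1 = X · W1 + b1 for the hidden layer, and Z2 = A1 · W2 + b2 for the output layer.
- A non-linear activation: A1 = sigmoid(Z1) for the hidden layer, and A2 = softmax(Z2) for the output layer.

The following code shows how to perform the forward pass of the network in Python: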

    # Import the libraries
    import numpy as np
    import matplotlib.pyplot as plt

    # Define the network architecture
    n_x = 2 # Number of input neurons
    n_h = 3 # Number of hidden neurons
    n_y = 2 # Number of output neurons

    # Define the network parameters
    W1 = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]) # Weights from input to hidden layer
    b1 = np.array([0.7, 0.8, 0.9]) # Biases of hidden layer
    W2 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]) # Weights from hidden to output layer
    b2 = np.array([0.7, 0.8]) # Biases of output layer

    # Define the activation functions
    def sigmoid(x):
        # Sigmoid function
        return 1 / (1 + np.exp(-x))

    def softmax(x):
        # Softmax function
        exp_x = np.exp(x)
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    # Define the input data
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) # Input data matrix
    y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]]) # Output data matrix

    # Perform a forward pass
    Z1 = X.dot(W1) + b1 # Hidden layer input
    A1 = sigmoid(Z1) # Hidden layer output
    Z2 = A1.dot(W2) + b2 # Output layer input
    A2 = softmax(Z2) # Output layer output

    # Print the output
    print(A2)

Because the parameters and inputs are the same as in the previous section, the output is again a matrix of shape (4, 2), where each row represents the probability distribution over the two classes for each input vector (values rounded to three decimal places):

    [[0.424 0.576]
     [0.417 0.583]
     [0.421 0.579]
     [0.415 0.585]]

This is how a feedforward neural network works. It takes the input data and passes it through multiple layers of neurons, each performing a linear combination and a non-linear activation, until it reaches the output layer, which produces the final output of the network. In the next section, we will learn how to train the network using the backpropagation algorithm.
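
The same two steps generalize to any number of layers. The sketch below is one possible way to express that, assuming each layer is stored as a hypothetical `(W, b, activation)` tuple and using randomly drawn weights purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Row-wise softmax, shifted by the maximum for numerical stability
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def forward(X, layers):
    # layers is a list of (W, b, activation) tuples, applied in order
    A = X
    for W, b, activation in layers:
        Z = A.dot(W) + b   # Linear combination
        A = activation(Z)  # Non-linear activation
    return A

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(2, 4)), np.zeros(4), sigmoid),  # Input -> first hidden layer
    (rng.normal(size=(4, 3)), np.zeros(3), sigmoid),  # First -> second hidden layer
    (rng.normal(size=(3, 2)), np.zeros(2), softmax),  # Second hidden -> output layer
]
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
out = forward(X, layers)
print(out.shape)  # One row of class probabilities per input vector
```

Only the list of layers changes when the architecture changes; the loop itself stays the same.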

4. What is Backpropagation?

The goal of backpropagation is to minimize the loss function, which measures how well the network’s output matches the true output. The loss function depends on the type of problem the network is trying to solve. For example, for a binary classification problem, we can use the binary cross-entropy loss function, which is defined as:

    # Define the binary cross-entropy loss function
    def binary_cross_entropy_loss(y_true, y_pred, eps=1e-12):
        # y_true is the true output data matrix
        # y_pred is the network output data matrix
        # eps keeps the predictions away from 0 and 1 so that np.log stays finite
        y_pred = np.clip(y_pred, eps, 1 - eps)
        # m is the number of input vectors
        m = y_true.shape[0]
        # Compute the element-wise loss
        loss = - (y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        # Compute the average loss over all input vectors
        loss = np.sum(loss) / m
        return loss

The binary cross-entropy loss function takes the true output data matrix y_true and the network output data matrix y_pred as inputs, and returns a scalar value that represents the average loss over all input vectors. The lower the loss, the better the network’s performance.
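
To see how the loss responds to good and bad predictions, here is a small self-contained sketch. The labels and predicted probabilities are made up, and the clipping constant `eps` is an assumption that guards against taking the logarithm of zero.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so that np.log never receives exactly 0
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.mean(losses)

y_true = np.array([1.0, 0.0, 1.0])

good_pred = np.array([0.9, 0.1, 0.8])  # Probabilities close to the true labels
bad_pred = np.array([0.2, 0.9, 0.3])   # Probabilities far from the true labels

good_loss = binary_cross_entropy(y_true, good_pred)
bad_loss = binary_cross_entropy(y_true, bad_pred)
print(good_loss, bad_loss)  # The first value is much smaller than the second
```

Predictions that sit close to the true labels give a loss near zero, while confident mistakes are penalized heavily.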

To minimize the loss function, we need to find the optimal values of the network’s parameters, W1, b1, W2, and b2, that make the network’s output as close as possible to the true output. To do this, we use a technique called gradient descent, which involves the following steps:

1. Initialize the network’s parameters randomly.
2. Perform a forward pass to compute the network’s output and the loss function.
3. Perform a backward pass to compute the gradients of the loss function with respect to the network’s parameters.
4. Update the network’s parameters by subtracting a fraction of the gradients from the current values.
5. Repeat steps 2 to 4 until the loss function reaches a minimum or a convergence criterion is met.

The gradient of a function is a vector that points in the direction of the steepest ascent of the function. The gradient of the loss function with respect to the network’s parameters tells us how much the loss function changes when we change the parameters by a small amount. By subtracting a fraction of the gradients from the current values, we move the parameters in the opposite direction of the gradient, which is the direction of the steepest descent of the loss function. This way, we can gradually reduce the loss function and improve the network’s performance.

The fraction of the gradients that we subtract from the current values is called the learning rate, and it determines how fast or slow the network learns. A high learning rate can speed up the learning process, but it can also cause the network to overshoot the minimum and diverge. A low learning rate can prevent the network from diverging, but it can also slow down the learning process and get stuck in local minima. Choosing a suitable learning rate is an important and challenging task in neural network training.
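
The effect of the learning rate is easy to see on a toy problem. The sketch below runs gradient descent on the one-dimensional function f(w) = w², which is an assumption chosen for illustration because its gradient, 2w, is known exactly.

```python
def gradient_descent(learning_rate, w0=1.0, steps=20):
    # Minimize f(w) = w**2, whose gradient is 2 * w
    w = w0
    for _ in range(steps):
        grad = 2 * w
        w = w - learning_rate * grad
    return w

# Try a small, a moderate, and a too-large learning rate
final_w = {lr: gradient_descent(lr) for lr in [0.01, 0.4, 1.1]}
for lr, w in final_w.items():
    print(lr, w)
```

With a learning rate of 0.01, w only moves from 1.0 to about 0.67 in 20 steps. With 0.4 it essentially reaches the minimum at 0. With 1.1 every step overshoots the minimum and the magnitude of w grows, so the run diverges.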

The backward pass is the most crucial and complex part of the backpropagation algorithm. It involves applying the chain rule of calculus to compute the gradients of the loss function with respect to the network’s parameters. The chain rule allows us to decompose the gradient of a composite function into the product of the gradients of its simpler components. For example, if we have a function h(x) = f(g(x)), then the gradient of h with respect to x is the product of the gradient of f with respect to g and the gradient of g with respect to x. The formula is: h’(x) = f’(g(x)) * g’(x).
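
As a quick sanity check of the chain rule, the sketch below compares the analytic derivative of h(x) = sin(x²) with a finite-difference estimate. The choice of f(u) = sin(u) and g(x) = x² is an assumption made only for this example.

```python
import math

def g(x):
    return x ** 2

def f(u):
    return math.sin(u)

def h(x):
    return f(g(x))

def h_prime(x):
    # Chain rule: f'(g(x)) * g'(x) = cos(x**2) * 2x
    return math.cos(g(x)) * 2 * x

x = 1.5
eps = 1e-6
numerical = (h(x + eps) - h(x - eps)) / (2 * eps)  # Central finite difference
analytic = h_prime(x)
print(analytic, numerical)  # The two values should agree closely
```

Backpropagation applies the same rule repeatedly, once for every layer between the loss and the parameter of interest.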

In the case of a feedforward neural network, the loss function is a composite function of the network’s parameters and the activation functions. Therefore, we can use the chain rule to compute the gradients of the loss function with respect to the network’s parameters by multiplying the gradients of the loss function with respect to the network’s output, the gradients of the network’s output with respect to the output layer input, the gradients of the output layer input with respect to the hidden layer output, and so on, until we reach the input layer. This process is called backpropagation because we start from the output layer and move backwards through the network, computing the gradients layer by layer.

In the next section, we will see how to implement the backpropagation algorithm in Python and apply it to train our simple feedforward neural network.

5. How Does Backpropagation Work?

In this section, you will learn how backpropagation works by following an example of a simple network with one input layer, one hidden layer, and one output layer. You will also learn the mathematical notation and formulas that are used to describe the network’s learning process and parameter updates.

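Let’s start by defining the symbols and variables that we will use in this section, in addition to the variables from the forward pass. They match the variable names in the code:

- dZ2, dZ1: the gradients of the loss with respect to the pre-activations Z2 and Z1.
- dW2, db2, dW1, db1: the gradients of the loss with respect to the weights and biases.
- alpha: the learning rate.

Using these symbols and variables, we can write the general formulas for the backpropagation algorithm, which consists of two steps for each layer:

- Compute the gradient with respect to the layer's pre-activation. For the softmax output layer combined with the cross-entropy loss, dZ2 = A2 - y. For the sigmoid hidden layer, dZ1 = (dZ2 · W2^T) * A1 * (1 - A1), where * denotes element-wise multiplication.
- Compute the gradients with respect to the layer's parameters: dW2 = A1^T · dZ2 and db2 is the sum of dZ2 over its rows. Likewise, dW1 = X^T · dZ1 and db1 is the sum of dZ1 over its rows.

After computing the gradients for each layer, we can update the parameters of the network using the following formulas: W1 = W1 - alpha * dW1, b1 = b1 - alpha * db1, W2 = W2 - alpha * dW2, and b2 = b2 - alpha * db2.

The following code shows how to train the network with the backpropagation algorithm for 100 epochs in Python: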

# Import the libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the network architecture
n_x = 2 # Number of input neurons
n_h = 3 # Number of hidden neurons
n_y = 2 # Number of output neurons

# Define the network parameters
W1 = np.random.rand(n_x, n_h) # Weights from input to hidden layer
b1 = np.random.rand(n_h) # Biases of hidden layer
W2 = np.random.rand(n_h, n_y) # Weights from hidden to output layer
b2 = np.random.rand(n_y) # Biases of output layer

# Define the activation functions
def sigmoid(x):
    # Sigmoid function
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # Softmax function
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

# Define the input data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) # Input data matrix
y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]]) # Output data matrix

# Define the loss function
def cross_entropy(y_true, y_pred):
    # Cross-entropy function
    return -np.sum(y_true * np.log(y_pred))

# Define the learning rate
alpha = 0.1

# Define the number of epochs
epochs = 100

# Define the list to store the loss values
loss_values = []

# Loop over the epochs
for epoch in range(epochs):
    # Forward pass
    Z1 = X.dot(W1) + b1 # Linear combination from input to hidden layer
    A1 = sigmoid(Z1) # Activation at hidden layer
    Z2 = A1.dot(W2) + b2 # Linear combination from hidden to output layer
    A2 = softmax(Z2) # Activation at output layer
    # Backward pass
    dZ2 = A2 - y # Gradient at output layer
    dW2 = A1.T.dot(dZ2) # Gradient at W2
    db2 = np.sum(dZ2, axis=0) # Gradient at b2
    dZ1 = dZ2.dot(W2.T) * A1 * (1 - A1) # Gradient at hidden layer
    dW1 = X.T.dot(dZ1) # Gradient at W1
    db1 = np.sum(dZ1, axis=0) # Gradient at b1
    # Weight update
    W2 = W2 - alpha * dW2 # Update W2
    b2 = b2 - alpha * db2 # Update b2
    W1 = W1 - alpha * dW1 # Update W1
    b1 = b1 - alpha * db1 # Update b1
    # Compute the loss (A2 was computed before this epoch's weight update)
    loss = cross_entropy(y, A2)
    # Print the loss every 10 epochs
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, loss {loss}")
    # Append the loss to the list
    loss_values.append(loss)

# Plot the loss values
plt.plot(loss_values)
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
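
A common way to gain confidence in hand-derived gradients is a numerical gradient check. The sketch below is an illustration rather than part of the training code: it recomputes the backward-pass formulas from the loop above and compares them with central finite differences of the summed cross-entropy loss. The helper names `loss_fn`, `analytic_grads`, and `numerical_grad` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def loss_fn(W1, b1, W2, b2, X, y):
    # Forward pass followed by the summed cross-entropy loss
    A1 = sigmoid(X.dot(W1) + b1)
    A2 = softmax(A1.dot(W2) + b2)
    return -np.sum(y * np.log(A2))

def analytic_grads(W1, b1, W2, b2, X, y):
    # Same backward-pass formulas as in the training loop
    A1 = sigmoid(X.dot(W1) + b1)
    A2 = softmax(A1.dot(W2) + b2)
    dZ2 = A2 - y
    dW2 = A1.T.dot(dZ2)
    dZ1 = dZ2.dot(W2.T) * A1 * (1 - A1)
    dW1 = X.T.dot(dZ1)
    return dW1, dW2

def numerical_grad(f, W, eps=1e-5):
    # Central finite differences, perturbing one entry of W at a time
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        loss_plus = f()
        W[idx] = original - eps
        loss_minus = f()
        W[idx] = original  # Restore the entry
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W1, b1 = rng.random((2, 3)), rng.random(3)
W2, b2 = rng.random((3, 2)), rng.random(2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]], dtype=float)

f = lambda: loss_fn(W1, b1, W2, b2, X, y)
dW1, dW2 = analytic_grads(W1, b1, W2, b2, X, y)
err_W1 = np.max(np.abs(dW1 - numerical_grad(f, W1)))
err_W2 = np.max(np.abs(dW2 - numerical_grad(f, W2)))
print(err_W1, err_W2)  # Both should be very small if the formulas are right
```

If either error were large, it would point to a mistake in the corresponding backward-pass formula.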

6. Why is Backpropagation Important?

Backpropagation is important because it is the key algorithm that enables feedforward neural networks to learn from data and improve their performance. Without backpropagation, neural networks would not be able to adjust their parameters and reduce their error, and they would remain stuck with random and ineffective weights and biases. Backpropagation allows neural networks to adapt to the data and find the optimal values of the parameters that minimize the loss function.

Backpropagation is also important because it is a general and powerful algorithm that can be applied to any feedforward neural network whose activation functions and loss function are differentiable (at least almost everywhere, as with ReLU), regardless of its architecture. Backpropagation can handle networks with multiple hidden layers, different types of neurons, and various kinds of problems, such as regression and classification. Backpropagation is based on the chain rule of calculus, which is a universal and simple rule that can be used to compute the derivative of any composite function built from differentiable parts.

Backpropagation is also important because it is the foundation of many advanced and modern neural network techniques, such as convolutional neural networks, recurrent neural networks, and deep learning. These techniques use backpropagation as the core algorithm to train their networks and achieve state-of-the-art results in various domains, such as computer vision, natural language processing, and speech recognition. Backpropagation is the common thread that connects these diverse and complex neural network models.

In summary, backpropagation is important because it is the algorithm that makes feedforward neural networks learn, adapt, and perform. Backpropagation is the essence of neural network training and the basis of many neural network applications.

7. Challenges and Limitations of Backpropagation

Backpropagation is a powerful and general algorithm that enables feedforward neural networks to learn from data and improve their performance. However, backpropagation also has some challenges and limitations that need to be addressed and overcome. In this section, we will discuss some of the common challenges and limitations of backpropagation and how to deal with them.

One of the main challenges of backpropagation is choosing the appropriate hyperparameters for the network and the learning process. Hyperparameters are the parameters that are not learned by the network, but are set by the user before the training. Some of the important hyperparameters are:

- The network architecture, such as the number and size of the hidden layers, the type and number of neurons, and the activation functions.
- The learning rate, which determines how fast or slow the network learns.
- The number of epochs, which determines how long the network trains.
- The batch size, which determines how many input vectors are processed at a time.
- The regularization technique, which prevents the network from overfitting the data, for example by adding a penalty term to the loss function.

Choosing the appropriate hyperparameters is not a trivial task, and it requires a lot of trial and error, experimentation, and evaluation. There is no universal formula or rule for setting the hyperparameters, and they depend on the type and complexity of the problem, the size and quality of the data, and the computational resources available. A common approach for finding the optimal hyperparameters is to use a grid search or a random search, which involves testing different combinations of hyperparameters and selecting the one that achieves the best performance on a validation set.
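
A grid search can be sketched in a few lines. To keep the example short and fast, it tunes a logistic regression model instead of the network from this tutorial. The synthetic data, the helper `train_and_score`, and the grid values are all assumptions made for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable data (made up for this sketch)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def train_and_score(learning_rate, epochs):
    # Train a logistic regression model with gradient descent,
    # then return its accuracy on the validation set
    w = np.zeros(2)
    b = 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X_train.dot(w) + b)))
        grad_w = X_train.T.dot(p - y_train) / len(y_train)
        grad_b = np.mean(p - y_train)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    val_pred = (X_val.dot(w) + b) > 0
    return np.mean(val_pred == (y_val == 1))

# Candidate values for each hyperparameter
grid = {
    "learning_rate": [0.001, 0.1, 1.0],
    "epochs": [10, 100],
}

# Evaluate every combination and keep the validation accuracy
results = {}
for learning_rate, epochs in itertools.product(grid["learning_rate"], grid["epochs"]):
    results[(learning_rate, epochs)] = train_and_score(learning_rate, epochs)

best = max(results, key=results.get)
print(best, results[best])
```

A random search follows the same pattern but samples combinations at random instead of enumerating all of them. Scikit-learn's `GridSearchCV` automates this loop for estimators that follow its interface.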

Another challenge of backpropagation is dealing with the gradient-related issues, such as the vanishing gradient problem and the exploding gradient problem. These problems occur when the gradients of the loss function with respect to the network’s parameters become either too small or too large, causing the network to stop learning or diverge. The vanishing gradient problem happens when the gradients become smaller and smaller as they propagate backwards through the network, especially when the network has many hidden layers and uses activation functions that saturate, such as the sigmoid function. The exploding gradient problem happens when the gradients become larger and larger as they propagate backwards through the network, especially when the network has large weights and biases and uses activation functions that do not saturate, such as the linear function.
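
The vanishing gradient problem can be illustrated with the sigmoid derivative alone. The sketch below ignores the weight matrices, which is a simplifying assumption, and only tracks the largest possible product of sigmoid derivatives across several layers.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-10, 10, 2001)
max_grad = sigmoid_grad(z).max()  # The derivative peaks at z = 0 with value 0.25
print(max_grad)

# Upper bound on the product of sigmoid derivatives across n layers
for n_layers in [1, 5, 10, 20]:
    print(n_layers, max_grad ** n_layers)
```

Even in the best case the factor contributed by each sigmoid layer is at most 0.25, so after ten layers the product is already below one millionth.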

There are several ways to deal with the gradient-related issues, such as:

- Using activation functions that do not saturate for positive inputs, such as the rectified linear unit (ReLU) function, which has a constant gradient of 1 for positive inputs and 0 for negative inputs.
- Using weight initialization techniques that prevent the weights from being too large or too small, such as the Xavier initialization or the He initialization, which scale the initial weights according to the size of the layers (the biases are usually initialized to zero).
- Using gradient clipping techniques that limit the magnitude of the gradients, such as norm clipping or value clipping, which prevent the gradients from exceeding a certain norm or value.
- Using batch normalization, which normalizes the inputs of each layer to have zero mean and unit variance over a mini-batch. It was originally proposed as a way to reduce internal covariate shift, and in practice it improves the stability of the gradients.
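
Two of these remedies fit in a few lines each. The sketch below shows He initialization and gradient clipping by L2 norm. The helper names `he_init` and `clip_by_norm`, the layer sizes, and the clipping threshold are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # He initialization: zero-mean Gaussian weights with variance 2 / n_in
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def clip_by_norm(grad, max_norm):
    # Rescale the gradient if its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

W = he_init(256, 128)
print(W.std())  # Close to sqrt(2 / 256), about 0.088

big_grad = np.array([30.0, 40.0])      # L2 norm is 50
clipped = clip_by_norm(big_grad, 5.0)  # Rescaled to norm 5, i.e. [3, 4]
print(clipped)
```

Norm clipping keeps the direction of the gradient and only shrinks its length, which is why it is often preferred over clipping each value separately.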

A third challenge of backpropagation is handling local minima and saddle points, which are points where the gradient of the loss function is zero but the loss is not necessarily at its lowest value. A local minimum is a point where the loss function has a lower value than its neighboring points, but not the lowest value in the entire domain. A saddle point is a point where the gradient is zero in every direction, but the loss curves upwards along some directions and downwards along others, so it is neither a minimum nor a maximum. Both local minima and saddle points can slow down or trap gradient descent and prevent the network from reaching the global minimum, which is the point where the loss function has the lowest value in the entire domain.
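
A classic example of a saddle point is the origin of f(x, y) = x² - y². The sketch below, which uses this function purely as an illustration, checks that the gradient vanishes there while the curvature has opposite signs along the two axes.

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 - y ** 2

def grad_f(p):
    x, y = p
    return np.array([2 * x, -2 * y])

origin = np.array([0.0, 0.0])
gradient = grad_f(origin)  # Zero vector: the origin is a critical point

# The Hessian of f is constant; its eigenvalues give the curvature along each axis
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])
eigenvalues = np.linalg.eigvalsh(hessian)  # One negative and one positive value

print(gradient, eigenvalues)
```

Moving along x increases f and moving along y decreases it, so plain gradient descent started exactly at the origin would not move, while any small perturbation along y lets it escape.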

There are several ways to handle the local minima and the saddle points, such as:

- Using a stochastic gradient descent technique, such as mini-batch gradient descent or online gradient descent, which updates the network’s parameters using a random subset of the input data rather than the entire data. This introduces some noise and variability to the gradients and helps the network escape from local minima and saddle points.
- Using a momentum technique, such as the momentum method or the Nesterov accelerated gradient method, which adds a fraction of the previous update to the current update rather than using only the current gradient. This accelerates the network’s movement along consistent descent directions and helps it overcome local minima and saddle points.
- Using an adaptive learning rate technique, such as the AdaGrad method, the RMSProp method, or the Adam method, which adjusts the learning rate for each parameter according to the history of its gradients rather than using a fixed learning rate. This helps the network converge faster and more efficiently.
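
As an illustration of the momentum idea, the sketch below compares plain gradient descent with one common formulation of classical momentum on the toy function f(w) = w². The function, the starting point, and the hyperparameter values are assumptions chosen for this example.

```python
def minimize(learning_rate, momentum, w0=5.0, steps=100):
    # Minimize f(w) = w**2 (gradient 2 * w) with a momentum update
    w, velocity = w0, 0.0
    for _ in range(steps):
        grad = 2 * w
        # Keep a fraction of the previous update and add the new gradient step
        velocity = momentum * velocity - learning_rate * grad
        w = w + velocity
    return w

w_plain = minimize(learning_rate=0.01, momentum=0.0)     # Plain gradient descent
w_momentum = minimize(learning_rate=0.01, momentum=0.9)  # Same learning rate with momentum
print(abs(w_plain), abs(w_momentum))
```

With the same small learning rate, the run with momentum ends much closer to the minimum at 0 after 100 steps, because consecutive updates in the same direction accumulate.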

These are some of the common challenges and limitations of backpropagation and how to deal with them. However, backpropagation is still a very effective and widely used algorithm for training feedforward neural networks and solving various problems. Backpropagation is the backbone of many neural network techniques and applications, and it is constantly being improved and refined by researchers and practitioners.

8. Conclusion

In this tutorial, you learned how feedforward neural networks and backpropagation work. You learned the basic concepts and terminology of neural networks, such as neurons, weights, biases, activation functions, and loss functions. You learned how to implement a simple feedforward neural network in Python and use it to solve a binary classification problem. You learned how to use the backpropagation algorithm to compute the gradients of the loss function with respect to the network’s parameters and update them using the gradient descent technique. You also learned some of the common challenges and limitations of backpropagation and how to deal with them.

Feedforward neural networks and backpropagation are powerful and general techniques that can be applied to many problems and domains. They are the foundation of many advanced and modern neural network models, such as convolutional neural networks, recurrent neural networks, and deep learning. By understanding how feedforward neural networks and backpropagation work, you can gain a deeper insight into the inner workings and principles of neural network training and application.

Thank you for reading and happy learning!

The complete tutorial list is here:

Deep Learning Tutorial Series: 50 Step-by-Step Lessons [FREE][2024]

Updated weekly — 03.12.2023

