Question 1: What is weight initialization for a neural network model?

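Weight initialization sets the weights of a neural network to starting values before model training begins. The choice affects how quickly the model trains and how well it performs.
We can specify the initial weights as all zeros, all ones, a constant number, or a distribution. In practice a random distribution is usually preferred, because when every weight in a layer starts at the same constant, the neurons in that layer receive identical updates and learn the same thing.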

In the Python code below, we use a TensorFlow initializer to draw the initial weights from a normal distribution with a mean of 0 and a standard deviation of 1.

# Import tensorflow
import tensorflow as tf

# Set random normal initializer
initializer = tf.keras.initializers.RandomNormal(mean=0., stddev=1.)

# Apply the initializer to the layer
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)

Question 2: What is backpropagation?

Backpropagation is a key step in training a neural network model.
The goal of backpropagation is to work out how much each weight contributed to the error, so that the weights can be updated to minimize the loss function.
Backpropagation takes the error from the preceding forward propagation and feeds it backward through the layers, using the chain rule to compute the gradient of the loss for each weight. The optimizer then uses these gradients to update the weights. This process is repeated until the neural network model converges.
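
As a minimal sketch of this loop, assume a single neuron with one weight `w`, one bias `b`, a linear activation, and a squared-error loss. The toy data, the learning rate, and all names below are illustrative assumptions, not part of any library. Each iteration runs a forward pass, applies the chain rule to get the gradients, and moves the parameters against them.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0          # initial weight and bias
learning_rate = 0.05     # hypothetical value chosen for this toy problem


def mean_loss(w, b):
    """Mean squared error of the neuron over the toy dataset."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


initial_loss = mean_loss(w, b)

for step in range(2000):
    grad_w, grad_b = 0.0, 0.0
    for x, y in zip(xs, ys):
        y_pred = w * x + b          # forward propagation
        error = y_pred - y          # error fed backward
        # Chain rule: d(error^2)/dw = 2 * error * x, d(error^2)/db = 2 * error
        grad_w += 2 * error * x / len(xs)
        grad_b += 2 * error / len(xs)
    # Weight update: step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mean_loss(w, b)
print(round(w, 4), round(b, 4), final_loss < initial_loss)
```

In a real network the same chain rule is applied layer by layer, and frameworks such as TensorFlow compute these gradients automatically.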

Question 3: What is the loss function of a neural network model?

A loss function measures the quality of a machine learning model by comparing the actual and the predicted target values. For a supervised model, its calculation requires the ground truth value of the target.

When training a neural network model, we need to choose the loss function based on the type of prediction task.

Regression: For a neural network model with a continuous target variable,

👉 mean_squared_error is the most common choice.

👉 mean_squared_logarithmic_error is the mean squared error (MSE) computed on the logarithms of the actual and predicted values. It is usually used when the dependent variable spans a wide range of values.

👉 mean_absolute_error is robust for data with outliers because the errors are not squared.

Binary Classification: For a neural network model with two discrete labels as the target variable,

👉 binary_crossentropy is the most common choice. It's the same as the log loss in logistic regression.

👉 hinge is for a target variable encoded as -1 and 1. It gives zero loss to a prediction that has the same sign as the label with enough margin, and penalizes the prediction if the signs are different.

👉 squared_hinge is for the target variable of -1 and 1. As the name suggests, it is the squared value of the hinge loss function.

Multi-Class Classification: For a neural network model with more than two discrete labels as the target variable,

👉 categorical_crossentropy is the same idea as binary_crossentropy but for multiple categories. It expects one-hot encoded labels.

👉 sparse_categorical_crossentropy computes the same loss from integer-encoded labels, so it is good for a dependent variable with a lot of categories because no large one-hot matrix is needed.

👉 kullback_leibler_divergence measures how much a predicted probability distribution differs from the target distribution.
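
To make a few of these losses concrete, here is a minimal sketch in plain Python. These are hand-written versions for illustration only, not the Keras implementations, and the function names simply mirror the loss names above. The sample values are made up.

```python
import math


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Log loss for labels in {0, 1}; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def hinge(y_true, y_score):
    """Hinge loss for labels in {-1, 1}."""
    return sum(max(0.0, 1 - t * s) for t, s in zip(y_true, y_score)) / len(y_true)


def categorical_crossentropy(one_hot_label, probs):
    """Cross-entropy for ONE sample with a one-hot encoded label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))


def sparse_categorical_crossentropy(label_index, probs):
    """Same loss for ONE sample, but the label is an integer class index."""
    return -math.log(probs[label_index])


# One large outlier (100 vs. 4) dominates MSE far more than MAE
y_true = [1.0, 2.0, 3.0, 100.0]
y_pred = [1.0, 2.0, 3.0, 4.0]
mse = mean_squared_error(y_true, y_pred)   # 2304.0
mae = mean_absolute_error(y_true, y_pred)  # 24.0
print(mse, mae)
```

The last two functions return the same value for the same sample; only the label encoding differs.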

Question 4: What are batch size and epoch in a neural network model?

Batch size is the number of training samples processed in each forward propagation and backpropagation before the model weights are updated.
An epoch is one complete pass through the whole training dataset, and the number of epochs is how many such passes the training makes.

For example, if a neural network model has a batch size of 10 and a training sample size of 200, the model weights will be updated 20 times in each epoch.
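
The arithmetic can be sketched as below. The helper names are hypothetical, and the sketch assumes the final, smaller batch is kept rather than dropped when the sample size is not a multiple of the batch size.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)


def total_updates(n_samples, batch_size, epochs):
    """Number of weight updates over the whole training run."""
    return updates_per_epoch(n_samples, batch_size) * epochs


print(updates_per_epoch(200, 10))   # 20
print(total_updates(200, 10, 5))    # 100
print(updates_per_epoch(205, 10))   # 21, the last batch has only 5 samples
```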

Question 5: What are the commonly used activation functions for a neural network model?

An activation function is a mathematical transformation applied to the output of a neuron. Nonlinear activation functions enable a neural network model to capture complex nonlinear patterns in the training dataset.

The most commonly used activation functions are listed below:

The linear activation function is usually used in the output layer of a neural network model with a continuous target variable.
The sigmoid activation function is usually used in the output layer of a neural network model with a binary classification target variable.
The softmax activation function is usually used in the output layer of a neural network model with a multi-class classification target variable.
The tanh activation function is usually used in the hidden layer of a neural network model. It's a scaled and shifted version of the sigmoid function (tanh(x) = 2 * sigmoid(2x) - 1), ranging from -1 to 1.
ReLU is the rectified linear unit activation function. It is usually used in the hidden layer of a neural network model. The formula for ReLU is max(0, x), which sets negative values to 0.
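
The formulas above can be sketched in plain Python as follows. These are illustrative scalar versions with assumed function names; deep learning libraries provide vectorized equivalents.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """tanh written as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    return max(0.0, x)


def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))                              # 0.5
print(relu(-3.0), relu(2.5))                     # 0.0 2.5
print(abs(tanh_from_sigmoid(0.7) - math.tanh(0.7)) < 1e-12)
print(softmax([1.0, 2.0, 3.0]))
```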

Question 6: What are the differences between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent?

Gradient descent is an optimization algorithm used to find the minimum of a function. It works by iteratively moving in the direction of the negative gradient, which is the direction that reduces the value of the function the most. Gradient descent is a common algorithm used in machine learning to find the optimal parameters for a model. It can be used for both regression and classification models.

There are three commonly used gradient descent types: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. The main difference between the three variants is the amount of data used each time the weights are updated.

Batch gradient descent uses the entire dataset to compute the gradient for each parameter update, so the weight updates are stable, but it requires high memory for large datasets.
Stochastic gradient descent (SGD) updates the model weights using one record at a time, so it requires less memory. However, stochastic gradient descent is less stable. The frequent updates of the weights can produce noisy gradients, causing the loss to fluctuate instead of decreasing smoothly.
Mini-batch gradient descent lies between batch gradient descent and stochastic gradient descent: it uses a subset of the training dataset to compute the gradient at each step, which balances the stability of batch gradient descent against the low memory use of stochastic gradient descent. Batch size is an important hyperparameter to tune in mini-batch gradient descent.
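
A minimal sketch of the three variants, assuming a one-feature linear model fitted with squared error: the only thing that changes between them is the `batch_size` argument. The function name, toy data, and learning rate are all illustrative assumptions.

```python
import random


def gradient_descent(xs, ys, batch_size, epochs=1, learning_rate=0.01, seed=0):
    """Fit y = w * x + b and return (w, b, number_of_weight_updates)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n_updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of the squared error over the current batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            n_updates += 1
    return w, b, n_updates


xs = [i / 10 for i in range(20)]     # 20 toy samples
ys = [3 * x + 1 for x in xs]

_, _, batch_updates = gradient_descent(xs, ys, batch_size=len(xs))  # batch
_, _, sgd_updates = gradient_descent(xs, ys, batch_size=1)          # stochastic
_, _, mini_updates = gradient_descent(xs, ys, batch_size=5)         # mini-batch

print(batch_updates, sgd_updates, mini_updates)  # 1 20 4 updates per epoch
```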

Question 7: What is batch normalization in a neural network model?

Batch normalization normalizes the inputs of a layer using the mean and variance computed from each training mini-batch.

The pros of batch normalization are:

Batch normalization stabilizes training. It allows us to use much higher learning rates and be less careful about initialization.
Batch normalization regularizes the neural network model because a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer produces deterministic values for a given training example. The authors of the batch normalization paper mentioned that in some cases it eliminates the need for Dropout.
It trains faster in the sense that it achieves the same accuracy in less time and with fewer epochs.

The con of batch normalization is:

Because the normalization is performed at the batch dimension, it does not work well for small batch sizes because the mean and variance for the batch are not representative of the dataset. A general rule of thumb is to have at least 16 samples in one batch.
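
A minimal sketch of the training-time computation for a single feature, assuming the usual learnable scale `gamma` and shift `beta` (set here to 1 and 0) and a small `eps` for numerical stability. The function name and sample values are illustrative, and the sketch leaves out the running averages that are used at inference time.

```python
import math


def batch_normalize(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]


activations = [2.0, 4.0, 6.0, 8.0]       # one feature across a batch of 4
normalized = batch_normalize(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
print(normalized)
print(norm_mean, norm_var)               # close to 0 and 1
```

With only a handful of samples, the batch mean and variance are noisy estimates of the dataset statistics, which is why small batch sizes hurt.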

Question 8: What are exploding and vanishing gradients?

Vanishing and exploding gradients can happen in a deep multi-layer artificial neural network model or a recurrent neural network (RNN) model. In both cases, the weights of the model cannot be updated properly.

Vanishing gradients refer to the scenario in which the gradients get smaller and smaller as the model backpropagates through the layers, so the weights of the earlier layers barely change.
Exploding gradients refer to the scenario in which the gradients get larger and larger as the model backpropagates through the layers, so the weight updates become too large and unstable.
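
A rough way to see why depth matters is to treat backpropagation through a chain of single-neuron layers as a repeated multiplication, where each layer contributes one factor (the weight times the activation derivative). This is a simplified sketch under that assumption, with a hypothetical helper name, not a full network.

```python
def backprop_factor(per_layer_factor, n_layers):
    """Multiply the same per-layer factor across n_layers, as the chain rule does."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= per_layer_factor
    return factor


# The sigmoid derivative is at most 0.25, so with weights of 1 the gradient
# shrinks by at least 4x per layer.
vanishing = backprop_factor(0.25, 20)

# If each layer multiplies the gradient by 2, it doubles at every layer.
exploding = backprop_factor(2.0, 20)

print(vanishing)   # about 9.1e-13
print(exploding)   # 1048576.0
```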

We can identify vanishing and exploding gradients by monitoring the training process.

If vanishing gradients happen, we can observe that larger updates are applied to the weights of the later layers and smaller or even no updates are applied to the weights of the earlier layers. The model learns slowly, and training can stall with poor model performance.
If exploding gradients happen, we can observe unstable updates from iteration to iteration and large updates applied to the weights of the earlier layers. The model weights and loss can quickly become NaN.

There are several solutions for fixing vanishing gradients.

We can use ReLU as the activation function and avoid using sigmoid or tanh as the activation function. This is because the derivative of ReLU is 1 for positive inputs, so it does not shrink the gradient as it passes backward, whereas the derivative of sigmoid is at most 0.25 and the derivative of tanh falls toward 0 for large inputs.
We can also make the model structure simpler by including fewer hidden layers.
Another way of reducing vanishing gradients is to initialize weights from a uniform or normal distribution whose variance is scaled to the layer size, so that the variance of the activations stays about the same across all layers. In TensorFlow, this is implemented as glorot_normal and glorot_uniform for kernel_initializer.
Lastly, we can use an optimizer with momentum (e.g., Adam) that factors in the accumulated previous gradients.
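
For the Glorot initializers, the scaling depends on the number of inputs (`fan_in`) and outputs (`fan_out`) of the layer. Here is a minimal sketch of the formulas in plain Python. The helper names are assumptions for illustration, and the sampling below uses Python's `random` module rather than TensorFlow.

```python
import math
import random


def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn uniformly from [-limit, limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def glorot_normal_stddev(fan_in, fan_out):
    """Standard deviation used by the normal variant."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def glorot_uniform_weights(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out weight matrix as nested lists."""
    rng = random.Random(seed)
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


print(glorot_uniform_limit(3, 3))      # 1.0
print(glorot_uniform_limit(256, 128))  # wider layers get smaller weights
weights = glorot_uniform_weights(4, 3)
```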

To fix exploding gradients, we can use the following methods:

Use gradient clipping to cap the gradients at a threshold, and use the capped gradients to update the weights.
Setting the weight initializer to glorot_normal or glorot_uniform in a TensorFlow model can also help reduce exploding gradients.
We can also use L2 regularization to shrink the weights and prevent exploding gradients.
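
Gradient clipping comes in two common flavors: clipping each gradient value to a range, and rescaling the whole gradient vector when its norm is too large. The sketch below implements both in plain Python with assumed function names and made-up gradient values. Keras optimizers expose the same ideas through their `clipvalue` and `clipnorm` arguments.

```python
import math


def clip_by_value(gradients, threshold):
    """Cap every gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in gradients]


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


gradients = [30.0, -40.0]                       # L2 norm is 50
value_clipped = clip_by_value(gradients, 1.0)   # [1.0, -1.0]
norm_clipped = clip_by_norm(gradients, 5.0)     # direction kept, norm scaled to 5

print(value_clipped)
print(norm_clipped)
```

Clipping by norm keeps the direction of the gradient, while clipping by value can change it.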

Question 9: What is zero-shot learning?

Zero-shot learning (ZSL) refers to building a model and using it to make predictions on tasks that the model was not explicitly trained to do.

For example, if we would like to classify millions of news articles into different topics, building a traditional multi-class classification model would be very costly because manually labeling the news topics takes a lot of time. Zero-shot text classification is able to make class predictions without explicitly building a supervised classification model using a labeled dataset. Zero-shot text classification is commonly built on a Natural Language Inference (NLI) model, which compares two sequences to see if they contradict each other, entail each other, or are neutral (neither contradict nor entail). Each candidate label is turned into a hypothesis sentence, and the label whose hypothesis is most strongly entailed by the text is the prediction.

Please check out my previous tutorial Zero-shot Topic Modeling with Deep Learning Using Python Hugging Face for the Python code for a zero-shot model.
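
The NLI framing can be sketched as follows: the text is the premise, each candidate label is inserted into a hypothesis template, and the label with the highest entailment score wins. Everything here is a hypothetical illustration. In particular, `toy_entailment_score` is only a word-overlap placeholder standing in for a real pretrained NLI model, and the template wording is an assumption.

```python
def toy_entailment_score(premise, hypothesis):
    """PLACEHOLDER for an NLI model: counts words shared by the two sequences.

    A real system would return the entailment probability from a
    pretrained NLI model instead.
    """
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().rstrip(".").split())
    return len(premise_words & hypothesis_words)


def zero_shot_classify(text, candidate_labels, entailment_score,
                       template="This text is about {}."):
    """Score one (premise, hypothesis) pair per label and return the best label."""
    scores = {
        label: entailment_score(text, template.format(label))
        for label in candidate_labels
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


text = "The sports team won the final match"
labels = ["sports", "politics", "weather"]
predicted, scores = zero_shot_classify(text, labels, toy_entailment_score)
print(predicted, scores)
```

No labeled training data for the three topics is used; changing the candidate labels does not require retraining anything.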

Question 10: What is transfer learning?

Transfer learning is a machine learning technique that reuses a pretrained model, often a large deep learning model, on a new task. It usually includes the following steps:

Select a pretrained model that is suitable for the new task. For example, if the new task includes text from different languages, a multi-language pretrained model needs to be selected.
Keep all the weights and biases from the pretrained model except for the output layer. This is because the output layer of the pretrained model is built for the pretraining tasks, and it needs to be replaced with one for the new task.
Initialize the new head for the new task with random weights and biases. For a sentiment analysis transfer learning (aka fine-tuning) model on a pretrained BERT model, we will remove the head that predicts masked words, and replace it with a head for the two sentiment analysis labels, positive and negative.
Retrain the model for the new task with the new data, utilizing the pretrained weights and biases. Because the weights and biases store the knowledge learned by the pretrained model, the fine-tuned transfer learning model can build on that knowledge and does not need to learn from scratch.
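
The steps above can be sketched with a toy model in plain Python. Everything here is made up for illustration: the "pretrained" weights, the tiny dataset, and all names are assumptions. The sketch also freezes the base and trains only the new head, which is one common option; full fine-tuning would update the base weights as well.

```python
import copy
import math
import random

# Step 1: a hypothetical "pretrained" model with a base and an old task head
pretrained = {
    "base": {"weights": [[1.0, -1.0], [0.5, 0.5]], "biases": [0.0, 0.0]},
    "head": {"weights": [0.3, -0.2], "bias": 0.1},   # built for the old task
}


def base_forward(x, base):
    """Frozen feature extractor: one dense layer with tanh activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(base["weights"], base["biases"])]


def head_forward(features, head):
    """New task head: a sigmoid output for binary classification."""
    z = sum(w * f for w, f in zip(head["weights"], features)) + head["bias"]
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(model, data):
    total = 0.0
    for x, y in data:
        p = head_forward(base_forward(x, model["base"]), model["head"])
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)


# Step 2: keep the pretrained base, drop the old head
model = {"base": copy.deepcopy(pretrained["base"])}

# Step 3: randomly initialize a new head for the new task
rng = random.Random(0)
model["head"] = {"weights": [rng.gauss(0, 0.1) for _ in range(2)], "bias": 0.0}

# Step 4: train on the new task's data (only the head is updated here)
new_task_data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([2.0, 0.5], 1), ([0.5, 2.0], 0)]
initial_loss = log_loss(model, new_task_data)
learning_rate = 0.5
for _ in range(200):
    for x, y in new_task_data:
        features = base_forward(x, model["base"])
        error = head_forward(features, model["head"]) - y  # d(log loss)/dz
        model["head"]["weights"] = [w - learning_rate * error * f
                                    for w, f in zip(model["head"]["weights"], features)]
        model["head"]["bias"] -= learning_rate * error
final_loss = log_loss(model, new_task_data)

predictions = [round(head_forward(base_forward(x, model["base"]), model["head"]))
               for x, _ in new_task_data]
print(initial_loss, final_loss, predictions)
```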

To learn how to implement transfer learning in Python, please check out my tutorials on transfer learning using TensorFlow, transformers trainer, and PyTorch.

Top 10 Deep Learning Concept Interview Questions and Answers