Master-Level Questions in Deep Learning

Following the success of my master-level questions in data science, I have decided to publish another series of master-level questions, this time focused on deep learning.
Note that there may be more than one correct answer for each question (but there is always at least one correct answer).
- Which of the following can solve the dying ReLU problem? (a) Leaky ReLU (b) Low learning rate (c) Dropout (d) Batch normalization
- What is the benefit of using momentum optimization? (a) Allows gradient descent to escape from local minima. (b) Effectively scales the learning rate so that it acts equally across all dimensions. (c) Makes the path to the minimum error smoother. (d) Momentum-based SGD is faster than vanilla SGD.
- Which of the following is true about dropout? (a) Dropout can only be applied to the hidden layers. (b) Dropout can be compared to the bagging technique in machine learning. (c) At test time, dropout is applied with inverted keep probability. (d) A higher dropout rate increases the variance of the network.
- Which of the following SGD optimizers is based on both adaptive learning rates and momentum? (a) AdaGrad (b) RMSProp (c) Adam (d) Nadam
- Which of the following techniques can help prevent a model from overfitting? (a) Batch normalization (b) Data augmentation (c) Early stopping (d) Adding momentum
- Which of the following statements is false regarding padding in CNNs? (a) Padding is used both in convolutional and pooling layers. (b) In valid padding, we drop the part of the image where the filter does not fit. (c) Zero padding is used to preserve the spatial size of the image. (d) Zero padding is used to preserve the resolution of the image.
- A convolutional layer with 7 kernels of size 5 × 5, with zero padding and stride of 3 is applied to an RGB image of size 224 × 224. What will be the dimensions of the data that the next layer will receive? (a) 74 × 74 × 3 (b) 75 × 75 × 5 (c) 74 × 74 × 7 (d) 75 × 75 × 7
- Which of the following statements is true about Xavier initialization? (a) It helps reduce the vanishing gradient problem. (b) It can help the input signals reach deep into the network. (c) It is only used in fully-connected networks. (d) The initial weights are drawn from a Gaussian distribution.
- Which kind of activation function is typical for a recurrent layer in an RNN? (a) Sigmoid (b) Hyperbolic tangent (c) ReLU (d) Leaky ReLU
- Which of the following statements about variational autoencoders is true? (a) Variational autoencoders learn a continuous latent space that is easy to sample from. (b) A variational autoencoder is able to calculate the sample probability p(xᵢ) for a given data sample xᵢ. (c) Variational autoencoders optimize a lower bound on the log likelihood of the data. (d) Variational autoencoders can generate new data by sampling from the learned latent space.
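If you want to check your arithmetic on the convolutional-layer question, the sketch below applies the standard output-size formula, floor((n + 2p − k) / s) + 1, for a square input of side n, kernel size k, stride s, and p pixels of padding on each side. The function names and the example numbers are my own illustrative choices and are deliberately different from the ones in the question. Keep in mind that the output depth equals the number of kernels, regardless of how many channels the input has.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial size after convolving a square input of side n with a k x k kernel.

    Uses the standard formula floor((n + 2*padding - k) / stride) + 1,
    where `padding` is the number of pixels added on each side.
    """
    return (n + 2 * padding - k) // stride + 1


def conv_output_shape(n, k, num_kernels, stride=1, padding=0):
    """Output shape (height, width, depth) of a convolutional layer.

    The depth is the number of kernels; the number of input channels
    does not appear in the result.
    """
    side = conv_output_size(n, k, stride, padding)
    return (side, side, num_kernels)


# Illustrative numbers (not the ones from the quiz):
# a 32 x 32 RGB image, 16 kernels of size 3 x 3, stride 2, no padding.
print(conv_output_shape(32, 3, 16, stride=2))  # (15, 15, 16)
```

Some frameworks also offer a "same" padding mode, in which the padding is chosen automatically and the output side is ceil(n / s) instead, so it is worth checking which convention a question or a library has in mind.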
The solutions to these questions can be found here.






