
Summary

The website presents a data science project that employs a ResNet-18 deep learning model with Grad-CAM visualization for explainable scene classification from images, aiming to address challenges in climate, agriculture, water, and biodiversity.

Abstract

The project detailed on the website focuses on developing an explainable AI system for scene classification using a Residual Neural Network (ResNet-18) architecture. The model is trained to classify images into six categories: Building, Forest, Glacier, Mountain, Sea, and Street. To enhance the interpretability of the model's decisions, Gradient-Weighted Class Activation Mapping (Grad-CAM) is utilized, providing visual explanations of the model's focus areas within the images. The dataset comprises 14,034 training images and 3,000 test images, ensuring a balanced distribution across the six classes. The project also addresses model generalization through image augmentation and employs techniques such as early stopping to prevent overfitting. The model's performance is evaluated using metrics such as accuracy, precision, recall, and F1-score, with a reported test accuracy of 70%. The Grad-CAM visualizations demonstrate the model's ability to concentrate on relevant image regions, offering insights into its decision-making process and highlighting areas for potential improvement.

Opinions

  • The project emphasizes the importance of explainability in AI, particularly for complex models like deep CNNs.
  • The authors suggest that the use of Grad-CAM can help build trust in AI systems by making their decision-making processes more transparent.
  • The project acknowledges the challenge of class imbalance and takes steps to ensure the dataset is balanced, which is crucial for fair model performance.
  • The authors express confidence in the model's ability to generalize well, thanks to the use of image augmentation and regularization techniques.
  • The reported accuracy of 70% is considered a solid baseline, with room for improvement through further hyperparameter tuning and model architecture refinements.
  • The project identifies specific areas where the model may struggle, such as distinguishing between visually similar classes like Glacier and Mountain.
  • The authors propose future work directions, including exploring other methods for explaining CNN outputs and applying transfer learning to enhance model performance.

【Data Science Project】 Explainable AI: Scene Classification with ResNet-18 and Grad-CAM Visualization

Introduction

Scene Classification is a special task in Computer Vision. Unlike Object Classification, which focuses on classifying prominent objects in the foreground, Scene Classification uses the layout of objects within the scene, in addition to the ambient context, for classification (King et al., 2017). This project could practically be used to detect the type of scenery in satellite images, and from that we would be able to work toward solutions to challenges relating to climate, agriculture, water, and biodiversity.

Explainable AI, or XAI, is an emerging field in machine learning that aims to address how the black-box decisions of AI systems are made, i.e. to explain to humans how an AI system arrived at a decision. XAI is used to describe an AI model in terms of both its expected impact and its potential biases (IBM, n.d.). Explainability can help developers ensure that the system is working as expected; it might be necessary to meet regulatory standards, or it might be important in allowing those affected by a decision to challenge or change that outcome (IBM, n.d.).

There are many approaches to explain CNN outputs such as: Activations Visualization, Vanilla Gradients, Occlusion Sensitivity, CNN Fixations, Class Activation Mapping (CAM), and finally Gradient-Weighted Class Activation Mapping (Grad-CAM). In this project, we choose to use Grad-CAM.

Problem Statement

In this project, we will build and train a Deep Convolutional Neural Network (CNN) with residual blocks to detect the type of scenery in an image. In addition, we will use a technique known as Gradient-Weighted Class Activation Mapping (Grad-CAM) to visualize the regions of the input that drive the predictions and help us explain how our CNN model thinks and makes decisions.

Methodology

(Ahmed, n.d.)

We are going to feed the image (resized to 256x256) into a ResNet-18 model to classify the scene into one of 6 classes: Building, Forest, Glacier, Mountain, Sea, and Street.

Our feature field (X):

  • image: (256, 256, 3) in RGB color

Our predicted field (y):

  • Building
  • Forest
  • Glacier
  • Mountain
  • Sea
  • Street

Metrics

The performance of the model is evaluated based on:

  1. Accuracy score
  2. Precision: TP / (TP + FP), to determine how good the model is at avoiding false positives for each class
  3. Recall: TP / (TP + FN), or sensitivity, to determine how good the model is at detecting the positives
  4. F1: harmonic mean of precision and recall
  5. Confusion matrix: to observe True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN)

Data Description

The dataset contains 14,034 images in the Train folder and 3,000 images in the Test folder. Each folder has 6 sub-folders belonging to the 6 categories: Building, Forest, Glacier, Mountain, Sea, and Street.

Data Exploration

The packages and libraries we will be using are:
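The original post embeds the import cell as a gist; below is a minimal set of imports consistent with the rest of the walkthrough. The exact list, and the Train/Test folder layout assumed in the counting snippet, are my assumptions:

```python
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from PIL import Image
import cv2
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# count images per split (assumed layout: Train/<class>/*.jpg, Test/<class>/*.jpg)
train_count = sum(len(files) for _, _, files in os.walk('Train'))
test_count = sum(len(files) for _, _, files in os.walk('Test'))
print('Number of train images :', train_count)
print('Number of test images :', test_count)
```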

Number of train images : 14034 
Number of test images : 3000

Data Visualization

Let’s visualize 5 sample images from every single class: that will be 6 rows, one for each class, and 5 columns, one for each sample.
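A plotting sketch for this grid (the folder layout and file ordering are assumptions):

```python
import os
import matplotlib.pyplot as plt
from PIL import Image

class_names = ['buildings', 'forest', 'glacier', 'mountain', 'sea', 'street']
fig, axes = plt.subplots(6, 5, figsize=(12, 14))

for row, cls in enumerate(class_names):
    folder = os.path.join('Train', cls)
    for col, fname in enumerate(sorted(os.listdir(folder))[:5]):
        ax = axes[row, col]
        ax.imshow(Image.open(os.path.join(folder, fname)))
        ax.set_xticks([]); ax.set_yticks([])
        if col == 0:
            ax.set_ylabel(cls)   # label each row with its class

plt.tight_layout()
plt.show()
```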

We observe that Building, Forest, and Street are fairly distinguishable from the rest. However, Glacier and Mountain look very similar, and some of the Sea and Glacier samples also include mountains. This could be a source of confusion for the model.

Let’s see if our dataset is balanced between the classes.

Class_name 
['buildings', 'forest', 'glacier', 'mountain', 'sea', 'street']
Number of images in buildings = 2191 
Number of images in forest = 2271 
Number of images in glacier = 2404 
Number of images in mountain = 2512 
Number of images in sea = 2274 
Number of images in street = 2382

Our dataset is balanced, so our results will not be biased or skewed toward any class.

Image Augmentation

Now we are going to perform data augmentation and create data generators.

To build a powerful image classifier, image augmentation is usually required to boost the performance of deep networks. Image augmentation artificially creates training images through different kinds of processing, or combinations of processing, such as random rotations, shifts, shears, and flips of each training instance. The purpose is to improve the model's generalization capability: the model gets to see many different variations of the images, so it generalizes better and avoids overfitting. To achieve this, ImageDataGenerator is used. It will automatically label all the data inside the Building, Forest, Glacier, Mountain, Sea, and Street folders, so the data is easily ready to be passed to the neural network. Only the train set is augmented.

Next, flow_from_directory is used to read the data from the directories. It is called on train_datagen to create the object train_generator, on test_datagen to create the object test_generator, and again on train_datagen to create the object validation_generator. The validation data comes from the same train data set, but we set subset = 'validation' instead of subset = 'training'.

Since this is a multi-class classification problem, we set class_mode = 'categorical'.
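A sketch of the generator setup. The specific augmentation parameters, batch size, and folder paths are assumptions; the validation split of roughly 15% matches the 11,932 / 2,102 counts below:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# augment the train set and carve ~15% of it out as a validation subset
train_datagen = ImageDataGenerator(rescale=1/255.,
                                   rotation_range=15,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   shear_range=0.1,
                                   horizontal_flip=True,
                                   validation_split=0.15)
test_datagen = ImageDataGenerator(rescale=1/255.)   # the test set is not augmented

train_generator = train_datagen.flow_from_directory(
    'Train', target_size=(256, 256), batch_size=32,
    class_mode='categorical', subset='training')
validation_generator = train_datagen.flow_from_directory(
    'Train', target_size=(256, 256), batch_size=32,
    class_mode='categorical', subset='validation')
test_generator = test_datagen.flow_from_directory(
    'Test', target_size=(256, 256), batch_size=32,
    class_mode='categorical')
```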

Found 11932 images belonging to 6 classes.
Found 2102 images belonging to 6 classes.
Found 3000 images belonging to 6 classes.

Convolutional Neural Network Model (CNN) and Residual Blocks

Convolutional Neural Network Model (CNN)

In terms of DL, the convolutional neural network (CNN) is the leading tool used to analyze visual images. A CNN architecture is composed of convolutional layers with ReLU, pooling layers, and lastly fully connected Dense layers. As observed in the figure, the input image gets smaller and smaller as it progresses through the network, but it also gets deeper and deeper in terms of feature maps.

Basic CNN model architecture

Since we train these artificial neural networks using a technique known as gradient descent, an issue known as the vanishing gradient problem arises as we keep adding one convolutional layer + pooling layer after another. When we have a very deep neural network, the gradient eventually becomes very, very small and vanishes, so we are no longer able to train effectively and the performance of the network drops dramatically.

As researchers make deeper and more complex CNNs by adding more layers, the bulk of layers makes the networks increasingly difficult to train: accuracy starts to saturate and then degrades. It has been found that there is a maximum threshold for depth with the traditional convolutional neural network model.

Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error (He et al., 2016).

And so ResNet was developed to help solve this degradation problem.

Residual Network (ResNet)

ResNet, short for Residual Network, is a special type of neural network that was introduced in 2015 by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun in their paper “Deep Residual Learning for Image Recognition”.

Residual Block vs. Plain

In traditional “plain” neural networks, each layer feeds only into the next layer. In a network with residual blocks, each layer feeds into the next layer and also skips ahead via a connection 2–3 layers away. In a very simple form, we have the input X. We apply various convolutions to X to get F(X), and at the same time we pass along X as is (without being processed through convolutions); this is known as the X identity. In the end, we add up F(X) and X. By doing so, we overcome the vanishing gradient issue, and we can stack hundreds or even thousands of layers and still achieve compelling performance.

Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts (He et al., 2016).

As we can see, the deeper 34-layer plain net (left) has higher validation error than the shallower 18-layer plain net, while the situation is reversed with residual learning: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). The 34-layer ResNet exhibits considerably lower training error and generalizes to the validation data, therefore addressing the degradation problem.

Build Residual Convolutional Neural Network (ResNet) Model

We will build a ResNet-18 with the architecture as follows. This is adapted from the Guided Project by Ryan Ahmed.

(Ahmed, n.d.)

What we see in the left image is the basic CNN architecture, composed of convolutional layers Conv2D with ReLU activation, down-sampling pooling layers MaxPool2D, and a Flatten before the fully connected Dense layer with a softmax activation function. To create a ResNet-18 model, we also add 4 blocks of RES-BLOCK in between the 2 pooling layers MaxPool2D and AveragePooling2D.

A RES-BLOCK consists of a CONVOLUTION BLOCK and 2 IDENTITY BLOCKs, each with the architecture below:

(Ahmed, n.d.)

Notice there is nothing in the Short Path of the IDENTITY BLOCK; the INPUT is passed along as is.

At the end of each block, we sum up the Main Path and the Short Path outputs.

Let’s define the code for res_block, which has 3 parts:

  • Convolutional_block
  • Identity Block 1
  • Identity Block 2

Convolutional_block contains a Main Path and a Short Path, while Identity Block 1 and Identity Block 2 only have the Main Path defined. We add X and X_copy at the end of every block, as in the sketch below.
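A sketch of res_block under those constraints. The bottleneck layout and layer names are chosen so that the reported parameter counts and the res_5_identity_2_c layer referenced later both line up with this post; the exact kernel sizes are assumptions:

```python
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, Add

def res_block(X, filters, stage):
    """One RES-BLOCK = 1 convolutional block + 2 identity blocks."""
    f1, f2, f3 = filters

    # --- Convolutional block: Main Path and Short Path are both convolved ---
    X_copy = X
    X = Conv2D(f1, (1, 1), strides=(2, 2), name=f'res_{stage}_conv_a')(X)
    X = BatchNormalization()(X)
    X = Activation('relu')(X)
    X = Conv2D(f2, (3, 3), padding='same', name=f'res_{stage}_conv_b')(X)
    X = BatchNormalization()(X)
    X = Activation('relu')(X)
    X = Conv2D(f3, (1, 1), name=f'res_{stage}_conv_c')(X)
    X = BatchNormalization()(X)
    # Short Path: 1x1 conv so X_copy matches the Main Path's shape
    X_copy = Conv2D(f3, (1, 1), strides=(2, 2), name=f'res_{stage}_conv_copy')(X_copy)
    X_copy = BatchNormalization()(X_copy)
    X = Add()([X, X_copy])               # F(X) + X
    X = Activation('relu')(X)

    # --- Identity blocks 1 and 2: the Short Path passes X through unchanged ---
    for i in (1, 2):
        X_copy = X
        X = Conv2D(f1, (1, 1), name=f'res_{stage}_identity_{i}_a')(X)
        X = BatchNormalization()(X)
        X = Activation('relu')(X)
        X = Conv2D(f2, (3, 3), padding='same', name=f'res_{stage}_identity_{i}_b')(X)
        X = BatchNormalization()(X)
        X = Activation('relu')(X)
        X = Conv2D(f3, (1, 1), name=f'res_{stage}_identity_{i}_c')(X)
        X = BatchNormalization()(X)
        X = Add()([X, X_copy])           # identity shortcut, no convolution
        X = Activation('relu')(X)
    return X
```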

Model Architecture

input_shape = (256, 256, 3), with 3 channels for color images.

ZeroPadding2D: in order to allow a residual layer to span convolutional layers with multiple dimensions, zero-padding the output of the layers is required to maintain a consistent size with the residual.

Conv2D: convolutional layers apply a filter to the input and create a feature map that summarizes the presence of detected features in the input. I add one Conv2D layer with a filter/kernel size of (7, 7) and stride (2, 2).

kernel_initializer = glorot_uniform is the default kernel initializer. It draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units) (Keras docs).

BatchNormalization: to accelerate the training with higher learning rates. This is done right before any ReLU activation.

activation = 'relu' to set negative values to zero.

MaxPooling2D: a pooling layer is sandwiched between two successive convolutional layers to reduce the spatial size of the convolved features/parameters in the network. MaxPooling is the most common pooling method; it reduces the image size by keeping the most important feature in each patch. Here I use MaxPooling2D with a pool size of (7, 7) and stride (2, 2), meaning each spatial dimension is roughly halved (divided by the stride of 2).

res_block is added 4 times following the Conv2D, BatchNormalization, and MaxPooling2D.

AveragePooling2D: while MaxPooling calculates the maximum value for each patch of the feature map, AveragePooling calculates the average value for each patch.

Flatten to convert the feature maps into a 1D array per image: if it receives input data X of shape (batch_size, h, w, c), it effectively computes X.reshape(batch_size, -1).

Dense: the flattened data that comes out of the convolutions is then fed to the fully connected layers, which consist of one Dense layer with 6 neurons for the 6 classes Building, Forest, Glacier, Mountain, Sea, and Street, using activation = 'softmax' since this is a multi-class classification problem.
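Putting the pieces together. The filter counts per stage (and the stage numbering 2 through 5, which produces the res_5_identity_2_c layer used later) are my reconstruction, chosen because they reproduce the parameter totals reported just below:

```python
from tensorflow.keras.layers import (Input, ZeroPadding2D, Conv2D, BatchNormalization,
                                     Activation, MaxPooling2D, AveragePooling2D,
                                     Flatten, Dense)
from tensorflow.keras.models import Model

input_shape = (256, 256, 3)

X_input = Input(input_shape)
X = ZeroPadding2D((3, 3))(X_input)

# Stage 1: Conv -> BatchNorm -> ReLU -> MaxPool
X = Conv2D(64, (7, 7), strides=(2, 2))(X)
X = BatchNormalization()(X)
X = Activation('relu')(X)
X = MaxPooling2D((7, 7), strides=(2, 2))(X)

# Stages 2-5: one RES-BLOCK each, doubling the filters at every stage
X = res_block(X, (64, 64, 256), stage=2)
X = res_block(X, (128, 128, 512), stage=3)
X = res_block(X, (256, 256, 1024), stage=4)
X = res_block(X, (512, 512, 2048), stage=5)

X = AveragePooling2D((4, 4), name='Averagea_Pooling')(X)   # 4x4x2048 -> 1x1x2048
X = Flatten(name='Flatten')(X)
X = Dense(6, activation='softmax', name='Dense_final')(X)

model = Model(inputs=X_input, outputs=X, name='ResNet18')
model.summary()
```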

Total params: 19,952,262
Trainable params: 19,909,894
Non-trainable params: 42,368

Since this model is large, I won’t include the model summary here. For the full model summary, check out my Github notebook.

Compile Model

loss = 'categorical_crossentropy' because this is a multi-class classification problem.

optimizer = 'adam', an adaptive implementation of mini-batch gradient descent, is used.

The accuracy metric is used to evaluate the model.

Train Model

We use fit_generator to fit the model on batches yielded by train_generator.

validation_data = validation_generator uses validation set to validate training.

epochs = 5 trains the model for only 5 passes over the training data, kept small due to the time and resources it takes to train this complex model.

callbacks with EarlyStopping: stop training if there is no further improvement.

patience = 15: a regularization technique, meaning that the model will stop training if it doesn’t see any improvement in val_loss for 15 epochs. We also set mode = 'min' since we want to minimize loss.
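A sketch of the compile-and-fit cell; the checkpoint callback is inferred from the "saving model to weights.hdf5" lines in the log below:

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# stop if val_loss shows no improvement for 15 epochs; keep only the best weights
earlystopping = EarlyStopping(monitor='val_loss', mode='min', patience=15)
checkpointer = ModelCheckpoint(filepath='weights.hdf5', monitor='val_loss',
                               save_best_only=True, verbose=1)

history = model.fit_generator(train_generator,
                              validation_data=validation_generator,
                              epochs=5,
                              callbacks=[checkpointer, earlystopping])
```

In TF 2.x, plain model.fit accepts generators directly; fit_generator is the older name used in this post.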

Epoch 1/5
372/372 [==============================] - 3543s 10s/step - loss: 0.7946 - accuracy: 0.7061 - val_loss: 4.0651 - val_accuracy: 0.3514
Epoch 00001: val_loss improved from inf to 4.06513, saving model to weights.hdf5
Epoch 2/5
372/372 [==============================] - 272s 731ms/step - loss: 0.6516 - accuracy: 0.7640 - val_loss: 0.9017 - val_accuracy: 0.6899
Epoch 00002: val_loss improved from 4.06513 to 0.90166, saving model to weights.hdf5
Epoch 3/5
372/372 [==============================] - 270s 726ms/step - loss: 0.5687 - accuracy: 0.7947 - val_loss: 0.8589 - val_accuracy: 0.6942
Epoch 00003: val_loss improved from 0.90166 to 0.85886, saving model to weights.hdf5
Epoch 4/5
372/372 [==============================] - 269s 722ms/step - loss: 0.5306 - accuracy: 0.8139 - val_loss: 1.3859 - val_accuracy: 0.6332
Epoch 00004: val_loss did not improve from 0.85886
Epoch 5/5
372/372 [==============================] - 272s 731ms/step - loss: 0.4942 - accuracy: 0.8250 - val_loss: 1.5417 - val_accuracy: 0.5308
Epoch 00005: val_loss did not improve from 0.85886

Model Evaluation

Now we are going to assess the performance of the trained model. Keras measures the loss and accuracy at the end of every epoch, and all of accuracy, val_accuracy, loss, and val_loss can be accessed through the history object.

373/373 [==============================] - 199s 534ms/step - loss: 0.8251 - accuracy: 0.6991
Train loss & accuracy: [0.8250985741615295, 0.6991283893585205]
94/94 [==============================] - 1877s 20s/step - loss: 0.8385 - accuracy: 0.7003
Test loss & accuracy: [0.8385239839553833, 0.7003333568572998]

Train accuracy and test accuracy are similar (both about 70%), so our model is not overfit. The train and validation curves follow each other closely; accuracy improves over time and loss decreases over time.
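The curves come from the history object returned by fit; a plotting sketch:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.title('Accuracy'); plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.title('Loss'); plt.legend()
plt.show()
```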

Make Prediction

  • We need to assign label names to the corresponding indexes: Building is 0, Forest is 1, Glacier is 2, Mountain is 3, Sea is 4, and Street is 5.
  • Open the image using PIL
  • Resize the image to (256, 256)
  • Append image to the image list
  • Convert images to array
  • Normalize images
  • Reshape images to 4D array
  • Use predict to make prediction on the image list.
  • Use argmax to get the image labels

Now we can compute our test accuracy with accuracy_score between original, our ground truth, and prediction, our model's predictions.
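A sketch of the whole prediction pipeline; test_filenames and test_labels are hypothetical names standing in for however the test paths and ground-truth indices are gathered in the notebook:

```python
import numpy as np
from PIL import Image
from sklearn.metrics import accuracy_score

image_list, original = [], []
for fname, label in zip(test_filenames, test_labels):     # hypothetical lists
    img = Image.open(fname).resize((256, 256))            # open with PIL, resize
    image_list.append(np.asarray(img))                    # append to image list
    original.append(label)                                # ground-truth index 0-5

images = np.asarray(image_list, dtype='float32') / 255.0  # convert + normalize
images = images.reshape(-1, 256, 256, 3)                  # reshape to a 4D array

prediction = np.argmax(model.predict(images), axis=1)     # predict + argmax
print('Test Accuracy :', accuracy_score(original, prediction))
```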

Test Accuracy : 0.7053333333333334

With 5 epochs of training, our model achieved a 70% accuracy score. This can be improved if we let the model train for longer, with a bit more hyper-parameter tuning, and by tweaking the model architecture.

Visualizing Model Prediction

Now we want to print out the image with ground truth and model predictions.
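A sketch, reusing images, original, prediction, and class_names from the earlier snippets:

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 3, figsize=(10, 10))
for ax, idx in zip(axes.ravel(), np.random.choice(len(images), 9, replace=False)):
    ax.imshow(images[idx])
    ax.set_title(f'True: {class_names[original[idx]]}\n'
                 f'Pred: {class_names[prediction[idx]]}')
    ax.axis('off')
plt.tight_layout()
plt.show()
```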

Classification Report
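The report (shown as a figure in the original post) can be reproduced with scikit-learn:

```python
from sklearn.metrics import classification_report

print(classification_report(original, prediction, target_names=class_names))
```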

Forest did the best, with 0.95 precision and 0.93 recall.

Glacier is the worst when it comes to precision and f1-score, as it is often mistaken for Mountain.

Confusion Matrix

Building is 0, Forest is 1, Glacier is 2, Mountain is 3, Sea is 4, and Street is 5
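A sketch of the matrix plot; the seaborn heatmap styling is an assumption:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(original, prediction)
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted'); plt.ylabel('True')
plt.show()
```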

The model quite often confused Glacier with Mountain, which is understandable.

Sea is also often confused with Glacier and Mountain, as they are often photographed together.

Visualize Activation Maps through Grad-CAM

One of the major challenges with AI is that while we can build and train a massive model, after the model is trained, the actual weights and all the activations are something of a black box. It is really hard, especially with networks that have millions of parameters, to correlate which weights contributed to the actual outputs. This problem with model interpretability makes it hard to trust the AI’s decision.

To explain how our ResNet-18 model made its decision, we will use Grad-CAM to help visualize the regions of the input that contributed towards the model's predictions. Grad-CAM works by (1) finding the final convolutional layer in the network and then (2) examining the gradient information flowing into that layer (Rosebrock, 2020). Afterwards, it computes an importance score based on the gradients to produce a heatmap, highlighting the important regions within the image that resulted in a given class label. In short, it uses gradients as weights (grad-weights) to highlight important regions in images.

Grad-CAM is an important tool for us to learn to ensure that our model is performing correctly.

(Ahmed, n.d.)

In the above diagram, here are the steps for visualizing Grad-CAM (Ahmed, n.d.):

Step 1: Pass the image through the model to make prediction:

  • Convert the image into an array
  • Reshape the image from (256, 256, 3) to (1, 256, 256, 3)
  • Normalize the image by dividing by 255; we’ll get img_scaled at the end of this step.

Step 2: We create 2 new models: final_conv_model (from input to A) and classification_model (from A to C).

Classification Layers

a. final_conv_model (from input to A /activation):

  • Specify the layer that we’re interested in, the last convolution layer res_5_identity_2_c of our original model, by using get_layer, and store it in final_conv
  • The new final_conv_model consists of our original ResNet-18 model inputs model.inputs along with the final layer output final_conv.output.

b. classification_model (from A to C /classification): the model that takes the activations and contains all the layers up until the class generation step.

  • The input of the new model classification_input is the output of the final_conv from the original ResNet-18 model.
  • classification_layers are the layers that make the predictions.
  • We iterate through classification_layers, which are Averagea_Pooling and Dense_final (together with the Flatten in between), applying each on top of the new input to build our new model classification_model.
  • classification_model receives classification_input which is the final_conv output.
  • We do backpropagation until we reach the final_conv

Step 3: Use GradientTape to monitor final_conv_output:

  • GradientTape retrieves the gradients from the first model final_conv_model.
  • GradientTape records the gradients and stores them in tape.
  • watch monitors the final_conv_output.

Step 4: Use argmax to find the index corresponding to the maximum value in the prediction, which gives the predicted class and its predicted value:

  • Pass feature map final_conv_output generated from the first model final_conv_model and feed it through the second model classification_model to generate prediction.
  • Apply argmax to prediction to get the predicted class, and take that class’s score as predicted_class_value.

Step 5: Calculate the gradient that was used to arrive at that predicted value with respect to the feature-map activations of the convolution layer:

  • gradient extracts the desired gradients (of the predicted class score with respect to the output of the convolutional layer) from tape to get gradient_channels.

Step 6: Multiply each filter’s pooled gradient with that filter’s activations in the last convolutional layer (a linear combination) to weight the activation maps.

Step 7: Perform weighted combination of activation maps and follow it by a ReLU to obtain the heatmap.

Step 8: Super-impose the feature heatmap onto the original image to see the activation locations in the image.

Here is the full code, adapted from the Keras documentation:
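The original gist is not embedded here, so below is a sketch reconstructed from the steps above and the two-model structure of the Keras Grad-CAM example. The layer names come from this post's model; note the Flatten layer between pooling and Dense also has to be included:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
import cv2

def make_gradcam_heatmap(img_scaled, model,
                         last_conv_layer_name='res_5_identity_2_c',
                         classification_layer_names=('Averagea_Pooling',
                                                     'Flatten', 'Dense_final')):
    # Step 2a: final_conv_model maps the input image to the last conv activations (A)
    final_conv = model.get_layer(last_conv_layer_name)
    final_conv_model = keras.Model(model.inputs, final_conv.output)

    # Step 2b: classification_model maps those activations (A) to class scores (C)
    classification_input = keras.Input(shape=final_conv.output.shape[1:])
    x = classification_input
    for layer_name in classification_layer_names:
        x = model.get_layer(layer_name)(x)
    classification_model = keras.Model(classification_input, x)

    # Steps 3-5: record gradients of the winning class score w.r.t. the feature map
    with tf.GradientTape() as tape:
        final_conv_output = final_conv_model(img_scaled)
        tape.watch(final_conv_output)
        prediction = classification_model(final_conv_output)
        predicted_class = tf.argmax(prediction[0])
        predicted_class_value = prediction[:, predicted_class]
    gradient = tape.gradient(predicted_class_value, final_conv_output)
    gradient_channels = tf.reduce_mean(gradient, axis=(0, 1, 2)).numpy()

    # Steps 6-7: weight each activation map by its pooled gradient, then ReLU
    final_conv_output = final_conv_output.numpy()[0]
    for i in range(gradient_channels.shape[-1]):
        final_conv_output[:, :, i] *= gradient_channels[i]
    heatmap = np.mean(final_conv_output, axis=-1)
    heatmap = np.maximum(heatmap, 0) / (np.max(heatmap) + 1e-8)
    return heatmap

# Step 8: upscale the heatmap and superimpose it onto the original image
heatmap = make_gradcam_heatmap(img_scaled, model)
heatmap = np.uint8(255 * cv2.resize(heatmap, (256, 256)))
colored = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
superimposed = cv2.addWeighted(np.uint8(img_scaled[0] * 255), 0.6, colored, 0.4, 0)
```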

Now we are going to visualize the results. Let’s look at 6 samples:

The first column shows the original images, the second column the heatmaps, and the third column the heatmaps superimposed onto the images, showing which portions of the image strongly influence the output.

The light yellow-green areas are what the model is looking at. For example, in the first image, it looked at the sea and blocked out the mountain, hence misclassifying the scene as Sea instead of Glacier.

The second image is kind of confusing: the model looked at the river, yet it classified the Building correctly.

It looked at both the sea and the sky in the third image, yet it still classified it as Sea.

In the fourth and fifth images, it correctly located the trees and buildings.

In the last one, it looked at both the street and the buildings and made the decision to classify the scene as Street.

As far as we can tell, the model does know where to look in an image, but if the image contains multiple components, i.e. a street with tall buildings or a path in a forest, it has trouble deciding which components of the image are more important than the others.

Future Work

  • Try other approaches to explain CNN outputs such as: Activations Visualization, Vanilla Gradients, Occlusion Sensitivity, CNN Fixations, and Class Activation Mapping (CAM)
  • Apply Transfer Learning

Github

Reference

Ahmed, R. (n.d.). Explainable AI: Scene classification and Grad-CAM visualization [MOOC]. Coursera. https://www.coursera.org/projects/scene-classification-gradcam

IBM. (n.d.). Explainable AI. https://www.ibm.com/watson/explainable-ai

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.

King, J., Kishore, V., & Ranalli, F. (2017). Scene classification with Convolutional Neural Networks.

Rosebrock, A. (2020, March 9). Grad-CAM: Visualize Class Activation Maps with Keras, TensorFlow, and Deep Learning. PyImageSearch. Retrieved September 10, 2021, from https://www.pyimagesearch.com/2020/03/09/grad-cam-visualize-class-activation-maps-with-keras-tensorflow-and-deep-learning/.
