Cleon W

Summary

This article discusses a deep learning project aimed at identifying mass abnormalities in mammogram scans using an image segmentation model.

Abstract

This article is Part 1 of a 3-part series that walks through a deep learning project of identifying mass abnormalities in mammogram scans using an image segmentation model. The article covers the problem statement, explains what semantic segmentation is, guides the reader on downloading the dataset, explains what the dataset contains, and discusses the folder structure of the dataset. The chosen dataset for this project is CBIS-DDSM, which contains real-world mammogram scans that are "messy" enough to require robust and intentional image preprocessing. The article also provides a brief overview of the DICOM format and how to work with DICOM in Python.

Opinions

  • The CBIS-DDSM dataset is suitable for computer vision projects of intermediate complexity.
  • The CBIS-DDSM dataset is large enough to conduct decent model training.
  • The CBIS-DDSM dataset contains real-world mammogram scans that require robust and intentional image preprocessing.
  • The article provides a brief overview of the DICOM format and how to work with DICOM in Python.
  • The article provides a guide on how to download the dataset and explains the folder structure of the dataset.
  • The article explains what semantic segmentation is and how it differs from other types of image segmentation techniques.
  • The article provides a problem statement for the deep learning project.

Click to navigate: Part 1 -> Part 2 -> Part 3

Segmenting Abnormalities in Mammograms (Part 1 of 3)

A step-by-step guide to implementing a deep learning semantic segmentation pipeline on mammograms in TensorFlow 2

Image by author. Mammograms and masks retrieved from CBIS-DDSM.

If you are reading this article, chances are that we share similar interests and are in similar industries. So let’s connect via LinkedIn, where I share tidbits of my thoughts and resources about AI and ML!

Article Structure

This article is Part 1 of a 3-part series that walks through how I tackled a deep learning project of identifying mass abnormalities in mammogram scans using an image segmentation model. As a result of breaking down the project in detail, this serves as a comprehensive overview of one of the core problems in computer vision — semantic segmentation, as well as a deep dive into the technicalities of executing this project in TensorFlow 2.

Part 1:

  • Problem statement.
  • What is semantic segmentation.
  • Guide to downloading the dataset.
  • What you’ll find in the dataset.
  • Unravelling the nested folder structure of the dataset.
  • Data exploration.

Part 2:

  • Image preprocessing pipeline overview.
  • General issues with the raw mammograms.
  • Deep dive into raw mammogram’s preprocessing pipeline.
  • Deep dive into corresponding mask’s preprocessing pipeline.

Part 3:

  • Introducing the VGG-16 U-Net model.
  • Implementing the model in TensorFlow 2.
  • Notes on training the model.
  • Results and post analysis.
  • Wrapping up.

GitHub Repository

The code for this project can be found on my GitHub in this repository.

1. Problem Statement

The goal of the project is to segment mass abnormalities in scanned film mammogram images.

Full mammogram scans will serve as 2D inputs into an image segmentation model with their respective binary masks as the ground truth labels. The model will output a predicted mask for each mammogram.

Fig 1. Brief overview of model pipeline, input and output images and input label. Mammograms and masks retrieved from CBIS-DDSM. Image drawn by author.

1.1. The Chosen Dataset — CBIS-DDSM

I chose the CBIS-DDSM dataset because it is suitable for computer vision projects of intermediate complexity. With 2,620 scanned film mammography images, it is large enough for decent model training. Furthermore, since the dataset contains real-world mammogram scans, they are “messy” enough that robust and intentional image preprocessing is needed to achieve decent results on the task at hand. Image preprocessing is covered in Part 2.

Most interestingly, each mammogram in the dataset comes with a binary mask that indicates only a general position of the abnormalities. Since these masks do not provide precise segmentations, there is a need to implement segmentation algorithms for accurate feature extraction and diagnosis.

These are the main reasons why I have chosen the CBIS-DDSM dataset to tackle the semantic segmentation task.

Fig 2. Example of raw mammogram scans and their respective binary masks. Each column represents a unique patient. Overlays (bottom row) are generated by me and not provided in the dataset. Mammograms and masks retrieved from CBIS-DDSM. Image drawn by author.

2. What Is Semantic Segmentation?

Simply put, any kind of segmentation (yes, there is more than one kind of segmentation and yes, there are other methods apart from segmentation, namely localisation and detection) answers the following question:

“Where is the object of interest located in the image?”

Finding where the objects of interest are in an image is a natural next step from image classification in scene understanding. Image classification tells us “what is in the image” (i.e. it makes a prediction about the entire input image). Localisation, detection and segmentation then tell us “where is [the object of interest] located in the image”.

Fig 3. Image classification vs segmentation. Image drawn by author.

There are two main forms of pose information that tell us where objects are located in an image: bounding boxes and masks. Object localisation models and object detection models output predicted bounding boxes, while image segmentation models output predicted masks. Image segmentation can then be further broken down into semantic segmentation models and instance segmentation models. As such, the four main types of techniques for locating objects in images are:

  • Object localisation: involves locating one instance of an object class (a.k.a. label), usually by predicting a tightly cropped bounding box centred on the instance. It usually comes with classification; the common term that you will encounter is ‘classification + localisation’.
  • Object detection: involves detecting multiple instances of one or more object classes in an image. Similar to object localisation, it predicts a bounding box around each detected instance of every object class.
  • Semantic segmentation: predicts, for each pixel of an image, the object class (amongst two or more object classes) it belongs to. All object classes must be known to the model. The output is a predicted mask.
  • Instance segmentation: a more elaborate form of semantic segmentation. The difference is that it is able to differentiate two instances of the same object class. For example, it is able to separate one pedestrian from another pedestrian in an image of a sidewalk. The output is also a predicted mask.
Fig 4. Illustration of the outputs from each of the 4 above-mentioned techniques. Note the difference between semantic segmentation and instance segmentation. Mammograms and masks retrieved from CBIS-DDSM. Image drawn by author.
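
To make the difference between these outputs concrete, here is a small sketch on a hand-made toy array (the array and its values are made up for illustration and are not taken from CBIS-DDSM). It assumes NumPy is installed and shows how one instance-labelled mask relates to a semantic mask and to bounding boxes.

```python
import numpy as np

# Hand-made 4x7 "image" with two separate objects of the same class.
# 0 = background; 1 and 2 are instance IDs (illustrative values only).
instance_mask = np.array([
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2, 0],
    [0, 0, 0, 0, 2, 2, 0],
    [0, 0, 0, 0, 0, 0, 0],
])

# Semantic segmentation: every pixel gets a class (0 = background, 1 = object),
# so the two objects are no longer distinguishable from each other.
semantic_mask = (instance_mask > 0).astype(np.uint8)


def bounding_box(binary_mask):
    """Return (row_min, col_min, row_max, col_max) of the non-zero pixels."""
    rows, cols = np.nonzero(binary_mask)
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())


# Object detection: one bounding box per instance.
detection_boxes = [bounding_box(instance_mask == i) for i in (1, 2)]

# Object localisation: a single box for a single instance (here, the first one).
localisation_box = detection_boxes[0]

print(detection_boxes)
```

In this project only the semantic mask matters: the model predicts, per pixel, whether it belongs to a mass abnormality or to the background.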

This is my understanding of these techniques. However, note that these terms are not as clearly defined in the scientific community as we would like, so you may encounter slightly different meanings for any of them. You may refer to this, this and this article for a more thorough understanding of the above four concepts.

3. Downloading the Data

The dataset can be found here, from The Cancer Imaging Archive.

Downloading the data is pretty straightforward if you follow the general steps below. Note that I am working on a Mac, so there might be slight differences when working on other systems.

Step 1 — Install the NBIA Data Retriever from the Mac App Store. Follow this link for detailed instructions.

Step 2 — Download the .tcia manifest file to your local computer.

Step 3 — Open the just-downloaded .tcia manifest file using the NBIA Data Retriever.

Step 4 — In the NBIA Data Retriever, click ‘Browse’ to select the directory that you would like to save the dataset in.

Step 5 — Then click ‘Start’ to start downloading the dataset.

Heads up: because this dataset is around 160 GB, it might take a while to download.

4. What You’ll Find in the Dataset

The dataset contains mammograms of breasts with calcifications, mass abnormalities, or both. This article does a good summary of what masses and calcifications are.

The messy and nested folder structure that you’ll see in the downloaded dataset is explained and resolved in Section 5. For now, we will focus on understanding the types of data in the dataset.

The dataset contains two main types of files: DICOM (.dcm) files and .csv files.

Fig 5. Summary of the file types that can be found in the CBIS-DDSM dataset. Mammograms and masks retrieved from CBIS-DDSM. Image drawn by author.

4.1. The .dcm files

The images in the CBIS-DDSM dataset are by default decompressed into the DICOM format. Refer to this, this and this for a brief overview of the DICOM format and how to work with DICOM in Python.

A patient can have two mammogram scans done for the same breast, each in a different view (namely MLO and CC). Each of these mammograms comes with the following 3 kinds of images:

  1. Full mammogram scan: The actual mammogram.
  2. Cropped image: A zoomed-in crop of the mass abnormality.
  3. Region of interest (ROI) mask: The ground truth binary mask that locates the mass abnormality.

Each of these images is decompressed into its own .dcm file (i.e. a breast with only 1 mass abnormality will have 3 .dcm files, one for each of the 3 kinds of images). Each .dcm file contains the image (in the form of arrays) and other information about the scan such as Patient's ID, Patient Orientation, Series Description, Modality and Laterality.

You will find breasts with more than 1 mass abnormality. These cases will have a cropped image and an ROI mask for each mass abnormality.
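
As a rough sketch of how the contents of a .dcm file can be inspected in Python, the snippet below assumes the pydicom package (`pip install pydicom`) and uses its `dcmread()` function and `pixel_array` attribute. The helper names `summarise_dicom()` and `load_and_summarise()` are hypothetical and are not part of the project's repository. The demo at the bottom uses a stand-in object with made-up values so that it runs without a real file.

```python
from types import SimpleNamespace

# Metadata fields named in the text, written as pydicom attribute keywords.
METADATA_KEYWORDS = (
    "PatientID",
    "PatientOrientation",
    "SeriesDescription",
    "Modality",
    "Laterality",
)


def summarise_dicom(ds):
    """Collect the image shape and selected metadata from a DICOM dataset.

    Fields that are absent from the file are reported as None.
    """
    summary = {key: getattr(ds, key, None) for key in METADATA_KEYWORDS}
    summary["shape"] = tuple(ds.pixel_array.shape)  # image as a 2D array
    return summary


def load_and_summarise(path):
    """Read a .dcm file from disk and summarise it (requires pydicom)."""
    import pydicom  # imported lazily so the demo below runs without it

    return summarise_dicom(pydicom.dcmread(path))


# Stand-in for a real dataset; every value here is made up for illustration.
fake_scan = SimpleNamespace(
    PatientID="EXAMPLE_ID",
    Modality="MG",
    pixel_array=SimpleNamespace(shape=(4, 3)),
)
summary = summarise_dicom(fake_scan)
print(summary)
```

With a real download, `load_and_summarise("path/to/1-1.dcm")` would be the entry point; the path here is only a placeholder.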

4.2. The .csv files

The .csv files serve as a directory for the mammogram scans. There are 4 .csv files:

  1. Calc-Test-Description.csv
  2. Calc-Train-Description.csv
  3. Mass-Test-Description.csv
  4. Mass-Train-Description.csv

Each of these files contains information about each mammogram such as breast density, image view and pathology. Notice that there are repeats of Patient ID down the rows. Again, this shows that a single patient can have multiple mammograms (either multiple views of the same breast, mammograms of the left and right breast, or both).
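
One way to see these repeated patient IDs is to count the rows per patient. The sketch below uses only the Python standard library and runs on a few made-up rows. The column names (`patient_id`, `image view` and so on) are assumptions and should be checked against the header of the real .csv files, and `rows_per_patient()` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up rows in the spirit of Mass-Train-Description.csv. The column names
# are assumptions; check them against the real file before relying on them.
SAMPLE_CSV = """patient_id,left or right breast,image view,pathology
P_00001,LEFT,CC,MALIGNANT
P_00001,LEFT,MLO,MALIGNANT
P_00004,LEFT,CC,BENIGN
P_00004,LEFT,MLO,BENIGN
P_00004,RIGHT,MLO,BENIGN
P_00009,RIGHT,CC,MALIGNANT
"""


def rows_per_patient(csv_file, id_column="patient_id"):
    """Count how many rows each patient ID has in a description file."""
    return Counter(row[id_column] for row in csv.DictReader(csv_file))


counts = rows_per_patient(io.StringIO(SAMPLE_CSV))

# Keep only the patients that appear on more than one row.
repeated = {pid: n for pid, n in counts.items() if n > 1}
print(repeated)  # {'P_00001': 2, 'P_00004': 3}
```

For a real file, pass an open file handle instead of the `io.StringIO` object, e.g. `open("Mass-Train-Description.csv", newline="")`.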

5. Unravelling the Nested Folder Structure of the Dataset

One step that is often left unexplained in other projects that use the CBIS-DDSM dataset is reorganising the folder structure into one that is easy to work with.

From the snippet below, we see that the original folder structure of the downloaded dataset is nested, has non-descriptive subfolder names and has non-unique .dcm file names. These make it extremely challenging to feed images into the image preprocessing pipeline (and eventually the model training pipeline). Hence, we will write some code to create a new folder structure to resolve these issues.

5.1. Explaining the original folder structure

Folder structure BEFORE restructuring
=====================================
CBIS-DDSM
│
├── Calc-Test_P_00038_LEFT_CC
│   └── 1.3.6.1.4.1.9590...
│       └── 1.3.6.1.4.1.9590..
│           └── 1-1.dcm <--- full mammogram scan
│
├── Calc-Test_P_00038_LEFT_CC_1
│   └── 1.3.6.1.4.1.9590...
│       └── 1.3.6.1.4.1.9590...
│           ├── 1-1.dcm <--- binary mask? cropped image?
│           └── 1-2.dcm <--- cropped image? binary mask?
...
┌─────────────────┐
│ bold : folder   │
│ non-bold : file │
└─────────────────┘

Patient P_00038’s left CC mammogram scan is saved as a 1-1.dcm file under the parent folder Calc-Test_P_00038_LEFT_CC. Breaking down the parent folder name, we have:

  • Calc (or Mass): The type of abnormality present in the mammogram.
  • Test (or Train): The mammogram belongs to the test set (CBIS-DDSM has already split the mammograms into train and test sets).
  • P_00038: The patient’s ID.
  • LEFT (or RIGHT): The breast that was scanned (here, the left breast).
  • CC (or MLO): The view that the mammogram scan was taken in.

Patient P_00038’s left CC mammogram scan has a corresponding binary mask and cropped image. These are saved in a separate folder Calc-Test_P_00038_LEFT_CC_1 (notice the _1 at the end of the folder name). Here comes the messy bit: it is impossible to tell from the filenames alone which of 1-1.dcm and 1-2.dcm is the binary mask and which is the cropped image. If a mammogram has more than one abnormality, then each abnormality’s corresponding mask and cropped image will be saved in a similar fashion but in separate folders ending with _2, _3, and so on.
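
The naming convention described above can be captured in a regular expression. The sketch below is only an illustration: `parse_folder_name()` and the dictionary keys are hypothetical names, and the pattern follows the folder names exactly as written in this article (if your download spells the split differently, the pattern needs adjusting).

```python
import re

# Matches names such as "Calc-Test_P_00038_LEFT_CC" and
# "Calc-Test_P_00038_LEFT_CC_1" (the trailing number is optional).
FOLDER_PATTERN = re.compile(
    r"^(?P<abnormality>Calc|Mass)-(?P<split>Test|Train)"
    r"_(?P<patient_id>P_\d+)_(?P<side>LEFT|RIGHT)_(?P<view>CC|MLO)"
    r"(?:_(?P<abnormality_index>\d+))?$"
)


def parse_folder_name(name):
    """Split a top-level CBIS-DDSM folder name into its components."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognised folder name: {name!r}")
    parts = match.groupdict()
    # No trailing number means the folder holds the full mammogram scan;
    # a trailing number identifies the abnormality (mask + cropped image).
    index = parts["abnormality_index"]
    parts["abnormality_index"] = int(index) if index is not None else None
    return parts


full_scan = parse_folder_name("Calc-Test_P_00038_LEFT_CC")
first_abnormality = parse_folder_name("Calc-Test_P_00038_LEFT_CC_1")
print(first_abnormality)
```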

5.2. Creating the new folder structure

The snippet below shows the new folder structure after restructuring. It is now no longer nested and individual files have unique and descriptive filenames.

Folder structure AFTER restructuring
=====================================
CBIS-DDSM
│
├── Calc
│   ├── Test
│   │  ├── Calc-Test_P_00038_LEFT_CC_FULL.dcm
│   │  ├── Calc-Test_P_00038_LEFT_CC_CROP_1.dcm
│   │  ├── Calc-Test_P_00038_LEFT_CC_MASK_1.dcm
│   │  ...
│   │  └── Calc-Test_P_XXXXX_LEFT_MLO_MASK_1.dcm
│   │
│   └── Train
│       ├── Calc-Train_P_XXXXX_LEFT_MLO_FULL.dcm
│       ...
│       └── Calc-Train_P_XXXXX_RIGHT_CC_MASK_1.dcm
│
└── Mass
    ├── Test
    │  ├── Mass-Test_P_XXXXX_LEFT_CC_FULL.dcm
    │  ...
    │  └── Mass-Test_P_XXXXX_LEFT_MLO_MASK_1.dcm
    │
    └── Train
        ├── Mass-Train_P_XXXXX_LEFT_MLO_FULL.dcm
        ...
        └── Mass-Train_P_XXXXX_RIGHT_CC_MASK_1.dcm
┌─────────────────┐
│ bold : folder   │
│ non-bold : file │
└─────────────────┘

I created a set of helper functions to achieve the organised folder structure above.

  • new_name_dcm() reads the .dcm file and renames it from 1-1.dcm or 1-2.dcm to a more descriptive name.
  • move_dcm_up() then moves the renamed .dcm file up the nested folder structure into its parent folder.
  • delete_empty_folder() then deletes any empty folders (after the recursive naming and moving of .dcm files).

On top of these, count_dicom() counts the number of .dcm files before and after the restructuring, to make sure that the two counts are the same. The main function that does this restructuring, together with the details of each helper function, can be found in the project’s repository.
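
As a rough illustration of the idea, here is a simplified sketch that produces the flat layout shown above. It is not the code from the repository: `new_path()` and `restructure()` are hypothetical names, and `classify` is a placeholder for whatever logic decides whether a file is the full scan, the cropped image or the mask (in practice this means reading the .dcm file, since the filename alone does not tell). The demo runs on placeholder text files in a temporary directory, not on real DICOM files.

```python
import shutil
import tempfile
from pathlib import Path


def new_path(root, folder_name, kind):
    """Build the flat, descriptive destination path for one .dcm file.

    folder_name: top-level folder, e.g. "Calc-Test_P_00038_LEFT_CC_1"
    kind: "FULL", "CROP" or "MASK"
    """
    abnormality, rest = folder_name.split("-", 1)  # "Calc", "Test_P_..."
    split = rest.split("_", 1)[0]                  # "Test" or "Train"
    if kind == "FULL":
        filename = f"{folder_name}_FULL.dcm"
    else:
        base, index = folder_name.rsplit("_", 1)   # strip the trailing "_1"
        filename = f"{base}_{kind}_{index}.dcm"
    return Path(root) / abnormality / split / filename


def restructure(root, classify):
    """Move every nested .dcm file under root to its new flat location."""
    root = Path(root)
    dcm_files = sorted(root.rglob("*.dcm"))  # snapshot before moving anything
    for src in dcm_files:
        top_folder = src.relative_to(root).parts[0]
        dst = new_path(root, top_folder, classify(src))
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
    # Remove the folders that are now empty, deepest first.
    folders = sorted((p for p in root.rglob("*") if p.is_dir()), reverse=True)
    for folder in folders:
        if not any(folder.iterdir()):
            folder.rmdir()
    return len(dcm_files)


# Demo on a throwaway directory with placeholder text files (not real DICOMs).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    nested = root / "Calc-Test_P_00038_LEFT_CC_1" / "uid-a" / "uid-b"
    nested.mkdir(parents=True)
    (nested / "1-1.dcm").write_text("MASK")
    (nested / "1-2.dcm").write_text("CROP")
    moved = restructure(root, classify=lambda path: path.read_text())
    result = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.dcm"))
    leftover = (root / "Calc-Test_P_00038_LEFT_CC_1").exists()

print(moved, result)
```

The sketch is meant to be run once on a freshly downloaded dataset; comparing the returned count with the number of .dcm files afterwards plays the same role as the before-and-after check described above.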

6. Data Exploration

6.1. Dealing with Cases with More than One Abnormality

Fig 6. Distribution of mammograms that contain calcifications and mammograms that contain mass abnormalities. Image drawn by author.

Of the 2,620 mammogram scans in the dataset, 1,592 contain mass abnormalities (the rest contain only calcification abnormalities). These 1,592 mammograms are the ones that we will be working with. Of these 1,592 scans, 71 contain more than 1 mass abnormality. These 71 mammograms will thus have more than 1 binary mask (1 for each abnormality).

Fig 7. Distribution of mass abnormality count in mammograms (between those with 1 mass abnormality and those with >1 mass abnormalities). Image drawn by author.
Fig 8. Example of a mammogram with more than 1 abnormality. Mammograms and masks retrieved from CBIS-DDSM. Illustrations by author.

It is important to take note of the mammograms with more than 1 mass abnormality because their masks should not be treated as separate labels. Intuitively, if we treat each mask as a separate label for the same image, the model would be confused each time it sees a different mask for the same image. Hence, we should sum the masks into a single mask and use this summed mask as the only label for the mammogram. The code for summing the masks is covered in Part 2.

Fig 9. Example of summing masks of mammograms with more than 1 mass abnormality. The summed mask will be used as the only label for its corresponding input image when training the segmentation model. Mammograms and masks retrieved from CBIS-DDSM. Illustrations by author.
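
The actual implementation is left to Part 2, but the idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: masks share the same shape, 0 marks background and any non-zero value marks a mass. Note that it combines the masks with a logical OR rather than a plain arithmetic sum, so that overlapping regions stay binary. `sum_masks()` is a hypothetical name and the toy arrays are made up.

```python
import numpy as np


def sum_masks(masks):
    """Combine several binary masks of the same shape into one binary mask."""
    # Stack into shape (n_masks, height, width) as booleans.
    stacked = np.stack([np.asarray(m) > 0 for m in masks])
    # A plain sum would give 2 where masks overlap; "any" keeps the result
    # binary (1 wherever at least one mask marks a mass).
    return stacked.any(axis=0).astype(np.uint8)


# Two toy 3x4 masks that overlap in one pixel (row 1, column 1).
mask_1 = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
])
mask_2 = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
])

combined = sum_masks([mask_1, mask_2])
print(combined)
```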

Up Next, Part 2: Image Preprocessing

In this article, we covered the motivation behind our problem statement, what semantic segmentation is, and the important data in the CBIS-DDSM dataset that we will be using.

In Part 2, we will be breaking down the intuition of the various image preprocessing techniques that I employed, as well as the code implementation.

As always, the code for this project can be found on my GitHub in this repository.

See you in Part 2 and Part 3!

Thank you

If you’ve made it to the end of this article, I hope that you enjoyed the read. If this article brought some inspiration, value or help to your own projects, feel free to share this with your community. Also, any constructive questions, feedback or discussions are definitely welcome, so please feel free to either comment down below or reach out to me on LinkedIn here or Twitter at @CleonW_.

Follow me on Medium (Cleon Wong) to stay in the loop for my next articles!
