Segmenting Abnormalities in Mammograms (Part 1 of 3)
A step-by-step guide to implementing a deep learning semantic segmentation pipeline on mammograms in TensorFlow 2
Article Structure
This article is Part 1 of a 3-part series that walks through how I tackled a deep learning project of identifying mass abnormalities in mammogram scans using an image segmentation model. As a result of breaking down the project in detail, this serves as a comprehensive overview of one of the core problems in computer vision — semantic segmentation, as well as a deep dive into the technicalities of executing this project in TensorFlow 2.
Part 1:
- Problem statement.
- What is semantic segmentation.
- Guide to downloading the dataset.
- What you’ll find in the dataset.
- Unravelling the nested folder structure of the dataset.
- Data exploration.
Part 2:
- Image preprocessing pipeline overview.
- General issues with the raw mammograms.
- Deep dive into raw mammogram’s preprocessing pipeline.
- Deep dive into corresponding mask’s preprocessing pipeline.
Part 3:
- Introducing the VGG-16 U-Net model.
- Implementing the model in TensorFlow 2.
- Notes on training the model.
- Results and post analysis.
- Wrapping up.
GitHub Repository
The code for this project can be found on my GitHub in this repository.
1. Problem Statement
The goal of the project is to segment mass abnormalities in scanned film mammogram images.
Full mammogram scans will serve as 2D inputs into an image segmentation model with their respective binary masks as the ground truth labels. The model will output a predicted mask for each mammogram.
1.1. The Chosen Dataset — CBIS-DDSM
I chose the CBIS-DDSM dataset because it is suitable for computer vision projects of intermediate complexity. With 2,620 scanned film mammography images, it is large enough for decent model training. Furthermore, since the CBIS-DDSM dataset contains real-world mammogram scans, the images are “messy” enough that robust and intentional image preprocessing is needed to achieve decent results on the task at hand. Image preprocessing is covered in Part 2.
Most interestingly, each mammogram in the dataset comes with a binary mask that indicates only a general position of the abnormalities. Since these masks do not provide precise segmentations, there is a need to implement segmentation algorithms for accurate feature extraction and diagnosis.
These are the main reasons why I have chosen the CBIS-DDSM dataset to tackle the semantic segmentation task.
2. What Is Semantic Segmentation?
Simply put, any kind of segmentation (yes, there is more than one kind of segmentation and yes, there are other methods apart from segmentation, namely localisation and detection) answers the following question:
“Where is the object of interest located in the image?”
Finding where the objects of interest are in an image is a natural next step after image classification in scene understanding. Image classification tells us “what is in the image” (i.e. it makes a prediction about the entire input image). Localisation, detection and segmentation then tell us “where is [the object of interest] located in the image”.
There are two main forms of pose information that tell us where objects are located in an image: bounding boxes and masks. Object localisation models and object detection models output predicted bounding boxes, while image segmentation models output predicted masks. Image segmentation can then be further broken down into semantic segmentation and instance segmentation. As such, the four main types of techniques for locating objects in images are:
- Object localisation: involves locating one instance of an object class (a.k.a. label), usually by predicting a tightly cropped bounding box centred on the instance. It usually comes with classification; the common term you will encounter is ‘classification + localisation’.
- Object detection: involves detecting multiple instances of one or more object classes in an image. Similar to object localisation, it predicts a bounding box around each detected instance of every object class.
- Semantic segmentation: predicts for each pixel of an image the object class (amongst two or more object classes) it belongs to. All object classes must be known to the model. The output is a predicted mask.
- Instance segmentation: is a more elaborate form of semantic segmentation. The difference is that it is able to differentiate two instances of the same object class. For example, it is able to separate one pedestrian from another pedestrian in an image of a sidewalk. The output is also a predicted mask.
This is my understanding of the techniques. However, note that these terms are not as clearly defined in the scientific community as we would like, so you may encounter slightly different meanings for any of them. You may refer to this, this and this article for a more thorough understanding of the above four concepts.
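To make the difference between a mask and a bounding box concrete, here is a minimal NumPy sketch. It uses a made-up 4×4 probability map standing in for a model's per-pixel output (these are illustrative values, not real model output), thresholds it at an assumed 0.5 to get a binary mask, and then derives the tight bounding box that encloses the mask. The function names are hypothetical.

```python
import numpy as np


def probabilities_to_mask(probs, threshold=0.5):
    """Binary segmentation: label each pixel 1 (object) or 0 (background)."""
    return (probs >= threshold).astype(np.uint8)


def mask_to_bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max) of the foreground, or None."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None  # no foreground pixels, so there is nothing to box
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())


# Toy per-pixel foreground probabilities (illustrative values only).
probs = np.array([
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.7, 0.6, 0.2],
    [0.1, 0.1, 0.2, 0.1],
])

mask = probabilities_to_mask(probs)
box = mask_to_bounding_box(mask)
print(mask)
print(box)
```

The mask keeps the exact shape of the object, while the bounding box only records its extent. That is why a mask can always be reduced to a box, but not the other way round.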
3. Downloading the Data
The dataset can be found here, from The Cancer Imaging Archive.
Downloading the data is pretty straightforward if you follow the general steps below. Note that I am working on a Mac, so there might be slight differences on other systems.
Step 1: Install the NBIA Data Retriever from the Mac App Store. Follow this link for detailed instructions.
Step 2: Download the .tcia manifest file to your local computer.
Step 3: Open the just-downloaded .tcia manifest file using the NBIA Data Retriever.
Step 4: In the NBIA Data Retriever, click ‘Browse’ to select the directory that you would like to save the dataset in.
Step 5: Click ‘Start’ to start downloading the dataset.
Heads up, because this dataset is around 160GB, it might take a while to download.
4. What you’ll find in the dataset
The dataset contains mammograms of breasts with calcifications, mass abnormalities, or both. This article gives a good summary of what masses and calcifications are.
The messy and nested folder structure that you’ll see in the downloaded dataset is explained and resolved in Section 5 below. For now, we will focus on understanding the types of data that are in the dataset.
The dataset contains two main types of files: DICOM (.dcm) files and .csv files.
4.1. The .dcm files
The images in the CBIS-DDSM dataset are by default decompressed into the DICOM format. Refer to this, this and this for a brief overview of the DICOM format and how to work with DICOM in Python.
A patient can have two mammogram scans done for the same breast, each in a different view (namely MLO and CC). Each of these mammograms comes with the following 3 kinds of images:
- Full mammogram scan: The actual mammogram.
- Cropped image: A zoomed-in crop of the mass abnormality.
- Region of interest (ROI) mask: The ground truth binary mask that locates the mass abnormality.
Each of these images is decompressed into its own .dcm file, so there is a separate .dcm file for each of the 3 kinds of images (i.e. a breast with only 1 mass abnormality will have 3 .dcm files). Each .dcm file contains the image (in the form of arrays) and other information about the scan such as Patient's ID, Patient Orientation, Series Description, Modality and Laterality.
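As a rough sketch of how such a file can be inspected in Python, the snippet below uses the third-party pydicom package (`pydicom.dcmread` and the dataset's `pixel_array` attribute). The function names are hypothetical, and the attribute names are the standard DICOM keywords for the fields listed above; whether every file in the dataset actually populates all of them is an assumption, so missing fields fall back to `None`.

```python
from types import SimpleNamespace

# Standard DICOM keywords for the header fields mentioned above.
FIELDS = ("PatientID", "PatientOrientation", "SeriesDescription",
          "Modality", "Laterality")


def read_dicom(path):
    """Read one .dcm file and return (pixel array, dataset)."""
    import pydicom  # third-party: pip install pydicom (also needs numpy)
    ds = pydicom.dcmread(path)
    return ds.pixel_array, ds


def summarise_metadata(ds):
    """Collect the header fields, using None for any that are absent."""
    return {name: getattr(ds, name, None) for name in FIELDS}


# Stand-in object so the sketch runs without a real file on disk.
fake_ds = SimpleNamespace(PatientID="P_00038", Modality="MG")
print(summarise_metadata(fake_ds))
```

With a real file, `pixels, ds = read_dicom("1-1.dcm")` followed by `summarise_metadata(ds)` would give the same kind of dictionary.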
You will find breasts with more than one mass abnormality. These cases will have a cropped image and an ROI mask for each mass abnormality.
4.2. The .csv files
The .csv files serve as a directory for the mammogram scans. There are 4 .csv files:
- Calc-Test-Description.csv
- Calc-Train-Description.csv
- Mass-Test-Description.csv
- Mass-Train-Description.csv
Each of these files contains information about each mammogram such as breast density, image view and pathology. Notice that there are repeats of Patient ID down the rows. Again, this shows that a single patient can have multiple mammograms (either multiple views of the same breast, mammograms of the left and right breast, or both).
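The repeated patient IDs can be checked with a few lines of standard-library Python. The sketch below is only illustrative: the rows are made up, and the column names (`patient_id`, `left or right breast`, `image view`) are assumptions that should be checked against the header of your own .csv files.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration; the column names are assumed, not verified.
SAMPLE_CSV = """patient_id,left or right breast,image view,pathology
P_AAAAA,LEFT,CC,MALIGNANT
P_AAAAA,LEFT,MLO,MALIGNANT
P_BBBBB,RIGHT,CC,BENIGN
"""


def views_per_patient(csv_file):
    """Map each patient ID to the set of (breast side, view) pairs listed."""
    views = defaultdict(set)
    for row in csv.DictReader(csv_file):
        views[row["patient_id"]].add(
            (row["left or right breast"], row["image view"])
        )
    return dict(views)


views = views_per_patient(io.StringIO(SAMPLE_CSV))
for patient, pairs in sorted(views.items()):
    print(patient, sorted(pairs))
```

For a real file, pass `open("Mass-Train-Description.csv", newline="")` instead of the in-memory sample.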
5. Unravelling the Nested Folder Structure of the Dataset
One step that is often left unexplained in other projects that use the CBIS-DDSM dataset is reorganising the folder structure into one that is easy to work with.
From the snippet below, we see that the original folder structure of the downloaded dataset is nested, has non-descriptive subfolder names and has non-unique .dcm file names. These make it extremely challenging to feed images into the image preprocessing pipeline (and eventually the model training pipeline). Hence, we will write some code to create a new folder structure to resolve these issues.
5.1. Explaining the original folder structure
Folder structure BEFORE restructuring
=====================================
CBIS-DDSM
│
├── Calc-Test_P_00038_LEFT_CC
│   └── 1.3.6.1.4.1.9590...
│       └── 1.3.6.1.4.1.9590..
│           └── 1-1.dcm <--- full mammogram scan
│
├── Calc-Test_P_00038_LEFT_CC_1
│   └── 1.3.6.1.4.1.9590...
│       └── 1.3.6.1.4.1.9590...
│           ├── 1-1.dcm <--- binary mask? cropped image?
│           └── 1-2.dcm <--- cropped image? binary mask?
...
┌─────────────────┐
│ bold : folder │
│ non-bold : file │
└─────────────────┘
Patient P_00038’s left CC mammogram scan is saved as a 1-1.dcm file under the parent folder Calc-Test_P_00038_LEFT_CC. Breaking down the parent folder name, we have:
- Calc (or Mass): The type of abnormality present in the mammogram.
- Test (or Train): The set the mammogram belongs to, here the test set (CBIS-DDSM has already split the mammograms into train and test sets).
- P_00038: The patient’s ID.
- LEFT (or RIGHT): The breast that was scanned, here the left breast.
- CC (or MLO): The orientation that the mammogram scan was done in.
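The naming convention above is regular enough to parse with a regular expression. The following is a minimal sketch; the function name, the group names and the optional trailing abnormality number are my own choices for illustration.

```python
import re

# One named group per component of the folder name described above.
FOLDER_PATTERN = re.compile(
    r"^(?P<abnormality>Calc|Mass)-(?P<split>Test|Train)"
    r"_(?P<patient_id>P_\d+)_(?P<side>LEFT|RIGHT)_(?P<view>CC|MLO)"
    r"(?:_(?P<abnormality_number>\d+))?$"
)


def parse_folder_name(name):
    """Split a top-level folder name into its components.

    abnormality_number is None for the full-mammogram folder, which has
    no numeric suffix.
    """
    match = FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected folder name: {name!r}")
    return match.groupdict()


print(parse_folder_name("Calc-Test_P_00038_LEFT_CC"))
print(parse_folder_name("Calc-Test_P_00038_LEFT_CC_1"))
```

Raising on anything that does not match makes stray or unexpectedly named folders surface early instead of being silently mis-parsed.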
Patient P_00038’s left CC mammogram scan has its corresponding binary mask and cropped image. These are saved in a separate folder Calc-Test_P_00038_LEFT_CC_1 (notice the _1 at the end of the folder name). Here comes the messy bit: it is impossible to tell from the filenames alone whether 1-1.dcm and 1-2.dcm are the binary mask and the cropped image respectively, or vice versa. If a mammogram has more than one abnormality, then each abnormality’s corresponding mask and cropped image will be saved in a similar fashion but in separate folders ending with _2, _3, and so on.
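Since the filenames do not help, the file contents have to. One possible heuristic, sketched below with NumPy, is to look at the pixel values: a binary mask should contain at most two distinct values, while a cropped greyscale image should contain many. This is an assumption for illustration, not necessarily how the project's helper functions decide, and the toy arrays stand in for real pixel data.

```python
import numpy as np


def looks_like_binary_mask(pixels):
    """True if the image has at most two distinct pixel values."""
    return np.unique(pixels).size <= 2


# Toy stand-ins for the pixel arrays of the two files (illustrative only).
toy_mask = np.array([[0, 0, 255], [0, 255, 255]], dtype=np.uint8)
toy_crop = np.array([[12, 40, 200], [90, 130, 255]], dtype=np.uint8)

print(looks_like_binary_mask(toy_mask))
print(looks_like_binary_mask(toy_crop))
```

Metadata stored inside the .dcm file, such as the Series Description, is another place to look when it is populated.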
5.2. Creating the new folder structure
The snippet below shows the new folder structure after restructuring. It is now no longer nested and individual files have unique and descriptive filenames.
Folder structure AFTER restructuring
=====================================
CBIS-DDSM
│
├── Calc
│   ├── Test
│   │   ├── Calc-Test_P_00038_LEFT_CC_FULL.dcm
│   │   ├── Calc-Test_P_00038_LEFT_CC_CROP_1.dcm
│   │   ├── Calc-Test_P_00038_LEFT_CC_MASK_1.dcm
│   │   ...
│   │   └── Calc-Test_P_XXXXX_LEFT_MLO_MASK_1.dcm
│   │
│   └── Train
│       ├── Calc-Train_P_XXXXX_LEFT_MLO_FULL.dcm
│       ...
│       └── Calc-Train_P_XXXXX_RIGHT_CC_MASK_1.dcm
│
└── Mass
    ├── Test
    │   ├── Mass-Test_P_XXXXX_LEFT_CC_FULL.dcm
    │   ...
    │   └── Mass-Test_P_XXXXX_LEFT_MLO_MASK_1.dcm
    │
    └── Train
        ├── Mass-Train_P_XXXXX_LEFT_MLO_FULL.dcm
        ...
        └── Mass-Train_P_XXXXX_RIGHT_CC_MASK_1.dcm
┌─────────────────┐
│ bold : folder │
│ non-bold : file │
└─────────────────┘
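The mapping from an original folder name to a new file path in the tree above can be sketched as a small pure function. The function name and the `kind` argument are hypothetical; the sketch assumes you already know whether a given file is the full scan, the cropped image or the mask.

```python
from pathlib import PurePosixPath

VALID_KINDS = {"FULL", "CROP", "MASK"}


def new_relative_path(folder_name, kind):
    """Build the restructured path for a file from its original top-level folder.

    folder_name: e.g. "Calc-Test_P_00038_LEFT_CC" or "Calc-Test_P_00038_LEFT_CC_1"
    kind: "FULL", "CROP" or "MASK"
    """
    if kind not in VALID_KINDS:
        raise ValueError(f"Unknown kind: {kind!r}")

    # "Calc-Test_..." -> abnormality "Calc", split "Test"
    prefix = folder_name.split("_", 1)[0]
    abnormality, split = prefix.split("-")

    if kind == "FULL":
        filename = f"{folder_name}_FULL.dcm"
    else:
        # Crops and masks live in folders that end with an abnormality number.
        base, number = folder_name.rsplit("_", 1)
        if not number.isdigit():
            raise ValueError(f"Expected a numbered folder, got {folder_name!r}")
        filename = f"{base}_{kind}_{number}.dcm"

    return PurePosixPath(abnormality, split, filename)


print(new_relative_path("Calc-Test_P_00038_LEFT_CC", "FULL"))
print(new_relative_path("Calc-Test_P_00038_LEFT_CC_1", "MASK"))
```

Keeping this step free of file-system calls makes the naming rule easy to check on its own before any files are moved.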
I created a set of helper functions to achieve the organised folder structure above.
- new_name_dcm() reads the .dcm file and renames it from 1-1.dcm or 1-2.dcm to a more descriptive name.
- move_dcm_up() then moves the renamed .dcm file up the nested folder structure into its parent folder.
- delete_empty_folder() then deletes any empty folders (after the recursive naming and moving of .dcm files).
On top of all these, count_dicom() counts the number of .dcm files before and after the restructuring, just to make sure that the two counts are the same. The code snippet below shows the main function that does this restructuring. For details of each helper function, refer to the project’s repository.
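Separately from the project's own implementation, the move-up, clean-up and count steps can be illustrated with a small standard-library sketch. The function names `count_dcm()` and `flatten_case_folder()` are hypothetical and are not the repository's helpers. The demo runs on empty placeholder files in a temporary directory and assumes the file names within one top-level folder are already unique.

```python
import shutil
import tempfile
from pathlib import Path


def count_dcm(root):
    """Count .dcm files anywhere under root."""
    return sum(1 for _ in Path(root).rglob("*.dcm"))


def flatten_case_folder(case_folder):
    """Move nested .dcm files up into case_folder, then drop empty subfolders."""
    case_folder = Path(case_folder)

    # Materialise the list first so moving files does not disturb the walk.
    for dcm in list(case_folder.rglob("*.dcm")):
        if dcm.parent != case_folder:
            shutil.move(str(dcm), str(case_folder / dcm.name))

    # Remove the deepest folders first so their parents become empty in turn.
    subfolders = [p for p in case_folder.rglob("*") if p.is_dir()]
    for sub in sorted(subfolders, key=lambda p: len(p.parts), reverse=True):
        if not any(sub.iterdir()):
            sub.rmdir()


with tempfile.TemporaryDirectory() as tmp:
    case = Path(tmp) / "Calc-Test_P_00038_LEFT_CC_1"
    nested = case / "series_a" / "series_b"  # stand-ins for the long UID folders
    nested.mkdir(parents=True)
    (nested / "1-1.dcm").touch()
    (nested / "1-2.dcm").touch()

    before = count_dcm(tmp)
    flatten_case_folder(case)
    after = count_dcm(tmp)
    remaining = sorted(p.name for p in case.iterdir())

print(before, after, remaining)
```

Comparing the counts before and after is a cheap guard against files being lost or overwritten during the move.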