
Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening
- We created a large dataset of mammograms, the NYU Breast Cancer Screening Dataset, consisting of over 1,000,000 mammographic images with accompanying cancer labels and, where applicable, lesion segmentations.
- We designed and trained a novel two-phase model for breast cancer screening that performs on par with expert radiologists in identifying breast cancer using screening mammograms.
- We are making our paper, a technical report explaining our data, and our code and trained models publicly available.
Breast cancer is the second leading cancer-related cause of death among women in the US. Early detection, through routine annual screening mammography, is the best first line of defense against breast cancer. However, these screening mammograms require expert radiologists (i.e. physicians who completed residency training specifically in image interpretation) to pore over extremely high-resolution images, looking for the finest traces of cancerous or suspicious lesions. A radiologist can spend up to 10 hours a day working through these mammograms, in the process experiencing both eye-strain and mental fatigue.
Modern computer vision models, built principally on Convolutional Neural Networks (CNNs), have seen incredible progress in recent years. Breast cancer screening thus presents an obvious and important opportunity for the application of these novel computer vision methods. However, to apply deep learning techniques effectively to the problem of breast cancer screening, two important challenges need to be addressed.
Challenges
Firstly, deep learning models need to be trained on large amounts of data to be effective. For example, the ImageNet training set contains millions of images. In contrast, the most commonly used public mammography data set, the Digital Database for Screening Mammography (DDSM), contains only about ten thousand images.

Secondly, mammographic images have extremely high resolutions. For instance, models tackling ImageNet often handle images of size 256x256. In contrast, mammogram image sizes are on the order of 2000x2000, which is nearly two orders of magnitude more pixels than typical CNNs work with. It would be impractical to train, say, a ResNet-152 on images of this size on commercial-grade GPUs, as the activations and gradients would struggle to fit in GPU memory. In addition, our prior work has shown that downsampling such images can meaningfully hurt performance because of the loss of visual detail, so downsampling is not a viable alternative.
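To put the size gap in numbers, the short calculation below compares pixel counts for the two nominal sizes quoted above. The sizes are the round figures from the text rather than exact mammogram dimensions, and the memory figure is only an illustrative assumption: a single hypothetical 64-channel float32 feature map kept at full resolution.

```python
# Nominal image sizes taken from the text; real mammograms vary in size.
imagenet_side = 256
mammogram_side = 2000

imagenet_pixels = imagenet_side ** 2      # 65,536
mammogram_pixels = mammogram_side ** 2    # 4,000,000
pixel_ratio = mammogram_pixels / imagenet_pixels

# Illustrative assumption: one full-resolution feature map with 64 channels
# stored as 4-byte floats. Real networks downsample early, so this gives a
# feel for scale only, not a measurement of any particular model.
channels = 64
bytes_per_float = 4
feature_map_mib = mammogram_pixels * channels * bytes_per_float / 2 ** 20

print(f"pixel ratio: {pixel_ratio:.1f}x")
print(f"one 64-channel float32 feature map: {feature_map_mib:.0f} MiB")
```

Training also has to keep many such activations, plus their gradients, for every image in a batch, which is why memory becomes the bottleneck.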

The work of our team addresses both of these issues. Let’s start with the need for more data.
The NYU Breast Cancer Screening Dataset (BCSD)
In collaboration with the Department of Radiology at NYU Langone Health, we built a massive dataset of anonymous screening mammograms, specifically with the training of deep learning models in mind. Our full dataset consists of 229,426 screening mammography exams, from 141,473 patients. Each exam consists of at least four images, corresponding to the four standard mammographic views. This dataset is enormous by medical imaging standards, let alone mammography.

Within this dataset, about 7,000 exams were matched with corresponding biopsies (referred to as the “biopsied population”, compared to the above “screening population”). Biopsies are surgical procedures that definitively establish the presence or absence of cancer. We parsed the biopsy report for each exam to extract labels indicating the presence of cancer in each breast. Furthermore, we asked 12 radiologists to hand-annotate the location of biopsied findings on each mammographic image, where applicable. This gives us pixel-level labels of where the malignant and benign findings are located.

This dataset forms the foundation for all the experiments in our work, and more details about the creation, processing, extraction and statistics of the dataset can be found in our technical report.
With the data limitation out of the way, we can get to the modeling.
Deep Learning for Breast Cancer Screening

We use a multi-view CNN, which takes as input the four mammographic images. The model consists of four custom ResNet-22 columns — one for each mammographic view — followed by two fully connected layers that output four separate prediction labels: the presence of benign and malignant findings, in left and right breasts. In total, this model has about 6 million trainable parameters. We refer to this model as the “image-only” model, for reasons that will soon become apparent.
We train the model on a mix of data from the exams matched with biopsies, which have explicit cancer labels from biopsy reports, as well as exams not matched with biopsies, which we treat as the absence of cancer. The model is trained to predict all four labels for each exam.
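As a rough illustration of the labeling scheme just described, the sketch below builds the four binary targets for one exam. The function name, the label names, and the dictionary format for parsed biopsy findings are all hypothetical. Setting every label to zero for exams without a matched biopsy is our reading of "the absence of cancer" in the text.

```python
from typing import Dict, Optional

# Hypothetical names for the four prediction targets described in the text.
LABEL_NAMES = ("left_benign", "right_benign", "left_malignant", "right_malignant")

def exam_labels(biopsy_findings: Optional[Dict[str, bool]]) -> Dict[str, int]:
    """Build the four binary targets for one exam.

    `biopsy_findings` is a hypothetical dict parsed from a biopsy report,
    e.g. {"left_malignant": True}. None means the exam was not matched
    with a biopsy; we assume all four labels are then 0.
    """
    findings = biopsy_findings or {}
    return {name: int(bool(findings.get(name, False))) for name in LABEL_NAMES}

# One biopsied exam and one screening-only exam.
biopsied = exam_labels({"left_malignant": True, "right_benign": True})
screening_only = exam_labels(None)
```

Because each target is an independent binary label, a single exam can be positive for both benign and malignant findings, in either breast.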
Although our dataset of mammograms is large, it contains relatively few positive examples of cancer compared to negative examples, so training a model from scratch with such a limited training signal can be problematic. We take a quick detour to address this issue.
Transfer Learning with BI-RADS Classification
Although we cannot use pretrained image classification models for our task, we can still apply transfer learning to cancer classification. Building on our prior work, we pretrain a model for a similar task: BI-RADS classification.
In mammography, BI-RADS labels are a standardized radiologist assessment of a patient’s chance of having breast cancer based on a mammogram. Being primarily an estimate of risk used to guide further workup and testing (if indicated) rather than a diagnosis, BI-RADS categories are a much noisier training signal than cancer labels from biopsies. Nevertheless, they correspond roughly to the downstream problem of cancer classification, allowing the model to learn to identify exams suspicious for malignancy, exams which contain clearly benign findings and exams which are clearly normal.

The architecture of our BI-RADS classification model is identical to that of our cancer classification model, except the final layer outputs BI-RADS categories. After training the model on BI-RADS classification, we then take the weights from the ResNet-22 columns and use them to initialize our cancer classification model.
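The weight hand-off can be sketched in a framework-agnostic way. In the example below, plain dictionaries of lists stand in for a deep learning framework's parameter dictionaries. The parameter names and the `columns.` prefix are hypothetical, chosen only to show that column weights are copied while the output layer keeps its fresh initialization.

```python
from typing import Dict, List

def init_from_pretrained(
    target: Dict[str, List[float]],
    pretrained: Dict[str, List[float]],
    prefix: str = "columns.",
) -> Dict[str, List[float]]:
    """Return a copy of `target` with column weights taken from `pretrained`.

    Parameters whose names start with `prefix` are copied over; everything
    else (for example the output layers) keeps its original values.
    """
    merged = {name: list(values) for name, values in target.items()}
    for name, values in pretrained.items():
        if name.startswith(prefix) and name in merged:
            merged[name] = list(values)
    return merged

# Toy parameters: the BI-RADS head has 3 outputs, the cancer head has 4.
birads_weights = {"columns.l_cc.conv1": [0.5, -0.2], "head.fc": [0.9, 0.1, 0.3]}
cancer_weights = {"columns.l_cc.conv1": [0.0, 0.0], "head.fc": [0.0, 0.0, 0.0, 0.0]}
cancer_init = init_from_pretrained(cancer_weights, birads_weights)
```

Filtering by name prefix is one simple way to skip the final layer, whose shape differs between the two tasks.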
With transfer learning from BI-RADS classification, we can already train a fairly accurate breast cancer classification model. However, this comes at the cost of using a model with less capacity, with a shallower ResNet for processing visual features. Is there a way that we can take advantage of a deeper model while still training on full-resolution mammograms?
Clearly, something has to give. In our case, we make the following compromise: What if we give up end-to-end training?
Patch-level Cancer Classification
Remember the pixel-level segmentations? They could serve as an extremely powerful training signal, telling us exactly where a suspicious change is located in a mammogram. While we could directly train a localization model for identifying cancers, here we’ll target something much simpler: classifying patches.
From our large dataset of mammograms, we randomly sample 256x256 patches from the full mammogram images. If a sampled patch overlaps with an annotated lesion, we assign the patch the corresponding label, malignant or benign; otherwise, it receives a negative label. Then, we train a model to classify these patches.
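A minimal sketch of this sampling and labeling step follows, using nested lists as binary lesion masks. The function names are hypothetical. Treating any single overlapping pixel as an overlap, and letting malignant take priority when both masks overlap, are assumptions made for illustration rather than the exact rules used in our pipeline.

```python
import random
from typing import List, Optional, Tuple

Mask = List[List[int]]  # 1 where a lesion was annotated, 0 elsewhere

def mask_overlaps(mask: Mask, top: int, left: int, size: int) -> bool:
    """True if any annotated pixel falls inside the square patch."""
    return any(any(row[left:left + size]) for row in mask[top:top + size])

def sample_patch(
    height: int,
    width: int,
    malignant: Mask,
    benign: Mask,
    size: int = 256,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int, str]:
    """Sample one patch location uniformly and label it from the masks."""
    rng = rng or random.Random()
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    if mask_overlaps(malignant, top, left, size):
        label = "malignant"  # assumption: malignant wins if both overlap
    elif mask_overlaps(benign, top, left, size):
        label = "benign"
    else:
        label = "negative"
    return top, left, label
```

Passing a seeded `random.Random` instance as `rng` makes the sampled locations reproducible.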

With 256x256-sized patches, we can not only use a model with higher capacity but also use off-the-shelf models pretrained on ImageNet. We experimented with a large number of existing models, and found DenseNet-121 to perform best. Because we’re sampling small patches from large mammograms, we can sample and train on a very large number of them: we end up training the model for 1,000 epochs on 5,000,000 sampled patches. This model performs exceedingly well on this patch-classification task.
Importantly, the patch-classification model has a significant limitation: by operating on small patches of the full mammograms, it lacks information from the full context of a mammogram. In practice, radiologists routinely make clinical determinations based on whole breast evaluation comparing tissue in different regions of a breast, different mammographic views of the breast or even between breasts. The patch classification model, while having extremely high capacity, is constrained to use very local features. It effectively misses the forest for the trees.
In contrast, our “image-only” model was able to use information from all parts of a mammogram, but had far less capacity. So is there a way we can combine the benefits of both?
Patch-Classification Heatmaps: Seeing both the forest and the trees
Since our patch-level classifier is so good at classifying mammogram patches, why don’t we try using its output as an input to the “image-only” model?
We do just that: we apply the (trained and frozen) patch classifier in a sliding window fashion across the entire high-resolution mammogram, generating a sort of “heatmap” of predictions. We extract heatmaps for both the benign and malignant predictions from the patch classifier.
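The sliding-window idea can be sketched as follows. Here `classify_patch` is a hypothetical stand-in for the frozen patch classifier, and the stride value is an assumption, since the text does not specify one. The sketch also ignores any border region that a full window cannot cover.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]
PatchClassifier = Callable[[List[List[float]]], Tuple[float, float]]

def patch_heatmaps(
    image: Image,
    classify_patch: PatchClassifier,
    patch_size: int = 256,
    stride: int = 128,  # assumed value, not taken from the text
) -> Tuple[List[List[float]], List[List[float]]]:
    """Slide a window over `image` and collect per-patch predictions.

    `classify_patch` must return (benign_score, malignant_score) for one
    patch. The result is one coarse 2D grid of scores per class.
    """
    height, width = len(image), len(image[0])
    benign_map, malignant_map = [], []
    for top in range(0, height - patch_size + 1, stride):
        benign_row, malignant_row = [], []
        for left in range(0, width - patch_size + 1, stride):
            patch = [list(row[left:left + patch_size])
                     for row in image[top:top + patch_size]]
            benign, malignant = classify_patch(patch)
            benign_row.append(benign)
            malignant_row.append(malignant)
        benign_map.append(benign_row)
        malignant_map.append(malignant_row)
    return benign_map, malignant_map

def toy_classifier(patch: List[List[float]]) -> Tuple[float, float]:
    """Toy stand-in: bright patches count as 'malignant'."""
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return 1.0 - mean, mean

toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
benign_map, malignant_map = patch_heatmaps(toy_image, toy_classifier, patch_size=2, stride=2)
```

In the toy example, only the top-right patch is bright, so it is the only cell of the malignant grid with a high score.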




