A young developer has created a deep learning model using Python and convolutional neural networks (CNNs) to detect malaria with 98% accuracy, addressing a critical health issue in Africa amidst the COVID-19 pandemic.
Abstract
In response to the World Health Organization's (WHO) alarming statistics on malaria, which disproportionately affects African populations, a 16-year-old developer has leveraged Python and deep learning techniques to create a model capable of detecting malaria from blood sample images with high accuracy. The project, which employs CNNs and transfer learning with the VGG-19 model, aims to provide an accessible and cost-effective solution for mass testing in rural areas, where specialized medical personnel and resources are scarce. Despite the global focus on COVID-19, the developer's work underscores the importance of continued efforts against other deadly diseases. The model's success, with a 98% accuracy rate, demonstrates the potential for AI to significantly improve healthcare outcomes in resource-limited settings.
Opinions
The author believes that governments should prioritize protecting their citizens by providing effective and cheap ways to identify malaria victims, especially when attention has shifted to COVID-19.
There is an emphasis on the importance of early detection in combating malaria, with scientists agreeing it is key to survival.
The author expresses concern that the situation for malaria victims may worsen due to the reallocation of resources to fight COVID-19.
The author praises the use of CNNs and transfer learning as effective methods for image classification tasks in medical applications.
The author highlights the potential of data augmentation and fine-tuning pre-trained models to improve the accuracy of malaria detection.
The author is optimistic about the role of AI in healthcare, suggesting that models like the one developed can save thousands of lives by enabling mass detection of infected patients without the need for specialized equipment or medical personnel.
How I Helped The WHO Deal With One Of Africa's Deadliest Medical Crises Using Python And Deep Learning
The complete guide on how to combine Python and deep learning to detect whether a person suffers from malaria with 98% accuracy.
“I was truly amazed by how easy it was to read the article despite the fact I know nothing about artificial intelligence, computer science, or medicine!”
According to the World Health Organization (WHO), in 2018 alone there were an estimated 228 million cases of malaria, and roughly 405 000 people lost their lives. Although these numbers are alarming by themselves, what is even more alarming is that 93% of the total cases (213 million) and 94% of the total deaths originated from the African region, and that children under five accounted for 67% of all deaths (272 000).
In fact, malaria is believed to have been eliminated in most parts of the world. The majority (if not all) of the places where malaria is still a cause for concern can be seen in figure 1.0.
In other words, if you live in Europe or North America, there is virtually no chance you have ever been exposed to malaria (at home). At best, you may have heard about it in the news.
Despite the deadliness and severity of this disease, the spotlight has shifted, as it has for most others: COVID-19 has received the entirety of the world’s attention, while diseases such as malaria will not cease to exist simply because a novel deadly virus has emerged.
As mentioned before, a westerner’s exposure to the term malaria is limited, if not non-existent. It is, thus, important to develop a basic understanding of what malaria is and why it is so deadly.
Malaria, also called “jungle fever”, is a mosquito-borne infectious disease that can affect both humans and other animals. The disease is caused by microorganisms that are part of the Plasmodium group. Malaria is most commonly spread by an infected female Anopheles mosquito. The way the infection takes place is the following:
The mosquito bites a potential victim.
The parasites are introduced into the person’s blood through the mosquito’s saliva.
The parasites travel to the liver where they mature and then reproduce.
For more information concerning malaria, see this video:
What is the problem?
Although there are many treatment centers specializing in combatting malaria, the problem is that the citizens of the more rural and distant (from big cities) areas have no effective way of employing mass-testing and treatment.
What is the real problem though? Why did I get involved with detecting malaria now?
As stated before, due to the novel coronavirus, attention has shifted away from this deadly disease. As if that was not enough, African nations have started reallocating resources in order to fight COVID-19. Unfortunately, the resources allocated to this new and deadly fight were previously used in the race against malaria. Even then, the resources were not enough. Medical personnel were not, and still are not, specialized enough, and often lack the knowledge required to detect and treat malaria. This is why a computer model would be suitable: it can perform rapid, mass detection of infected patients without the need for specialized equipment and medical personnel.
Malaria detection is already limited. Limiting it even more will have lethal results. This reasoning led the WHO, on April 23, 2020, to urge countries to move quickly and save the lives of malaria victims in sub-Saharan Africa.
Sadly, it is believed that the fate of malaria victims will only become worse, as COVID-19 continues to evolve and mutate.
Project Manifesto (My Solution)
Watching these events unfold, I realized that things would not get better. If governments are not willing or able to protect their citizens and provide them with effective and cheap ways to find out whether they have malaria, someone else has to.
Remember! Scientists agree that early detection is the most important factor in surviving malaria.
With this in mind, and in response to the WHO’s call for help, in this article I will showcase my progress in building and comparing different models that detect whether a person suffers from malaria.
The model requires a simple image of one’s blood, from which it makes its prediction. This form of testing is preferred because it is easy to implement. There are numerous cheap, portable amateur microscopes that can be found in markets all around Africa. In addition, there are phone-camera extensions that can be used to photograph the same blood samples.
“Truly astonishing! Just by thinking that the author is only 16 years old makes me shiver and contemplate my life choices.”
Key Terms
Before proceeding, it is crucial to become acquainted with certain key terms that will be used throughout this article.
Convolutional Neural Networks (CNNs)
CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons usually mean fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The “fully-connectedness” of these networks makes them prone to overfitting data.
(For more information on CNNs, this article is an excellent resource)
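To make the overfitting remark more concrete, the short sketch below compares the number of trainable parameters in a fully connected layer with that of a convolutional layer applied to the same 125x125 RGB image. The layer sizes (128 units, 32 filters of 3x3) are illustrative assumptions, not values taken from this article.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of a 2D convolution; independent of image size."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# A 125x125 RGB image flattened into a vector (size assumed for illustration).
n_inputs = 125 * 125 * 3

print(dense_params(n_inputs, 128))   # 6000128
print(conv2d_params(3, 3, 3, 32))    # 896
```

Because a convolution reuses the same small kernel across the whole image, it needs far fewer parameters than a dense layer looking at every pixel, which is one reason CNNs are less prone to overfitting on images.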
Transfer Learning
The definition used for transfer learning has been taken from this article.
Transfer learning is an approach in deep learning (and machine learning) where knowledge is transferred from one model to another.
Def: Model A is successfully trained to solve source task T.a using a large dataset D.a. However, the dataset D.b for a target task T.b is too small, preventing Model B from training efficiently. Thus, we use part of model A to predict results for task T.b.
A common misconception is that training and testing data should come from the same source or have the same distribution.
Using transfer learning, we are able to solve a particular task using all or part of a model that has already been pre-trained on a different task.
A magnificent explanation of transfer learning can be found in the YouTube video by Andrew Ng below:
VGG-19 Model
VGG is a convolutional neural network architecture proposed in 2014 by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at the University of Oxford. VGG-19 is the variant with 19 weight layers (16 convolutional and 3 fully connected). The architecture was submitted to the ImageNet Large Scale Visual Recognition Challenge 2014 (ILSVRC2014), where the model achieved 92.7% top-5 test accuracy on ImageNet. ImageNet is one of the largest datasets available, with more than 14 million images hand-annotated with what each picture shows.
Image Augmentation
Image augmentation artificially creates additional training images by applying one or more transformations, such as random rotations, shifts, shears, and flips, to the existing ones.
Overfitting
In statistics, overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may, therefore, fail to fit additional data or predict future observations reliably”.[1] An overfitted model is a statistical model that contains more parameters than can be justified by the data.[2] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented the underlying model structure.
Epochs
An epoch is a term used in machine learning for one complete pass through the entire training dataset. The number of epochs is the number of such passes the learning algorithm completes during training.
Batch Size
Batch size is a term used in machine learning and refers to the number of training examples utilized in one iteration.
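The relationship between the two terms can be sketched with a little arithmetic. The numbers below (1,000 training images, batches of 64, 25 epochs) are hypothetical and chosen only for illustration.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches (iterations) needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


def total_iterations(n_samples, batch_size, epochs):
    """Weight updates performed over the whole training run."""
    return steps_per_epoch(n_samples, batch_size) * epochs


# Hypothetical values: 1,000 training images, batches of 64, 25 epochs.
print(steps_per_epoch(1000, 64))        # 16
print(total_iterations(1000, 64, 25))   # 400
```

The last batch of each epoch is simply smaller when the dataset size is not a multiple of the batch size.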
Creating the Solution
Methods Presented
In this article, I will be testing three different deep learning techniques, in order to detect malaria.
Method 1: Convolutional Neural Network (CNN)
Method 2: Transfer learning with frozen pre-trained CNN
Method 3: Fine-tuned pre-trained CNN with image augmentation
Preparing the dataset and work environment
First, a supported version of Python needs to be installed. To do so, navigate to this link and follow the instructions for the operating system of choice.
I will be using Python 3.6.9 and Ubuntu 18.04.4 LTS as my operating system. Nevertheless, all supported Python versions are welcome.
Before proceeding with installing the required libraries, pip must also be installed. (pip is bundled with Python 2.7.9 and later, and with Python 3.4 and later, but if it is not already installed, follow this guide.)
Libraries
The following libraries should be installed with pip:
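The exact list is not reproduced in this text, so the sketch below works from an assumed one: pandas, cv2 (OpenCV) and TensorFlow are mentioned in this article, while numpy, matplotlib and scikit-learn are guesses at typical companions. It checks which of them are missing and prints the matching pip command.

```python
import importlib.util

# Assumed list: import name -> pip package name. pandas, cv2 and TensorFlow are
# mentioned in this article; the others are guesses at typical companions.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "cv2": "opencv-python",
    "sklearn": "scikit-learn",
    "tensorflow": "tensorflow",
}


def missing_packages(required):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```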
Having the right dataset is undoubtedly one of the most important aspects of any data science project. In this case, the dataset we need is an archive of images of human blood samples, each classified as either infected or healthy.
The way you acquire such a dataset depends entirely on your preference. In this instance, I will be using a Kaggle dataset containing a total of 27,558 classified images, found here.
Coding
Now that both the libraries and dataset are set up, it is time to begin the actual coding of the model (I will be using a Jupyter notebook).
I will begin by importing all necessary libraries and dependencies:
Now that everything has been imported, I will load the dataset.
It appears that both folders (“Parasitized” and “Uninfected”), have 13,779 images each.
I will be working with pandas data frames, so I will create a data frame named “data” with two columns, “filename” and “label”.
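A minimal sketch of this step, using only the standard library, could look like the following. The folder name "cell_images" and the .png extension are assumptions about how the dataset is laid out on disk, and the folder names are used directly as labels.

```python
from pathlib import Path


def collect_records(root, labels=("Parasitized", "Uninfected")):
    """List every .png image under root/<label>/ as a filename/label record."""
    records = []
    for label in labels:
        folder = Path(root, label)
        if not folder.is_dir():
            continue  # skip silently if the dataset is not where we expect it
        for path in sorted(folder.glob("*.png")):
            records.append({"filename": str(path), "label": label})
    return records


# "cell_images" is an assumed name for the extracted dataset folder.
records = collect_records("cell_images")
print(len(records))
# With pandas installed, the data frame described above would be:
#   data = pandas.DataFrame(records)
```

Filtering on the extension also keeps stray non-image files out of the file list.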
All deep learning models require a training set and a testing set. In this case, we are also going to add a validation set. Hence, 70% of the data will be used for training, 20% for testing, and 10% for validation.
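The 70/20/10 split can be sketched with the standard library alone. This is not necessarily how the original notebook does it (a helper such as scikit-learn's train_test_split is a common alternative); the seed and the function name are illustrative choices.

```python
import random


def split_dataset(items, train_frac=0.7, test_frac=0.2, seed=42):
    """Shuffle items and split them into train, test and validation lists.

    Whatever is left after the train and test shares becomes validation.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val


train, test, val = split_dataset(range(27558))
print(len(train), len(test), len(val))   # 19290 5511 2757
```

Shuffling before splitting matters here, because the files are listed folder by folder and an unshuffled split would put almost all images of one class into the training set.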
Due to the nature of the images supplied to the model, a crucial problem arises. Users will supply pictures they have taken themselves, and these pictures will vary in size, orientation, etc. This can be resolved by using handy libraries such as “cv2”.
A desirable image size would be 125x125 pixels. I will be using parallel processing in order to speed up the loading and resizing of each image.
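One way to sketch this step is with a thread pool from the standard library and OpenCV's imread and resize functions. The function names, the number of workers and the interpolation mode are assumptions; the original notebook may parallelize the work differently.

```python
from concurrent.futures import ThreadPoolExecutor

IMG_SIZE = (125, 125)  # target width and height in pixels


def load_and_resize(path, size=IMG_SIZE):
    """Read one image with OpenCV and resize it to a fixed shape."""
    import cv2  # imported here so the rest of the sketch runs without OpenCV

    img = cv2.imread(path)  # returns None if the file cannot be read
    if img is None:
        raise ValueError(f"Could not read image: {path}")
    return cv2.resize(img, dsize=size, interpolation=cv2.INTER_CUBIC)


def process_all(paths, worker=load_and_resize, max_workers=8):
    """Apply worker to every path in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))
```

Calling process_all on a list of file names returns the resized images in the same order as the input, so they stay aligned with their labels.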
Every single image should now have been resized to 125x125. Plotting the data is always a good idea, so let’s visualize a sample of the dataset.
Certain patterns can be easily observed between healthy and infected blood cells. The models to be constructed should be able to properly identify the core differences between an infected and a healthy cell, and classify them.
Before doing so, some basic settings should be set up (“BATCH_SIZE” and “EPOCHS” can be changed in order to reach higher accuracy).
By completing this step, we have set the image dimensions, the number of epochs, and the batch size, and have encoded the categorical class labels.
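As a rough sketch of what these settings and the label encoding might look like: the values of BATCH_SIZE and EPOCHS and the label names below are hypothetical, and the encoder is a plain-Python stand-in that assigns integers in sorted order, which is also what scikit-learn's LabelEncoder does.

```python
# Hypothetical settings; the values used in the original notebook are not shown.
INPUT_SHAPE = (125, 125, 3)   # height, width, colour channels
BATCH_SIZE = 64
EPOCHS = 25


def encode_labels(labels):
    """Map each distinct class label to an integer, in sorted order."""
    classes = sorted(set(labels))
    mapping = {name: index for index, name in enumerate(classes)}
    return [mapping[name] for name in labels], mapping


encoded, mapping = encode_labels(["malaria", "healthy", "malaria"])
print(mapping)   # {'healthy': 0, 'malaria': 1}
print(encoded)   # [1, 0, 1]
```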
Model 1: Convolutional Neural Network (CNN)
To begin creating the first model, I will be defining the model’s architecture.
First model’s architecture
As can be seen above, the model consists of three convolution and pooling layers, two dense layers, and dropouts used for regularization.
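Since the architecture listing itself is not reproduced in this text, here is a hedged Keras sketch that matches the description: three convolution and pooling blocks, two dense layers, and dropout. The filter counts, dense sizes, dropout rate and optimizer are assumptions, and TensorFlow 2.x is required to actually build the model.

```python
INPUT_SHAPE = (125, 125, 3)
CONV_FILTERS = (32, 64, 128)   # assumed filter counts, one per conv block
DENSE_UNITS = (512, 512)       # assumed sizes of the two dense layers


def pooled_size(size, n_blocks, pool=2):
    """Side length of the feature map after n_blocks 2x2 max-pooling steps."""
    for _ in range(n_blocks):
        size //= pool
    return size


def build_cnn(dropout=0.3):
    """Three conv+pool blocks, two dense layers with dropout, sigmoid output."""
    from tensorflow import keras  # imported lazily; requires TensorFlow 2.x

    model = keras.Sequential()
    model.add(keras.Input(shape=INPUT_SHAPE))
    for filters in CONV_FILTERS:
        model.add(keras.layers.Conv2D(filters, (3, 3), padding="same",
                                      activation="relu"))
        model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(keras.layers.Flatten())
    for units in DENSE_UNITS:
        model.add(keras.layers.Dense(units, activation="relu"))
        model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


print(pooled_size(INPUT_SHAPE[0], len(CONV_FILTERS)))   # 15
```

With these assumed settings, each pooling step halves the image side (125, 62, 31, 15), so the flattened vector fed to the dense layers has 15 x 15 x 128 values.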
It is now time to train the model:
Training the first model
In order to have a clear perspective of the model’s progress and accuracy, I will be plotting its accuracy and loss curves.
First model’s performance
It appears that although the accuracy on the training data is pretty high, there is some overfitting as well. Nevertheless, I will be saving the model in order to use it later.
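Overfitting of this kind can also be read off the training history numerically. The helper below is a small sketch that assumes a dictionary shaped like the History.history attribute returned by Keras' model.fit (TensorFlow 2.x key names); the numbers are made up for illustration and are not the results of this model.

```python
def accuracy_gap(history):
    """Training minus validation accuracy for each epoch.

    history is a dict shaped like the History.history attribute that
    Keras returns from model.fit (TensorFlow 2.x key names assumed).
    """
    return [round(train - val, 4)
            for train, val in zip(history["accuracy"],
                                  history["val_accuracy"])]


# Made-up numbers, for illustration only.
history = {
    "accuracy":     [0.90, 0.95, 0.98, 0.99],
    "val_accuracy": [0.89, 0.94, 0.95, 0.95],
}
print(accuracy_gap(history))   # [0.01, 0.01, 0.03, 0.04]
```

A gap that keeps widening while validation accuracy stalls is the typical signature of overfitting.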
Model 2: Transfer learning with frozen pre-trained CNN
I will be using TensorFlow to import VGG-19 and freeze its convolution blocks so that they act as a feature extractor. Dense layers will be added at the end to perform the classification.
At the moment, there are 28 layers in total, of which 6 are trainable. I will use the same settings as for the first model and train it to view the results.
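The sketch below shows one way such a model could be assembled with tf.keras. The head (flatten, then two dense layers with dropout, then a sigmoid output) is an assumption, chosen because it would be consistent with the figures quoted above: the VGG-19 base without its top has 22 layers, six head layers bring the total to 28, and the three dense layers contribute six trainable weight tensors. Reading the 6 as weight tensors rather than layers is an interpretation, not something the article states.

```python
# Layer bookkeeping for a frozen VGG-19 base plus a small dense head.
VGG19_BASE_LAYERS = 1 + 16 + 5    # input + 16 conv + 5 pooling layers (no top)
HEAD_LAYERS = ["flatten", "dense", "dropout", "dense", "dropout", "dense"]
TOTAL_LAYERS = VGG19_BASE_LAYERS + len(HEAD_LAYERS)
# Each dense layer contributes a kernel and a bias tensor.
TRAINABLE_TENSORS = 2 * HEAD_LAYERS.count("dense")


def build_frozen_vgg19(input_shape=(125, 125, 3), dropout=0.3):
    """VGG-19 as a frozen feature extractor with a trainable dense head."""
    from tensorflow import keras  # imported lazily; requires TensorFlow 2.x

    base = keras.applications.VGG19(include_top=False, weights="imagenet",
                                    input_shape=input_shape)
    base.trainable = False  # freeze every convolution block

    x = keras.layers.Flatten()(base.output)
    x = keras.layers.Dense(512, activation="relu")(x)
    x = keras.layers.Dropout(dropout)(x)
    x = keras.layers.Dense(512, activation="relu")(x)
    x = keras.layers.Dropout(dropout)(x)
    output = keras.layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs=base.input, outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


print(TOTAL_LAYERS, TRAINABLE_TENSORS)   # 28 6
```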
Second model’s performance
It appears that the second model does not overfit as much as the first. At the same time, its accuracy is slightly lower.
Model 3: Fine-tuned pre-trained CNN with image augmentation
In this model, I will be fine-tuning the last two blocks of the VGG-19 model. Some image augmentation will also take place in order to create altered versions of the original images and reach better results (the validation dataset will not be augmented, as it is used to evaluate the model’s performance after each epoch).
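A hedged sketch of such an augmentation setup with Keras' ImageDataGenerator follows. The specific ranges are assumptions, not the values from the original notebook; note that the validation generator only rescales pixel values.

```python
# Assumed augmentation settings; every key is a real ImageDataGenerator argument.
TRAIN_AUGMENTATION = {
    "rescale": 1.0 / 255,
    "rotation_range": 25,        # degrees
    "width_shift_range": 0.05,   # fraction of image width
    "height_shift_range": 0.05,  # fraction of image height
    "shear_range": 0.05,
    "zoom_range": 0.05,
    "horizontal_flip": True,
    "fill_mode": "nearest",
}
# Validation images are only rescaled, never augmented.
VAL_AUGMENTATION = {"rescale": 1.0 / 255}


def make_generators():
    """Create the training and validation image generators."""
    # Imported lazily; requires TensorFlow 2.x.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    return (ImageDataGenerator(**TRAIN_AUGMENTATION),
            ImageDataGenerator(**VAL_AUGMENTATION))
```

Each generator's flow method then yields batches from in-memory arrays of images and labels, applying the random transformations on the fly for the training set.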
Let's view some of the augmented images:
Augmented images sample
The differences from the original pictures are obvious. To continue, I will build the model while making sure that the last two blocks are trainable.
I will now make some final alterations (reducing the learning rate, etc.) and train the final model.
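The fine-tuning setup can be sketched as follows. Keras names the VGG-19 layers by block (for example "block5_conv1"), which makes it easy to unfreeze only the last two blocks; the dense head, the RMSprop optimizer and the learning rate of 1e-5 are assumptions rather than the notebook's actual choices.

```python
TRAINABLE_BLOCKS = ("block4", "block5")   # the last two VGG-19 blocks


def is_trainable(layer_name, blocks=TRAINABLE_BLOCKS):
    """True if a Keras VGG-19 layer name belongs to a block to fine-tune."""
    return layer_name.startswith(blocks)


def build_finetuned_vgg19(input_shape=(125, 125, 3), learning_rate=1e-5):
    """VGG-19 with its last two blocks unfrozen and a low learning rate."""
    from tensorflow import keras  # imported lazily; requires TensorFlow 2.x

    base = keras.applications.VGG19(include_top=False, weights="imagenet",
                                    input_shape=input_shape)
    for layer in base.layers:
        layer.trainable = is_trainable(layer.name)

    x = keras.layers.Flatten()(base.output)
    x = keras.layers.Dense(512, activation="relu")(x)
    x = keras.layers.Dropout(0.3)(x)
    x = keras.layers.Dense(512, activation="relu")(x)
    x = keras.layers.Dropout(0.3)(x)
    output = keras.layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs=base.input, outputs=output)
    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=learning_rate),
        loss="binary_crossentropy", metrics=["accuracy"])
    return model


print(is_trainable("block5_conv1"), is_trainable("block3_conv2"))   # True False
```

A small learning rate is used so that the pre-trained weights in the unfrozen blocks are adjusted gently instead of being overwritten.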
Third model’s performance
The training of all three models has completed successfully. In order to evaluate their accuracy and F1 scores, I will be using a snippet of open-source third-party code from GitHub. The author of the code is “DIP”, and it can be found here.
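For readers who want to see what these two metrics measure, here is a minimal plain-Python sketch. It is not the third-party snippet mentioned above, and it computes the F1 score for the positive (infected) class only, whereas that snippet may average over classes differently. The labels are toy values.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy and F1 score for binary labels, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# Toy labels (1 = infected, 0 = healthy), for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(accuracy_and_f1(y_true, y_pred))   # (0.75, 0.75)
```

Accuracy counts every correct prediction, while the F1 score balances precision and recall, which matters when missing an infected sample is costlier than a false alarm.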
Results / Conclusion
After creating and testing all three models, I have come to certain conclusions.
The first model (a basic CNN) presented an accuracy and F1 score of 95.95%, but significant overfitting was observed.
The second model (a VGG-19, frozen pre-trained model) presented an accuracy and F1 score of 94.87%.
Finally, the third model (a VGG-19 fine-tuned model) presented an accuracy and F1 score of 98.0%.
In simple terms, the best of the three managed to classify a blood sample as infected or not with 98% accuracy. This is more than desirable, as it suggests that such a model could rival, or even outperform, manual classification by doctors. It is also worth mentioning that the blood-sample images were of different sizes, quality, etc. Resizing brought every image to a common shape, and data augmentation exposed the model to altered versions of the images during training, which helped it classify them reliably.
To conclude, it becomes apparent that the lives that can be saved by the mass adoption of such models number in the thousands, if not millions. It is, thus, crucial that researchers keep improving on previous models and increasing their accuracy.