Things they don’t tell you about installing Mask R-CNN for custom datasets!

Mask R-CNN is a powerful deep learning model widely utilized for instance segmentation tasks. It excels at detecting objects and precisely segmenting them at the pixel level. The model builds upon the Faster R-CNN architecture, seamlessly integrating object detection with instance segmentation.
There are several tutorials and repositories on the Internet about Mask R-CNN, including a widely used open-source implementation from Matterport (the model itself was introduced by researchers at Facebook AI Research). The repository contains the code and pre-trained models for Mask R-CNN. I used this repository to build my custom model.
When I was creating my own Mask R-CNN model for custom datasets, the hardest part was building a virtual environment in which all the dependencies were compatible with one another. I spent many hours trying to work out what was failing, when all I knew was that some package versions conflicted. After trying several combinations, I found a set that works together.
For reference, my machine runs Windows 11 with an NVIDIA GPU (driver 528.02) and 20 GB of GPU memory. Before proceeding, make sure that you have installed CUDA and cuDNN on your computer. I used CUDA 10.0 and cuDNN 7.4.
I created a virtual environment with Python 3.7.11. I tried several Python versions, and this is the one that worked for me. Then I installed all the required dependencies. The original requirements.txt from Matterport does not pin versions, so pip may install releases that conflict with one another. If you are using Python 3.7.11, you can install the following dependencies:
- tensorflow==2.2.0
- keras==2.3.1
- numpy==1.20.3
- scipy==1.4.1
- pillow==8.4.0
- cython==0.29.24
- scikit-image==0.16.2
- matplotlib
- opencv-python==4.5.4.60
- h5py==2.10.0
- imgaug==0.4.0
- IPython[all]
Even after these dependencies were installed, I still got an error related to protobuf. To fix it, I had to install protobuf 3.8. After that, the environment was compatible with Mask R-CNN.
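To catch a version drift before it shows up as a cryptic import error, you can compare what is installed against the pins. The following is a minimal sketch, not part of the Matterport repository: the `PINNED` table simply copies the versions listed above plus the protobuf fix, and the helper names `installed_version` and `find_mismatches` are my own for illustration.

```python
# Pins copied from the dependency list, plus the protobuf fix. matplotlib and
# IPython are left out because no version is given for them.
PINNED = {
    "tensorflow": "2.2.0",
    "keras": "2.3.1",
    "numpy": "1.20.3",
    "scipy": "1.4.1",
    "pillow": "8.4.0",
    "cython": "0.29.24",
    "scikit-image": "0.16.2",
    "opencv-python": "4.5.4.60",
    "h5py": "2.10.0",
    "imgaug": "0.4.0",
    "protobuf": "3.8",
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        import pkg_resources  # comes with setuptools; fallback for Python 3.7

        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """List (package, wanted, found) for every pin that is not satisfied."""
    problems = []
    for package, wanted in pins.items():
        found = lookup(package)
        # Prefix match, so that a pin of "3.8" also accepts "3.8.0".
        if found is None or not (found == wanted or found.startswith(wanted + ".")):
            problems.append((package, wanted, found))
    return problems


if __name__ == "__main__":
    for package, wanted, found in find_mismatches(PINNED):
        print(f"{package}: wanted {wanted}, found {found or 'not installed'}")
```

Run inside the activated environment, the script prints nothing when every pin is satisfied and one line per package otherwise.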
The other thing you need to train a Mask R-CNN model is a dataset. Split your dataset into training, validation, and testing sets; a 9:1 ratio between training and validation is a common choice. Annotate both the training and validation sets using an annotation tool like VIA or Makesense. I have a guided article for creating annotations using makesense.ai here. I loaded the annotations from JSON files in the code and built the model for one class (my class plus background). If you do not have enough images, you can augment them in Python by applying shear, rotation, scaling, and similar transforms. I had 100 images, and I augmented them to 400 for my experiment. You can find my code here.
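The 9:1 split can be sketched with the standard library alone. This is an illustration rather than the code linked above: the function name `split_dataset`, the fixed seed, and the file names are all made up for the example.

```python
import random


def split_dataset(filenames, val_fraction=0.1, seed=42):
    """Shuffle filenames reproducibly and return (train, val) lists."""
    names = sorted(filenames)  # sort first so the seed alone decides the order
    random.Random(seed).shuffle(names)
    n_val = max(1, round(len(names) * val_fraction))
    return names[n_val:], names[:n_val]


# Hypothetical file names standing in for the 400 augmented images.
images = [f"img_{i:03d}.jpg" for i in range(400)]
train, val = split_dataset(images)
print(len(train), len(val))  # 360 40
```

Fixing the seed keeps the split stable between runs, so an image never moves from the validation set into the training set when you retrain.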
To speed up training and improve performance, you can start from a Mask R-CNN model pre-trained on a large-scale dataset such as COCO (Common Objects in Context). Transfer the weights of the pre-trained model to your custom model, excluding the classification head. You can download the pre-trained COCO weights from here. Once you have trained your model, you can use the resulting weights as the starting point for the next training run.
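As a rough sketch of the weight transfer, the helper below wraps the `load_weights(filepath, by_name, exclude)` method of the Matterport `MaskRCNN` class. The helper name `load_starting_weights` is hypothetical. The layer names are the ones the Matterport sample code excludes when starting from COCO; note that they cover the box and mask heads as well as the class logits, since all of them depend on the number of classes.

```python
# Output layers whose shapes depend on the number of classes (layer names as
# used in the Matterport code base).
HEAD_LAYERS = ["mrcnn_class_logits", "mrcnn_bbox_fc", "mrcnn_bbox", "mrcnn_mask"]


def load_starting_weights(model, weights_path, from_coco=True):
    """Load weights into a Matterport-style model object.

    from_coco=True  -> skip the class-specific heads so they are trained fresh.
    from_coco=False -> load every layer, e.g. to resume from your own weights.
    """
    if from_coco:
        model.load_weights(weights_path, by_name=True, exclude=HEAD_LAYERS)
    else:
        model.load_weights(weights_path, by_name=True)


# Usage with the Matterport package (not run here; config and paths are placeholders):
#   import mrcnn.model as modellib
#   model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")
#   load_starting_weights(model, "mask_rcnn_coco.h5")
```

Passing `from_coco=False` covers the second case in the paragraph above: continuing from weights you trained yourself, where the heads already match your class count.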
You can evaluate the performance of your trained model using appropriate metrics such as mean average precision (mAP). Use a separate validation set or perform cross-validation to assess the model’s accuracy and fine-tune hyperparameters if necessary.
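To make the metric concrete, here is a dependency-free sketch of average precision (AP) for a single image and a single class at one IoU threshold; averaging it over the validation images gives one common flavor of mAP. It is an illustration only: it matches on bounding boxes to stay short, whereas a mask-based evaluation would compare masks instead, and the function names are my own. Boxes are assumed to be `(y1, x1, y2, x2)`.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    y1, x1 = max(a[0], b[0]), max(a[1], b[1])
    y2, x2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y2 - y1) * max(0, x2 - x1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(gt_boxes, pred_boxes, pred_scores, iou_threshold=0.5):
    """Area under the interpolated precision-recall curve for one image."""
    if not gt_boxes:
        return 0.0
    # Visit predictions from most to least confident.
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    matched, hits = set(), []
    for i in order:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue  # each ground-truth box can be claimed only once
            iou = box_iou(pred_boxes[i], gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            hits.append(1)  # true positive
        else:
            hits.append(0)  # false positive
    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / k)
        recalls.append(tp / len(gt_boxes))
    # Interpolate: precision at each point becomes the best precision to its right.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

For example, with two ground-truth objects and three predictions where the second most confident one is a false positive, the function returns 5/6, because the second object is only recovered at a precision of 2/3.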
Once you are satisfied with the model’s performance, you can run inference on the new, unseen images in the testing dataset. The model will detect the objects of interest and produce a bounding box and a mask for each one.
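In the Matterport implementation, `model.detect([image])` returns one dictionary per image with the keys `rois`, `class_ids`, `scores`, and `masks`. The sketch below shows one way to turn such a dictionary into readable labels. The helper name `summarize_detections`, the 0.9 threshold, and the sample values are assumptions for illustration, and the masks are left out to keep the example free of dependencies.

```python
def summarize_detections(result, class_names, min_score=0.9):
    """Return (label, box) pairs for detections at or above min_score."""
    summary = []
    for box, class_id, score in zip(result["rois"], result["class_ids"], result["scores"]):
        if score >= min_score:
            label = f"{class_names[class_id]} {score * 100:.1f}%"
            summary.append((label, tuple(box)))
    return summary


# Made-up result in the same layout, with plain lists instead of NumPy arrays.
result = {
    "rois": [(10, 20, 110, 220), (5, 5, 50, 50)],  # (y1, x1, y2, x2)
    "class_ids": [1, 1],
    "scores": [0.973, 0.42],
}
print(summarize_detections(result, ["BG", "object"]))
# [('object 97.3%', (10, 20, 110, 220))]
```

Index 0 of `class_names` is reserved for the background, which matches the "class + background" setup described earlier.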
If your model doesn’t achieve satisfactory results, you can fine-tune it by adjusting the hyperparameters or collecting more training data to improve its performance.
My model had just one class. It could generate masks, report the confidence of each prediction as a percentage, show all detected objects in the same color, save the results in a new folder, and process multiple images in one run.
It’s worth noting that implementing Mask R-CNN from scratch can be a complex task, especially if you are new to deep learning. Utilizing existing implementations and libraries can significantly simplify the process and save time.
Overall, Mask R-CNN generates a binary mask for each detected object and achieves a fine level of granularity in understanding and segmenting objects in images. With its ability to handle multiple instances of the same object class and its versatility in detecting various object categories, Mask R-CNN is an impressive tool for instance segmentation. Its wide range of applications, including autonomous driving, medical imaging, and interactive image editing, coupled with its pre-trained models and open-source implementations, makes it a powerful model in the field of computer vision.





