
Summary

The web content introduces Grounding DINO as a groundbreaking solution for simplifying the image annotation process in object detection tasks, enabling the detection and labeling of arbitrary objects with minimal human input.

Abstract

The article "Simplifying Object Detection: Annotate Your Custom Dataset with Grounding DINO" discusses the challenges of manual image annotation for object detection and introduces Grounding DINO as an innovative approach to address these issues. Grounding DINO combines a Transformer-based detector with grounded pre-training to detect a wide range of objects, even those not seen during training. The method is particularly useful for creating annotated datasets for diverse applications such as autonomous vehicles, retail, healthcare, and surveillance. The article also provides a step-by-step guide on using Grounding DINO within a Google Colab environment to automate the annotation process, demonstrating its effectiveness in saving time and resources while maintaining high accuracy and precision.

Opinions

  • The author strongly advocates for Grounding DINO, considering it one of the best solutions for object detection annotation after conducting extensive research.
  • The author expresses frustration with the time-consuming nature of manual annotation and emphasizes the need for a more efficient process.
  • There is an endorsement for the tool SnapSwift as a solution for efficiently gathering images without the need for manual scraping, suggesting it as a time-saving alternative.
  • The author highlights the importance of accurate annotations for the performance of machine learning models in object detection tasks.
  • The article suggests that Grounding DINO's ability to recognize objects based on textual descriptions is particularly beneficial for domains where it's impractical to have training data for every possible object.
  • The author believes that tools like Grounding DINO are crucial for the evolution of AI and machine learning, as they bridge the gap between visual data and computer interpretation.

Simplifying Object Detection: Annotate Your Custom Dataset with Grounding DINO

Generated Image. OpenAI’s DALL·E

Navigating the realm of object detection can be a daunting task, especially when training models like YOLO that need annotated image datasets. If you have to work with custom data and define your own annotations, the process quickly becomes time-consuming. But here is the great news I was dying to share with you: Grounding DINO. The approach is groundbreaking in that it combines the Transformer-based detector DINO with grounded pre-training, allowing arbitrary objects to be detected with minimal human input. This guide explains how to annotate your custom dataset in the PASCAL VOC format so it is ready to use in your object detection projects.

While working on my own object detection project, I researched multiple solutions, both manual and AI-driven, and I am convinced that Grounding DINO is one of the best. However, with so few resources available, getting it working was not immediate. I hope to shed some light on this method so that more people can fly through the object detection phase instead of being bogged down by their annotation process. With this guide, you’ll see how to annotate your custom data in just a few minutes. Let’s dive in!

Pro Tip: If you’re tired of manually collecting images for your classes or facing issues with Selenium and Chrome integration, I’ve got the perfect solution for you. Check out my blog titled “Stop Wasting Time Scraping Images! Get SnapSwift, Your Python Bing Downloader”. It’s a game-changer for efficiently gathering the images you need without the hassle. Don’t miss out on making your data collection process smoother and more effective!

What’s Image Annotation?

Annotating images for object detection is a fundamental step in training machine learning models to accurately identify and locate objects within images. This process involves labeling images with bounding boxes around each object of interest and providing corresponding labels that describe what each object is. Let’s delve into why this is necessary, its uses, and the challenges involved, including why it can be difficult to find pre-annotated data that meets your specific needs.

Why Annotation is Necessary

  1. Training Data for Machine Learning Models: Object detection models learn to recognize patterns and features that define different objects by looking at examples. The more accurately these examples are annotated, the better the model becomes at detecting objects.
  2. Accuracy and Precision: Proper annotations ensure that the model not only recognizes an object but also precisely locates it within various contexts and backgrounds. This is crucial for applications where precise object localization matters, such as autonomous driving and medical imaging.
  3. Diverse Learning: By annotating images from a wide range of scenarios and object variations, models can learn to generalize better and perform well in real-world situations, reducing the chance of misidentification.

Uses of Annotated Images

  1. Autonomous Vehicles: For detecting pedestrians, other vehicles, and road signs to navigate safely.
  2. Retail: In identifying products on shelves for inventory management or self-checkout systems.
  3. Healthcare: Helping in the diagnosis by pinpointing areas of interest in medical scans.
  4. Surveillance: For monitoring areas to detect suspicious activities or track objects of interest.

Challenges in Manual Annotation

  1. Time-Consuming: Manually drawing bounding boxes and labeling each object in thousands of images is extremely labor-intensive and time-consuming.
  2. Accuracy Required: Precision in drawing bounding boxes is crucial. Inaccurate annotations can lead to poorly trained models.
  3. Expertise Needed: Certain domains require annotators with specialized knowledge, especially in fields like healthcare, where understanding medical imagery is essential.
  4. Scalability: As the amount of data grows, scaling the manual annotation process becomes increasingly challenging without significant investment in human resources.

Difficulty in Finding Pre-Annotated Data

  1. Specific Requirements: Projects often have unique requirements in terms of objects to be detected, making it hard to find datasets that match these needs exactly.
  2. Quality and Reliability: Public datasets may vary in quality and annotation standards, which can affect the performance of object detection models.
  3. Limited Coverage: Certain domains or rare objects might not be well-represented in publicly available datasets.
  4. Legal and Privacy Concerns: Using pre-annotated datasets may come with restrictions or require compliance with privacy laws, especially with images involving people or sensitive locations.

Various Formats

Various formats exist for image annotation; let’s briefly explore a few of them:

  1. PASCAL VOC Format uses XML files detailing image information, object class labels, and bounding box coordinates.
  2. COCO Format uses JSON files containing image metadata, category IDs, and annotations for bounding boxes or segmentation masks.
  3. YOLO Format, mostly used for real-time object detection, uses one text file per image, listing the class index and normalized bounding box coordinates for each object (see the short sketch after this list).
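To make the difference concrete, here is a minimal Python sketch (the image size, box coordinates, and class index below are made up purely for illustration) that converts one absolute-pixel box, the way PASCAL VOC stores it, into the normalized line a YOLO label file expects:

# Convert a PASCAL VOC style box (absolute pixel corners) into a YOLO label line.
# All values below are illustrative placeholders.
img_w, img_h = 640, 480                        # image width and height in pixels
xmin, ymin, xmax, ymax = 120, 60, 360, 300     # VOC-style corner coordinates
class_index = 0                                # index of the class in your classes list

# YOLO stores: class_index, center_x, center_y, width, height (all normalized to 0-1)
cx = (xmin + xmax) / 2 / img_w
cy = (ymin + ymax) / 2 / img_h
w = (xmax - xmin) / img_w
h = (ymax - ymin) / img_h
print(f"{class_index} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")  # one line of a YOLO .txt file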

What is Grounding DINO?

Grounding DINO is an innovative approach to object detection that combines the strengths of DINO, a Transformer-based detector, with grounded pre-training to enable open-set object detection. This means it can detect objects not seen during training, making it a zero-shot model. Let’s break down what makes Grounding DINO notable and how it accomplishes image annotation.

It uses a PyTorch-based framework designed for open-set object detection, the task of identifying and localizing objects that the model has not explicitly been trained to recognize. It achieves this through a combination of DINO’s architecture and a novel grounded pre-training strategy. A zero-shot model like Grounding DINO can recognize objects without having been directly trained on them, thanks to its ability to understand and process language descriptions. Grounding DINO leverages language to bridge the gap between seen and unseen objects, enabling it to identify items based solely on textual descriptions. This capability is particularly useful for applications where it’s impractical to have training data for every object that might be encountered.

How Grounding DINO Annotates Images

Grounding DINO operates by accepting pairs of images and text (labels) as input. It then outputs object boxes with similarity scores across all input words, allowing objects to be detected based on their textual descriptions. The model uses thresholds to decide which boxes to keep based on the highest similarity scores, and it extracts words as predicted labels when their scores exceed a certain threshold. This enables Grounding DINO not only to detect objects but also to understand and label them accurately, essentially automating the annotation process. Users can specify phrases to target specific objects within images, making Grounding DINO a flexible tool for creating annotated datasets.
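As a quick illustration of that input/output contract, here is a minimal sketch of a single inference call using the Model class from the Grounding DINO inference API that this guide installs and loads later (Steps 4 to 8); the image path and prompt text are placeholders, and the thresholds mirror the values used in the annotation script below:

# Minimal sketch: one image plus a free-text prompt in, boxes plus matched words out.
# Assumes the repo is installed and CONFIG_PATH / WEIGHTS_PATH are set as in Steps 5 and 6.
import cv2
from groundingdino.util.inference import Model

model = Model(model_config_path=CONFIG_PATH, model_checkpoint_path=WEIGHTS_PATH)
image = cv2.imread("sample.jpg")           # placeholder image path
detections, phrases = model.predict_with_caption(
    image=image,
    caption="tomato. onion. garlic.",      # objects described as free text
    box_threshold=0.35,                    # minimum box-to-text similarity to keep a box
    text_threshold=0.25                    # minimum score for a word to become a label
)
print(detections.xyxy)                     # bounding boxes in pixel coordinates
print(phrases)                             # the words matched to each box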

Here’s a step-by-step guide to using Grounding DINO for auto-annotating images, tailored for a Google Colab environment. This process will guide you through setting up the environment, installing Grounding DINO, downloading your data, and annotating images.

Let’s Annotate

Step 1: Check GPU Availability

Use !nvidia-smi to check if a GPU is available for faster processing.
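In a Colab cell, that is simply:

!nvidia-smi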

Step 2: Mount Google Drive (Optional)

If your data is on Google Drive, mount it using:

from google.colab import drive
drive.mount('/content/drive')

Step 3: Set Home Directory

Define a HOME constant to manage datasets, images, and models easily:

import os
HOME = os.getcwd()
print(HOME)

Step 4: Install Grounding DINO

Clone the Grounding DINO repository, switch to a specific feature branch (if necessary), and install the dependencies:

%cd {HOME}
!git clone https://github.com/IDEA-Research/GroundingDINO.git
%cd {HOME}/GroundingDINO

# we use the latest Grounding DINO inference API, which is not part of an official release yet
!git checkout feature/more_compact_inference_api
!pip install -q -e .
!pip install -q roboflow dataclasses-json onemetric

Step 5: Additional Dependencies & Verify CUDA and PyTorch

Ensure CUDA and PyTorch are correctly installed and compatible:

import torch
!nvcc --version
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
CUDA_VERSION = torch.__version__.split("+")[-1]
print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)

import roboflow
import supervision
print(
    "roboflow:", roboflow.__version__,
    "; supervision:", supervision.__version__
)
# confirm that the configuration file exists
import os
CONFIG_PATH = os.path.join(HOME, "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py")
print(CONFIG_PATH, "; exist:", os.path.isfile(CONFIG_PATH))

Step 6: Download Configuration and Weights

Ensure the configuration file exists within the cloned repository and download the model weights:

# download weights file
%cd {HOME}
!mkdir {HOME}/weights
%cd {HOME}/weights
!wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
# confirm that the weights file exists
import os
WEIGHTS_PATH = os.path.join(HOME, "weights", "groundingdino_swint_ogc.pth")
print(WEIGHTS_PATH, "; exist:", os.path.isfile(WEIGHTS_PATH))

Step 7: Download and Prepare Your Dataset

If your dataset is zipped in your drive, unzip it to a local directory:

import zipfile
# Path to the zip file
zip_file_path = "/content/drive/MyDrive/....[your file path]"
# Directory to extract the contents of the zip file
extract_dir = "/content/data"
# Unzip the file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)
print("Extraction complete.")

Step 8: Load the Grounding DINO Model

Load the model using the configuration and weights path:

%cd {HOME}/GroundingDINO
from groundingdino.util.inference import Model
model = Model(model_config_path=CONFIG_PATH, model_checkpoint_path=WEIGHTS_PATH)

Step 9: Annotate the Dataset and Save in PASCAL VOC Format

Use the model to annotate images. You can run inference in different modes like caption, classes, or enhanced classes depending on your needs. After inference, use the detections and labels to annotate images using your preferred method or the provided utility functions.

Automate the annotation process for your entire dataset by iterating over your images, running the model to detect objects, and saving both the annotated images and their PASCAL VOC XML files.

import os
import cv2
import xml.etree.ElementTree as ET
from groundingdino.util.inference import Model
from tqdm import tqdm

# Define the home directory and the path to the dataset
HOME = "/content"
DATASET_DIR = os.path.join(HOME, "data", "ingredients_images_dataset")
# Load the Grounding DINO model
MODEL_CONFIG_PATH = os.path.join(HOME, "GroundingDINO", "groundingdino", "config", "GroundingDINO_SwinT_OGC.py")
WEIGHTS_PATH = os.path.join(HOME, "weights", "groundingdino_swint_ogc.pth")
model = Model(model_config_path=MODEL_CONFIG_PATH, model_checkpoint_path=WEIGHTS_PATH)
# Load class labels from the file
LABELS_FILE_PATH = "[path to a .txt file containing your class labels, one per line]"
with open(LABELS_FILE_PATH, "r") as f:
    CLASSES = [line.strip() for line in f.readlines()]
# Define annotation thresholds
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25
# Function to enhance class names
def enhance_class_name(class_names):
    return [f"all {class_name}s" for class_name in class_names]
# Function to create Pascal VOC format XML annotation
def create_pascal_voc_xml(image_filename, image_shape, boxes, labels):
    annotation = ET.Element("annotation")
    folder = ET.SubElement(annotation, "folder")
    folder.text = "ingredient_annotations"  # Folder name for annotations
    filename = ET.SubElement(annotation, "filename")
    filename.text = image_filename
    source = ET.SubElement(annotation, "source")
    database = ET.SubElement(source, "database")
    database.text = "Unknown"
    size = ET.SubElement(annotation, "size")
    width = ET.SubElement(size, "width")
    height = ET.SubElement(size, "height")
    depth = ET.SubElement(size, "depth")
    width.text = str(image_shape[1])
    height.text = str(image_shape[0])
    depth.text = str(image_shape[2])
    segmented = ET.SubElement(annotation, "segmented")
    segmented.text = "0"
    for box, label in zip(boxes, labels):
        object = ET.SubElement(annotation, "object")
        name = ET.SubElement(object, "name")
        pose = ET.SubElement(object, "pose")
        truncated = ET.SubElement(object, "truncated")
        difficult = ET.SubElement(object, "difficult")
        bndbox = ET.SubElement(object, "bndbox")
        xmin = ET.SubElement(bndbox, "xmin")
        ymin = ET.SubElement(bndbox, "ymin")
        xmax = ET.SubElement(bndbox, "xmax")
        ymax = ET.SubElement(bndbox, "ymax")
        name.text = label
        pose.text = "Unspecified"
        truncated.text = "0"
        difficult.text = "0"
        xmin.text = str(int(box[0]))
        ymin.text = str(int(box[1]))
        xmax.text = str(int(box[2]))
        ymax.text = str(int(box[3]))
    # Serialize the XML annotation to a string
    xml_string = ET.tostring(annotation, encoding="unicode")
    return xml_string
# Function to annotate images in a directory and save annotated images in Pascal VOC format
def annotate_images_in_directory(directory):
    for class_name in CLASSES:
        class_dir = os.path.join(directory, class_name)
        annotated_dir = os.path.join(directory, f"{class_name}_annotated")
        os.makedirs(annotated_dir, exist_ok=True)
        print("Processing images in directory:", class_dir)
        if os.path.isdir(class_dir):
            for image_name in tqdm(os.listdir(class_dir)):
                image_path = os.path.join(class_dir, image_name)
                image = cv2.imread(image_path)
                if image is None:
                    print("Failed to load image:", image_path)
                    continue
                detections = model.predict_with_classes(
                    image=image,
                    classes=enhance_class_name([class_name]),
                    box_threshold=BOX_THRESHOLD,
                    text_threshold=TEXT_THRESHOLD
                )
                # Drop potential detections with phrase not part of CLASSES set
                detections = detections[detections.class_id != None]
                # Drop potential detections with area close to area of the whole image
                detections = detections[(detections.area / (image.shape[0] * image.shape[1])) < 0.9]
                # Drop potential double detections
                detections = detections.with_nms()
                # Create the Pascal VOC XML annotation for this image
                xml_annotation = create_pascal_voc_xml(image_filename=image_name, image_shape=image.shape, boxes=detections.xyxy, labels=[class_name] * len(detections.xyxy))
                # Save the Pascal VOC XML annotation to a file
                xml_filename = os.path.join(annotated_dir, f"{os.path.splitext(image_name)[0]}.xml")
                with open(xml_filename, "w") as xml_file:
                    xml_file.write(xml_annotation)
                # Copy the original image into the annotated folder alongside its XML file
                annotated_image_path = os.path.join(annotated_dir, image_name)
                cv2.imwrite(annotated_image_path, image)
# Annotate images in the dataset directory
annotate_images_in_directory(DATASET_DIR)
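Once the script finishes, each "<class>_annotated" folder holds the images together with their Pascal VOC XML files. As a quick sanity check, here is a minimal sketch that parses one of the generated files back and prints its boxes; the file path below is a placeholder you should replace with one of your own annotations:

import xml.etree.ElementTree as ET

# Placeholder path: point this at one of the XML files the script just wrote.
tree = ET.parse("/content/data/ingredients_images_dataset/tomato_annotated/example.xml")
root = tree.getroot()
for obj in root.findall("object"):
    name = obj.find("name").text
    box = obj.find("bndbox")
    coords = [box.find(tag).text for tag in ("xmin", "ymin", "xmax", "ymax")]
    print(name, coords)  # class label followed by its bounding box corners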

Conclusion

In summary, Grounding DINO revolutionizes the way we approach image annotation, simplifying a process that’s traditionally been both time-consuming and complex. This guide has shown you how to harness the power of Grounding DINO for efficient open-set object detection, streamlining your projects in AI and machine learning. As the field evolves, tools like Grounding DINO are set to play a pivotal role in bridging the gap between visual data and computer interpretation, enhancing both the development and application of machine learning models.

Don’t miss out on future articles covering AI, ML, security, and productivity: subscribe to Teen Different for insights and updates that keep you at the forefront of technology.

Check out my repo:

References

  1. Anil, “Automating Image Annotation with GroundingDINO,” Medium, Link.
  2. IDEA-Research, “GroundingDINO GitHub Repository,” GitHub, Link.
  3. “Automating Image Annotation with GroundingDINO — YouTube Video,” YouTube, Link.

This content is not exclusively written using AI; it includes my own work and learning as the author.
