G-DINO Paper for Dummies
This blog outlines all the important concepts used in the methodology of this paper.
What does G-DINO do? Explained in plain English
Input: A text prompt and an image
What does it do? It takes a text prompt and identifies the “arbitrary objects” in the input image by drawing a bounding box around every object associated with the prompt.
Output: An image with all identified “arbitrary objects” marked by their respective bounding boxes.
Applications
You could use text prompts to create bounding boxes around objects or regions of interest in a given input image. This accelerates annotation: you can build a custom dataset with the classes you care about and use it to train your own supervised detection system.
How did this idea pop up?
Previous works mostly evaluated OD (object detection) models on novel categories; G-DINO additionally evaluates referring expression comprehension for objects specified with attributes.
Novel categories in object detection typically refer to objects or classes that were not present in the training dataset but need to be detected when encountered in real-world scenarios.
G-DINO also follows up on the GLIP model, which reformulated object detection as a phrase grounding task and introduced contrastive training between object regions and language phrases. However, GLIP is constrained by its design, which is based on a traditional one-stage detector, Dynamic Head.
GLIP stands for Grounded Language-Image Pre-training. Phrase grounding is the task of identifying correspondences between tokens in a text prompt and objects or regions in an image.
The authors wanted to create a powerful system that can find any object described in human words. This approach is called “open-set object detection”.
Since closed-set and open-set detection are closely connected, having a stronger closed-set object detector can also improve the open-set detector.
Closed-set detection can be read like closed-set classification: if the model was trained on 4 classes, it can only detect/classify among those 4 classes.
Open-set detection, by contrast, means being able to detect new categories the model was not necessarily trained on. Unlike the closed-set case, even if the model was never trained to detect the class “dog”, it can still output “dog” on a test image.
Advantages of G-DINO over GLIP:
- Its Transformer-based architecture is similar to that of language models, making it easier to process both image and language data.
- Transformer-based detectors have demonstrated a superior capability to leverage large-scale datasets.
- DINO can be optimized end-to-end without any hand-crafted modules such as NMS (Non-Maximum Suppression), which greatly simplifies the overall grounding model design.
The open-set detector is based on DINO. Most existing open-set detectors are developed by extending closed-set detectors to open-set scenarios with language information.
The key to open-set detection is introducing language for generalization to unseen objects.
This is important because the primary reason for combining modalities (image and text) is to introduce the ability to “query” with words as inputs and detect arbitrary objects, which requires the object detector’s set of classes to be open.
A closed-set detector typically has three important modules: a backbone for feature extraction, a neck for feature enhancement, and a head for region refinement (or box prediction). A closed-set detector can be generalized to detect novel objects by learning language-aware region embeddings, so that each region can be classified into novel categories in a language-aware semantic space.
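To make “classifying regions in a language-aware semantic space” concrete, here is a toy sketch (not the paper’s actual head; all tensors below are random stand-ins): the fixed classification layer is replaced by a similarity score between each region embedding and the text embeddings of whatever category names you supply.

```python
import torch

# Toy sketch: instead of a fixed N-way classification layer, score each region
# embedding against text embeddings of arbitrary category names. Adding a new
# category only requires a new text embedding, not a new output layer.
region_embeddings = torch.randn(100, 256)                  # 100 candidate regions (random stand-ins)
category_names = ["dog", "big red apple", "traffic light"]
text_embeddings = torch.randn(len(category_names), 256)    # would come from a text encoder in practice

logits = region_embeddings @ text_embeddings.T             # (100, 3) language-aware scores
predicted = [category_names[i] for i in logits.argmax(dim=1).tolist()]
print(predicted[:5])
```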
G-DINO Architecture
It contains:
- Image backbone for image feature extraction
- Text backbone for text feature extraction.
- Feature enhancer for image and text feature fusion
- Language-guided query selection module for query initialization
- Cross-modality decoder for box refinement
It is a dual-encoder, single-decoder architecture. Each encoder encodes a different type of input data (text or image), and the decoder generates the output based on the information encoded by both encoders.
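Putting the listed components together, here is a wiring-only sketch of the data flow; every sub-module is a placeholder passed in from outside, and the module and head names are mine, not the paper’s.

```python
import torch.nn as nn

class GroundingDINOSkeleton(nn.Module):
    """Wiring sketch only: each sub-module is supplied by the caller and stands in for
    the real backbones, feature enhancer, query selection and cross-modality decoder."""
    def __init__(self, image_backbone, text_backbone, feature_enhancer,
                 query_selector, decoder, box_head, contrastive_head):
        super().__init__()
        self.image_backbone = image_backbone
        self.text_backbone = text_backbone
        self.feature_enhancer = feature_enhancer
        self.query_selector = query_selector
        self.decoder = decoder
        self.box_head = box_head
        self.contrastive_head = contrastive_head

    def forward(self, image, text_tokens):
        img = self.image_backbone(image)             # vanilla image features
        txt = self.text_backbone(text_tokens)        # vanilla text features
        img, txt = self.feature_enhancer(img, txt)   # cross-modality feature fusion
        queries = self.query_selector(img, txt)      # language-guided query initialization
        queries = self.decoder(queries, img, txt)    # cross-modality decoder (box refinement)
        return self.box_head(queries), self.contrastive_head(queries, txt)
```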
For each (Image, Text) pair, we first extract vanilla image features and vanilla text features using an image backbone and a text backbone, respectively.
Vanilla Features: Extracting vanilla features refers to the process of obtaining fundamental, basic characteristics or representations from the input. These features are typically low-level and do not capture high-level semantic information about the contents; instead, they focus on capturing simple elements. For images, vanilla features would be edges, textures, colors, and shapes that can be used for various computer vision tasks; for text, they would be bag-of-words, TF-IDF, word embeddings, N-grams, etc.
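In G-DINO these vanilla features are simply the raw outputs of the two backbones. A minimal sketch of that step follows: the paper pairs a Swin Transformer image backbone with a BERT text backbone, but here a ResNet-50 stands in for the image side to keep the example short, and the input image is a dummy tensor.

```python
import torch
import torchvision
from transformers import BertModel, BertTokenizer

# ResNet-50 used purely as a lighter stand-in for the paper's Swin Transformer backbone.
image_backbone = torchvision.models.resnet50(weights=None)
image_backbone = torch.nn.Sequential(*list(image_backbone.children())[:-2])  # keep conv stages, drop pool/fc

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_backbone = BertModel.from_pretrained("bert-base-uncased")

image = torch.randn(1, 3, 224, 224)                          # dummy input image
tokens = tokenizer("big red apples .", return_tensors="pt")  # text prompt

with torch.no_grad():
    vanilla_image_features = image_backbone(image)                       # (1, 2048, 7, 7) feature map
    vanilla_text_features = text_backbone(**tokens).last_hidden_state    # (1, seq_len, 768) token features
```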
After obtaining these features, they use “a language-guided query selection module to select cross-modality queries from image features”
This can also be read as: “this module helps you find the image regions that match what you describe using language. It understands your words, looks at the image features, and picks the ones that fit your description best.” A minimal code sketch follows the step-by-step list below.
Here’s what it means step by step:
- The Query: Language Input: It starts with some words or sentences that describe what you want to find in images. For example, you might type or say something like, “Find pictures of big red apples.”
- Vanilla text feature extraction: Understanding the Language: The module understands the words you used and figures out what you’re looking for. In our example, it understands that you want pictures of “big red apples.”
- Selecting the Right Image Features: Now, it looks at a bunch of pictures, which are represented as features, like shapes, colors, and patterns. Imagine these features as descriptions of each picture.
- Matching Language to Features: The module compares what you asked for (big red apples) with the features of each picture. It tries to find pictures that match your description.
- Selecting the Best Matches: Based on how well the pictures match what you asked for, it selects the best pictures and shows them to you. These are the pictures of big red apples that you were looking for.
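Here is a minimal sketch of my reading of this selection step: score every image token against every text token, and keep the image tokens that best match some word of the prompt as the initial decoder queries. Tensor sizes are illustrative and the tensors themselves are random stand-ins.

```python
import torch

num_queries = 900                                    # the paper follows DINO and uses 900 queries
image_tokens = torch.randn(1, 10_000, 256)           # (batch, num_image_tokens, dim)
text_tokens = torch.randn(1, 12, 256)                # (batch, num_text_tokens, dim)

# Score each image token against each text token, keep the best-matching word's score.
similarity = image_tokens @ text_tokens.transpose(1, 2)    # (1, 10000, 12)
scores = similarity.max(dim=-1).values                     # (1, 10000)

# The top-scoring image tokens become the initial cross-modality queries.
topk_idx = scores.topk(num_queries, dim=1).indices         # (1, 900)
queries = torch.gather(image_tokens, 1, topk_idx.unsqueeze(-1).expand(-1, -1, 256))  # (1, 900, 256)
```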
Feature Enhancer Block:
For the image features, the enhancer uses deformable self-attention (DSA), an enhancement of the traditional self-attention mechanism designed to improve the model’s ability to capture fine-grained details and local patterns in images.
In a standard self-attention mechanism, an input image is divided into patches or regions, and each patch is represented as a query, key, and value. These queries, keys, and values are used to calculate attention scores, determining how much each patch should “pay attention” to other patches in the image. Traditional self-attention operates over this fixed grid of patches, with every patch attending to the others in the same dense, content-independent way. Instead of this fixed pattern, DSA allows each patch to dynamically adjust its attention to other parts of the image. This enables the model to focus more on informative regions and adapt its attention based on the content of the image.
In DSA, the key innovation is the introduction of learnable offsets. For each query in a patch, the model learns how to deform (shift) the position of the key patches that it attends to. These learnable offsets allow the model to adaptively change its attention pattern for different patches.
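A heavily simplified, single-scale, single-head version of this idea might look like the following; this is not the paper’s multi-scale implementation, and the shapes, module names, and the use of `grid_sample` for bilinear sampling are my own choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Toy single-scale, single-head deformable attention (illustrative only)."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(dim, n_points * 2)   # learnable (dx, dy) offset per sampling point
        self.weight_proj = nn.Linear(dim, n_points)       # attention weight per sampling point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_map):
        # queries:    (B, N, C) query embeddings
        # ref_points: (B, N, 2) normalized (x, y) reference locations in [0, 1]
        # feat_map:   (B, C, H, W) image feature map
        B, N, C = queries.shape
        H, W = feat_map.shape[-2:]

        value = self.value_proj(feat_map.flatten(2).transpose(1, 2))      # (B, H*W, C)
        value = value.transpose(1, 2).reshape(B, C, H, W)

        offsets = self.offset_proj(queries).view(B, N, self.n_points, 2)  # learned offsets per query
        weights = self.weight_proj(queries).softmax(-1)                   # (B, N, K)

        # Sampling location = reference point + learned offset (in normalized coordinates).
        scale = torch.tensor([W, H], dtype=queries.dtype, device=queries.device)
        loc = ref_points[:, :, None, :] + offsets / scale
        grid = 2 * loc - 1                                                # grid_sample expects [-1, 1]
        sampled = F.grid_sample(value, grid, align_corners=False)         # (B, C, N, K)

        out = (sampled * weights[:, None]).sum(-1).transpose(1, 2)        # (B, N, C)
        return self.out_proj(out)
```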
Vanilla self-attention is used as the text feature enhancer.
Cross-Modality Decoder
This part of the architecture combines image and text modality features.
Here, the DINO decoder layer has been improved by supplementing each decoder layer with an extra text cross-attention block. This injects text information into the queries for better modality alignment.
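A simplified sketch of where that extra block sits inside a decoder layer is shown below; standard multi-head attention is used everywhere here, whereas the real model uses deformable attention for the image cross-attention, and the normalization details may differ.

```python
import torch.nn as nn

class CrossModalityDecoderLayer(nn.Module):
    """Simplified sketch: a DINO-style decoder layer plus the extra text cross-attention."""
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.image_cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)  # deformable in the real model
        self.text_cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)   # the added block
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, queries, image_feats, text_feats):
        # queries: (B, Nq, C), image_feats: (B, Ni, C), text_feats: (B, Nt, C)
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        q = self.norms[1](q + self.image_cross_attn(q, image_feats, image_feats)[0])
        q = self.norms[2](q + self.text_cross_attn(q, text_feats, text_feats)[0])   # inject text info
        return self.norms[3](q + self.ffn(q))
```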
Sub-Sentence Level Text Feature
Another opportunity to improve the model was fine-tuning how text prompt features are extracted, to get the maximum signal: retaining fine-grained, per-word understanding while also ensuring there are no unwanted interactions between words of unrelated categories.
They named this sub-sentence-level representation, since it meets sentence-level representation and word-level representation in the middle.
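As I understand it, the practical mechanism is an attention mask that only lets tokens belonging to the same category phrase attend to each other; a toy version:

```python
import torch

def sub_sentence_attention_mask(token_phrase_ids):
    """Toy version: allow attention only between tokens that belong to the same
    category phrase, so unrelated categories in one prompt cannot interact."""
    ids = token_phrase_ids
    return ids[:, None] == ids[None, :]      # (L, L) boolean mask, True = attention allowed

# Example: the prompt "big red apple . dog ." tokenized into two phrase groups.
phrase_ids = torch.tensor([0, 0, 0, 1])      # three tokens of phrase 0, one token of phrase 1
print(sub_sentence_attention_mask(phrase_ids))
```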
Fin.