The web content provides a comprehensive guide to using OpenAI's CLIP model for multi-modal machine learning, demonstrating how it can understand and translate between text and images through vector embeddings.
Abstract
The article titled "Quick-fire Guide to Multi-Modal ML With OpenAI’s CLIP" introduces the Contrastive Language-Image Pretraining (CLIP) model, which is capable of understanding the relationships between text and images. CLIP consists of two parallel models, a Vision Transformer or ResNet for image embeddings and a transformer for language embeddings, both producing 512-dimensional vector embeddings. The training process involves minimizing the contrastive loss between the embeddings of matched text-image pairs, allowing CLIP to map similar pairs into a shared vector space. The guide also includes practical steps for using the Hugging Face implementation of CLIP, including loading data, creating text and image embeddings, and performing text-image searches. It concludes with a demonstration of how CLIP can accurately retrieve an image of "a dog in the snow" from a dataset of 100 images, showcasing its potential for various applications in language-image domains.
Opinions
The author suggests that CLIP's ability to connect text and images is a significant advancement in machine learning, as it mimics a skill that humans acquire at an early age.
The article conveys that CLIP's capabilities go beyond a mere parlor trick, emphasizing its practical applications and effectiveness in understanding complex concepts in both text and visual form.
The author expresses a preference for the Hugging Face implementation of CLIP, noting its use of a Vision Transformer model over the alternative ResNet setup.
The guide promotes the use of dot product similarity for comparing vector embeddings, advocating for normalization of vectors to ensure accurate comparisons.
The author encourages further exploration of multi-modal models, NLP, and vector search, providing links to additional resources such as their YouTube channel, Discord community, and free courses.
The article implies that while CLIP is powerful for searching within smaller datasets, scaling up to millions or billions of records would require a vector database, suggesting Pinecone as a potential solution.
Quick-fire Guide to Multi-Modal ML With OpenAI’s CLIP
Learn how to translate from text to images and back again with CLIP and vector embeddings
After a few short years of life, children can fathom the concepts behind simple words and connect them to related images. They can identify the connection between the shapes and textures of the physical world and the abstract symbols of written language.
It’s something we take for granted. Very few (if any) people in the world will remember a time when these “basic” skills were beyond their capacity.
Computers are different. They can calculate the parameters a rocket needs to traverse the solar system. But if you ask a computer to find an image of “a dog in the park”, you’re better off asking NASA for a free ticket to the space station.
At least, that was the case until recently.
In this article, we’re going to take a look at OpenAI’s CLIP, a “multi-modal” model capable of understanding the relationships and concepts between both text and images. As we’ll see, CLIP is more than a fancy parlor trick. It is shockingly capable.
Contrastive Learning?
Contrastive Language-Image Pretraining (CLIP) consists of two models trained in parallel: a Vision Transformer (ViT) or ResNet model for image embeddings and a transformer model for language embeddings.
During training, (image, text) pairs are fed into the respective models, and both output a 512-dimensional vector embedding that represents the respective image/text in vector space.
The contrastive component compares these two vector embeddings and calculates the model loss from the difference (i.e., the contrast) between them: matched (image, text) pairs should score as similar, while mismatched pairs within a batch should not. Both models are then optimized to minimize this loss and therefore learn to embed matching (image, text) pairs close together in a shared vector space.
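To make the contrastive objective concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over a batch of (image, text) embedding pairs. It is an illustration under assumptions, not CLIP's training code: the function name, the fixed temperature of 0.07, and the random toy embeddings are all hypothetical (CLIP learns its temperature during training).

```python
import numpy as np


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    # Unit-normalize so that the dot product equals the cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = scaled similarity between image i and text j
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(scores):
        # The matching pair for row i sits on the diagonal (column i)
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


rng = np.random.default_rng(0)
text = rng.normal(size=(8, 512))                    # toy "caption" embeddings
aligned = text + 0.01 * rng.normal(size=(8, 512))   # images close to their captions
shuffled = np.roll(aligned, shift=1, axis=0)        # images paired with the wrong captions

loss_aligned = contrastive_loss(aligned, text)
loss_shuffled = contrastive_loss(shuffled, text)
print(loss_aligned < loss_shuffled)
```

The loss is small when each image embedding sits closest to its own caption and grows when the pairs are mismatched, which is the signal both encoders are trained on.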
After this contrastive pretraining process, we are left with CLIP, a multi-modal model capable of understanding both language and images via a shared vector space.
Using CLIP
OpenAI developed and released the clip library, which can be found on GitHub. However, Hugging Face’s transformers library hosts another implementation of CLIP (also built by OpenAI) that is more commonly used.
The Hugging Face implementation does not use ResNet for image encoding. It uses the alternative setup of a ViT model paired with the text transformer. We will learn how to use this implementation by stringing together a simple text-image search script that can be adapted for image-image, text-text, and image-text modalities.
Loading Data and CLIP
To begin, we will install the libraries needed for our demo, download the dataset, and initialize CLIP.
pip install -U torch datasets transformers
We will use the “imagenette” dataset, a collection of ~10K images hosted by Hugging Face.
That gives us 9469 images ranging from radios to dogs. All of these images are stored in the 'image' feature as PIL image objects.
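The loading step described above might look something like the sketch below. The Hub identifier "frgfm/imagenette" and the "full_size" configuration are assumptions on my part (the text only names the dataset as “imagenette”), and `load_imagenette` is a hypothetical helper, so swap in whichever copy of the dataset you are using.

```python
def load_imagenette(dataset_id="frgfm/imagenette", config="full_size", split="train"):
    """Load the training split of imagenette from the Hugging Face Hub.

    The default dataset_id and config are assumptions, not taken from the article.
    """
    # Imported inside the function so nothing is downloaded until it is called
    from datasets import load_dataset

    return load_dataset(dataset_id, config, split=split)


# Usage (downloads the data on first call):
# imagenette = load_imagenette()
# imagenette[0]["image"]  # a PIL image object
```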
Now we initialize CLIP via the transformers library like so:
A few things are happening here:
The device logic sets up our instance to use the fastest hardware available to us: a CUDA GPU, MPS on Apple silicon (M1) chips, or the CPU as a fallback.
We set the model_id. This is the name of the CLIP model found here.
Then we initialize a tokenizer for preprocessing text, a processor for preprocessing images, and the CLIP model for producing vector embeddings.
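A sketch of the initialization steps listed above could look like this. The model ID "openai/clip-vit-base-patch32" is an assumption (it is a ViT-based checkpoint that produces 512-dimensional embeddings, which matches the description), and `pick_device` and `load_clip` are hypothetical helper names.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple-silicon MPS, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def load_clip(model_id: str = "openai/clip-vit-base-patch32"):
    """Return (tokenizer, processor, model, device) for the given checkpoint."""
    # Imported here so the weights are only fetched when the function is called
    import torch
    from transformers import CLIPModel, CLIPProcessor, CLIPTokenizerFast

    device = pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
    tokenizer = CLIPTokenizerFast.from_pretrained(model_id)  # preprocesses text
    processor = CLIPProcessor.from_pretrained(model_id)      # preprocesses images
    model = CLIPModel.from_pretrained(model_id).to(device).eval()
    return tokenizer, processor, model, device


# Usage (downloads the checkpoint on first call):
# tokenizer, processor, model, device = load_clip()
```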
Now we’re ready to begin creating text and image embeddings.
Create Text Embeddings
The text transformer model handles the encoding of our text into meaningful vector embeddings. To do this, we first tokenize the text to translate it from human-readable text to transformer-readable tokens.
Then we feed these tokens into the model using the get_text_features method.
Here we have a 512-dimensional vector representing the semantic meaning of the phrase “a dog in the snow”. This is one-half of our text-image search.
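The two steps above (tokenize, then call get_text_features) can be sketched as a small helper. `embed_text` is a hypothetical name, and the sketch assumes a transformers release in which get_text_features returns a plain tensor, along with a model and tokenizer that have already been initialized.

```python
def embed_text(model, tokenizer, prompt: str):
    """Return a (1, dim) NumPy array holding the text embedding of `prompt`."""
    # Human-readable text -> transformer-readable token IDs, moved to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Assumed to return a tensor of shape (1, dim); dim is 512 for the model in this guide
    text_emb = model.get_text_features(**inputs)
    # Gradients are not needed for search, so detach before converting to NumPy
    return text_emb.detach().cpu().numpy()


# Usage, with an initialized CLIP model and tokenizer:
# text_emb = embed_text(model, tokenizer, "a dog in the snow")
# text_emb.shape
```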
Create Image Embeddings
The next step is creating image embeddings. Again, this is very straightforward. We swap the tokenizer for a processor which will give us a resized image tensor called pixel_values.
We can still visualize the processed image. It has been resized, and the pixel “activation” values are no longer within the typical RGB range of 0–255 that Matplotlib can read, so colors are not displayed correctly. Nonetheless, we can see that it is the same Sony radio that we saw before.
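If we do want the colors to display correctly, one option is to undo the processor's normalization before plotting. The sketch below assumes the per-channel mean and standard deviation used by the default CLIP image processor. `to_displayable` is a hypothetical helper, and the flat test image is toy data.

```python
import numpy as np

# Per-channel statistics assumed to match the CLIP image processor defaults
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])


def to_displayable(pixel_values):
    """Convert one normalized (3, H, W) array into an (H, W, 3) uint8 RGB image."""
    image = np.transpose(pixel_values, (1, 2, 0))  # channels-first -> channels-last
    image = image * CLIP_STD + CLIP_MEAN           # undo the normalization, back to [0, 1]
    return (np.clip(image, 0.0, 1.0) * 255).round().astype(np.uint8)


# Toy check: a flat image with value 0.2 in every channel, normalized like the processor does
flat = np.full((4, 4, 3), 0.2)
normalized = np.transpose((flat - CLIP_MEAN) / CLIP_STD, (2, 0, 1))
restored = to_displayable(normalized)
print(restored.shape, restored.dtype)
```

The returned array is in the 0–255 range, so it can be passed straight to Matplotlib's imshow.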
Next, we process these inputs with CLIP, this time using the get_image_features method.
And with that, we have built vector embeddings for both text and images with CLIP. With these embeddings, we can compare their similarity using metrics like Euclidean distance, cosine similarity, or dot product similarity.
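For reference, the three metrics mentioned above can be written in a few lines of NumPy. The function names and the two toy vectors are only for illustration.

```python
import numpy as np


def dot_product(a, b):
    """Sensitive to both the direction and the length of the vectors."""
    return float(np.dot(a, b))


def cosine_similarity(a, b):
    """Dot product of the vectors after scaling each to unit length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return float(np.linalg.norm(a - b))


a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])  # same direction as a, twice the length

print(dot_product(a, b), cosine_similarity(a, b), euclidean_distance(a, b))
```

The two vectors point in exactly the same direction, so cosine similarity treats them as identical, while the dot product and Euclidean distance are both affected by the difference in length.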
However, we can’t compare much with just a single example of each, so let’s move on and test this on a larger sample of images.
We will take 100 images at random from the imagenette data. To do this, we start by selecting 100 index positions at random and use them to build a list of images.
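One way to pick those index positions is sketched below. `sample_indices` is a hypothetical helper, the seed is arbitrary, and sampling without replacement is my choice here so that no image appears twice.

```python
import numpy as np


def sample_indices(dataset_size, sample_size=100, seed=0):
    """Pick `sample_size` distinct index positions in [0, dataset_size)."""
    rng = np.random.default_rng(seed)
    return rng.choice(dataset_size, size=sample_size, replace=False).tolist()


sample_idx = sample_indices(9469)  # 9469 is the number of images in our dataset

# With the dataset loaded, the indices select the images:
# images = [imagenette[i]["image"] for i in sample_idx]
```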
Now we iterate through these 100 images and create image embeddings with CLIP. We will add them all to a NumPy array called image_arr.
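A sketch of that loop follows. `embed_images` is a hypothetical helper, and it processes the images in small batches rather than strictly one at a time, which is a design choice of this sketch. It assumes an initialized CLIP model and processor, and a transformers release in which get_image_features returns a plain tensor.

```python
import numpy as np


def embed_images(model, processor, images, batch_size=16):
    """Return an (n_images, dim) NumPy array of CLIP image embeddings."""
    chunks = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        # Resize and normalize the images into a pixel_values tensor on the model's device
        inputs = processor(images=batch, return_tensors="pt").to(model.device)
        emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        # Gradients are not needed for search, so detach before converting to NumPy
        chunks.append(emb.detach().cpu().numpy())
    return np.concatenate(chunks, axis=0)


# Usage, with `images` being a list of PIL images:
# image_arr = embed_images(model, processor, images)
# image_arr.shape
```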
The minimum and maximum values in our image embeddings are -7.99 and +3.15 respectively, so the vectors are clearly not unit length. We will be using dot product similarity to compare our vectors. For the dot product to reflect the direction of the vectors rather than their magnitude (which makes it equivalent to cosine similarity), we need to normalize each vector to unit length. We do that like so:
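Here is a minimal sketch of the normalization, using toy two-dimensional vectors in place of real embeddings to show why it matters. `l2_normalize` is a hypothetical helper name.

```python
import numpy as np


def l2_normalize(arr):
    """Scale every row of `arr` to unit length."""
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    return arr / norms


# Toy embeddings: row 0 is long but poorly aligned with the query, row 2 matches it exactly
image_arr = np.array([[3.0, 4.0], [0.6, 0.8], [1.0, 0.0]])
query = np.array([1.0, 0.0])

raw_scores = image_arr @ query                  # the longest vector wins
unit_scores = l2_normalize(image_arr) @ query   # the best-aligned vector wins

print(int(np.argmax(raw_scores)), int(np.argmax(unit_scores)))
```

Before normalization the long vector in row 0 scores highest purely because of its magnitude. After normalization the top score goes to row 2, the vector that actually points in the same direction as the query.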
Now we’re ready to compare and search through our vectors.
Text-Image Search
As mentioned, we will be using dot product to compare vectors. The text embedding will act as a “query” with which we will search for the most similar image embeddings.
We start by calculating the dot product similarity between our query and the images:
This gives us 100 scores, i.e., one score for each pairing of the text embedding with an image embedding. All we do now is sort these scores in descending order and return the respective top-scoring images.
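The scoring and sorting steps can be sketched as follows. `top_k` is a hypothetical helper, and the toy vectors stand in for the normalized text and image embeddings.

```python
import numpy as np


def top_k(query_emb, image_arr, k=5):
    """Return (index, score) pairs for the k highest dot-product scores."""
    query = np.ravel(query_emb)        # accept shape (dim,) or (1, dim)
    scores = image_arr @ query         # one score per image
    order = np.argsort(-scores)[:k]    # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]


# Toy example with three "image" vectors and one "text" query
image_arr = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query_emb = np.array([[0.0, 1.0]])

results = top_k(query_emb, image_arr, k=2)
print(results)
```

The returned indices point back into the list of sampled images, so the top-scoring images can be displayed in order.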
At position #1, we have a dog in the snow, a great result! This is very likely the only image of a dog in the snow from our sample of 100 images. That is how we perform a text-image search using CLIP.
CLIP is an amazing model that can be applied across the language-image domains in any order or combination. We can perform text-text, image-image, and image-text searches using the same methodology.
In fact, we can do all of those simultaneously by simply adding both image and text vectors to a single store and then querying with either image or text.
Our approach works well when searching through a small number of items. However, it becomes slow or even impossible as the number of records grows. To scale this to millions or even billions of records, we need a vector database.
If you’re interested in learning more about multi-modal models, NLP, or vector search, check out my YouTube channel, reach out on Discord, or follow along with one of my free courses (links below).