A model that smokes everyone else
I didn’t pay too much attention when I heard about CLIP, a new neural network that learns visual concepts from natural language supervision. At some point, however, my social media feed was all about it. I mostly have to blame this guy. I had to look into it 👀.
It immediately felt like Christmas in March. (CLIP was released in January, but I was late to the party.) Here you had this new technique that kicked everyone’s butts with a zero-shot approach!
Let’s try to unpack this a little bit.
What in the world is “zero-shot”?
If your model can predict classes that it didn’t see during training, you have a zero-shot-capable model.
For example, you might have heard of ImageNet, a 14+ million image dataset organized in more than 20,000 categories. On the standard 1,000-class ImageNet benchmark, CLIP classifies images with 76.2% accuracy without training on that dataset or its labels.
A specific example, straight from Wikipedia:
(…), given a set of images of animals to be classified, along with auxiliary textual descriptions of what animals look like, an AI which has been trained to recognize horses, but has never seen a zebra, can still recognize a zebra if it also knows that zebras look like striped horses.
Mic-drop, mind-blown moment. Take a minute and try to appreciate this.
Think about this: zero-shot capabilities suggest that the model is learning to relate visual concepts to categories at a much deeper level than what we have seen before.
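To make the zebra example a bit more concrete, here is a minimal sketch of how zero-shot classification can work once images and text descriptions live in the same embedding space: score each candidate description against the image by cosine similarity and pick the best one. Everything specific here is an assumption for illustration. The 3-number embeddings are invented by hand (a real model would produce them with its image and text encoders), the helper names are hypothetical, and the `scale` of 100 is just a temperature-like factor I picked. This is not CLIP's actual code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, scale=100.0):
    """Score every candidate description against the image.

    `scale` is an assumed temperature-like factor that sharpens the softmax.
    """
    labels = list(text_embeddings)
    logits = [scale * cosine(image_embedding, text_embeddings[label])
              for label in labels]
    return dict(zip(labels, softmax(logits)))


# Hand-made toy embeddings (NOT real model output).
# Axes, loosely: [horse-like shape, stripes, dog-like features].
TEXT_EMBEDDINGS = {
    "a photo of a horse": [1.0, 0.0, 0.1],
    "a photo of a zebra": [1.0, 1.0, 0.1],  # "a striped horse"
    "a photo of a dog": [0.2, 0.0, 1.0],
}

# Pretend this came out of the image encoder for a zebra picture.
zebra_image = [0.9, 0.8, 0.0]

probs = zero_shot_classify(zebra_image, TEXT_EMBEDDINGS)
best_label = max(probs, key=probs.get)
print(best_label)
```

Notice that nothing here was trained on a "zebra" class. The label wins only because its text description lands close to the image in the shared space, so adding a new class is as cheap as writing a new description.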
This changes the game, and here is why
There are three key advantages of CLIP over existing supervised techniques:
- Putting together a good dataset and labeling it is a pain in the rear end. We don’t need a task-specific labeled dataset with CLIP, and I can’t describe my happiness because of it.
- Even if we collect and label a good dataset, existing models don’t generalize very well outside of that. That’s not the case with CLIP, which we can use for all sorts of tasks unrelated to a specific dataset.
- And the cherry 🍒 on top is that CLIP’s real-world performance is consistent with its performance in vision benchmarks. Just in case you didn’t know, most of the current deep learning models do much better with toy problems than out in the wild. This sucks, but CLIP takes care of it.
This is a big deal! Not in the “oh-wow-we-just-discovered-something-that-will-be-useful-someday” sense, but more in the “holy-crap-we-can-use-this-now-and-it’s-awesome” way.
Well, this is all dandy. Now what?
Following their steps with GPT-2, OpenAI didn’t publish the full model, but just a smaller version. They also warned about using this in production, citing the need for more specific tests and potential bias in the model.
This hasn’t stopped a lot of cool experiments with CLIP. Vladimir put together a notebook to do image searches on the Unsplash dataset. And here is another notebook for generating images using CLIP and BigGAN courtesy of Ryan Murdock.
But of course, cool samples aren’t real applications. I’m sure we’ll start seeing them pop up in the coming weeks and months. Even the smaller CLIP version is powerful enough to be valuable, and I can’t wait to see what people do with it.
In the meantime, I took the model, containerized it, and put a RESTful API around it so you can deploy it and use it anywhere. You can give it online images and ask it to select the label that best represents each of them. Even this smaller version is pretty impressive!
Too long; didn’t read.
CLIP is new. CLIP is awesome. CLIP is mind-blowing stuff.
Hordes of state-of-the-art computer vision models now seem archaic thanks to CLIP. Can’t wait to see where this goes and what people build with it.
The future of computer vision is bright.
“A model that smokes everyone else” was originally posted in Issue #2 of underfitted.io.






