Zero-Shot Learning
A year ago, I first heard about Zero-Shot learning and searched the Internet to find out more about it. Unfortunately, I could not find any material that explained the topic plainly and simply, because it was a fairly new research topic. There were mostly research papers focusing on technical aspects and only a couple of brief explanations. Not much has changed since then. So, after a year of working on a related project, I’ve decided to share my experience and knowledge here instead of keeping it to myself.
What is Zero-Shot Learning?
A Zero-Shot learning method aims to solve a task without receiving any example of that task during the training phase. Recognizing an object in a given image when no images of that object were available during training is an example of a Zero-Shot learning task. Put simply, it allows us to recognize objects we have not seen before.
Why do we need Zero-Shot Learning?
In the conventional object recognition process, we must fix a set of object classes in advance in order to recognize objects with a high success rate. We must also collect as many sample images as possible for the selected classes. Of course, to be comprehensive, these sample images should show the objects from diverse angles and in various contexts/environments. Although there are lots of object classes for which we can effortlessly gather sample images, there are also cases where we are not so lucky.
Imagine that we want to recognize animals that are on the edge of extinction or live in extreme environments (in the depths of the ocean/jungle or on hard-to-reach mountain peaks) that humans cannot visit whenever they wish. It is not easy to collect sample images of these sorts of animals. Even if you manage to collect enough images, remember that the images should not be too similar to each other; they should be as varied as possible. You need to make a lot of effort to achieve that.

In addition to the difficulty of recognizing different object classes with a limited number of images, some object classes cannot be labeled by ordinary people. In some cases, labeling can only be done by someone who has truly mastered the subject, or in the presence of an expert. Fine-grained object recognition tasks, such as recognizing fish species or tree species, are examples where labeling requires the supervision of an expert. An ordinary person will label every tree she/he sees as “tree” and every fish she/he sees as “fish”. These answers are obviously correct, but imagine that you want to train a network to recognize tree or fish species. In that case, these correct answers are useless, and you need an expert to help you with the labeling task. Again, you need to make a lot of effort to achieve that.

The research paper titled Fine-Grained Object Recognition and Zero-Shot Learning in Remote Sensing Imagery is one of the interesting practical studies on the subject. In the paper, trees are recognized and classified into species using only their aerial or satellite images, which are hard to make sense of but easy to collect compared to walking around a huge area to photograph and label trees.
Let’s Get Started
Now that we have covered what Zero-Shot learning is, let’s implement a Zero-Shot learning model step by step. But before we do that, let’s elaborate on our approach.
Approach
We have training and zero-shot classes. Remember that no samples from zero-shot classes will be used during training. Then how on earth will a model trained on training objects recognize zero-shot objects? In simple terms, how is it possible to recognize objects that have never been seen before?
As we all know, to be able to apply any machine learning technique, we should represent data with reasonable features. Here we use two data representations, one of which plays an auxiliary role: the image embedding, and the class embedding as the auxiliary representation.
Image embedding is nothing special. It is a feature vector extracted from an image using a convolutional network. The convolutional network can be implemented from scratch, or a pre-trained convolutional network that has already proven its success can be used. We will use a pre-trained convolutional model, VGG16, for the image feature extraction process.
Remember that we have training and zero-shot classes. We collect image samples for the training classes and, naturally, we can get image embeddings for all these image samples. However, we don’t have any image samples for the zero-shot classes (we don’t know what they look like), so it is not possible to get image embeddings for them. This is where the zero-shot learning method differs from traditional methods. At this point, we need another data representation that will function as a bridge between training and zero-shot classes. This representation should be obtainable for every data sample, regardless of whether it belongs to a training class or a zero-shot class. Because of that, instead of focusing on the image itself, we should focus on the class label, which is a common property of all data samples.
Class embedding is the vector representation of a class (class label). It is a representation that we can easily access for each class of objects, alongside their image representations. We will use Google’s Word2Vec vectors as class embeddings, which will allow us to represent words (class labels) as vectors. In Word2Vec space, two vectors are likely to be positioned close together if the two words they represent tend to appear together in the same documents or are semantically related.

In the example figure above, it can easily be observed that the vectors of classes/words related to edible objects (indicated with white and turquoise boxes) tend to lie close together. However, they tend to lie far from the vectors of classes/words related to body parts (indicated with bright green boxes).
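To make "positioned closely" concrete, here is a minimal sketch that compares vectors with cosine similarity, a common way to measure closeness between word vectors. The three-dimensional vectors and the helper name cosine_similarity are made up for illustration; they are not real Word2Vec values, which typically have 300 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-d vectors for illustration only (assumption, not Word2Vec output).
toy_vectors = {
    "apple":  [0.9, 0.8, 0.1],
    "banana": [0.8, 0.9, 0.2],
    "hand":   [0.1, 0.2, 0.9],
}

# Two food words versus a food word and a body part.
sim_food = cosine_similarity(toy_vectors["apple"], toy_vectors["banana"])
sim_mixed = cosine_similarity(toy_vectors["apple"], toy_vectors["hand"])
print(f"apple vs banana: {sim_food:.2f}")
print(f"apple vs hand:   {sim_mixed:.2f}")
```

With these toy values, the two food words score much higher than the food word and the body part, which is the kind of grouping described above.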

To summarize, for training classes, we have both their image samples and class labels; therefore we have both their image embeddings and class embeddings. However, for zero-shot classes, we only have their class labels (we have never seen any image sample); therefore we only have their class embeddings. This can be seen much more clearly in the figure on the left side.
At the end of the day, what we want to do is simply this: for the training classes, we will use the image embeddings (image feature vectors) and their related class embeddings (Word2Vec vectors). This way, the network will basically learn how to map a given input image to a vector located in the Word2Vec space. After training is done, when an image of an object belonging to one of the zero-shot classes is given to the network, we will obtain a vector as output. Then, by measuring the distance from this output vector to all the class vectors that we have (both training and zero-shot), we will be able to perform classification.
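The idea can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the image and class embeddings are random stand-ins (not VGG16 features or Word2Vec vectors), the dimensions are arbitrary, and a plain least-squares linear map is swapped in for the neural network, purely to keep the example short. All names (class_vectors, hidden_map, sample_images) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train_classes, n_zero_shot_classes = 15, 5
word_dim, image_dim, images_per_class = 5, 20, 30

# Synthetic stand-ins (assumption): one "class embedding" per class, and
# "image embeddings" generated as a noisy linear function of it.
class_vectors = rng.normal(size=(n_train_classes + n_zero_shot_classes, word_dim))
hidden_map = rng.normal(size=(word_dim, image_dim))

def sample_images(class_ids):
    """Return (image_embeddings, labels) for the given class indices."""
    labels = np.repeat(class_ids, images_per_class)
    features = class_vectors[labels] @ hidden_map
    features += 0.05 * rng.normal(size=features.shape)
    return features, labels

train_ids = np.arange(n_train_classes)
zero_shot_ids = np.arange(n_train_classes, n_train_classes + n_zero_shot_classes)

# Training: fit a linear map from image space to class-embedding space,
# using images of the training classes only.
x_train, y_train = sample_images(train_ids)
weights, *_ = np.linalg.lstsq(x_train, class_vectors[y_train], rcond=None)

# Inference: map unseen-class images into class-embedding space, then pick
# the nearest class vector among ALL classes (training and zero-shot).
x_test, y_test = sample_images(zero_shot_ids)
predicted_vectors = x_test @ weights
distances = np.linalg.norm(
    predicted_vectors[:, None, :] - class_vectors[None, :, :], axis=2
)
predicted_labels = distances.argmin(axis=1)
zero_shot_accuracy = float((predicted_labels == y_test).mean())
print(f"zero-shot accuracy on synthetic data: {zero_shot_accuracy:.2f}")
```

The synthetic setup is deliberately easy, so the number it prints says nothing about real images. The point is the shape of the pipeline: train on seen classes only, then classify by distance in the class-embedding space.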
Data Collection
As a first job, we need to collect the image data required both during the training phase and at the evaluation phase, where we measure the Zero-Shot performance after training. I collected data from Visual Genome and decided to use 20 classes in total: 15 classes selected for training and 5 classes selected as Zero-Shot classes.
Then, we should determine which object classes are to be selected as training classes and which as Zero-Shot classes. For ease of illustration, it is much more suitable to recognize everyday objects than to form and select proper classes for a fine-grained object recognition task.
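As a rough sketch of the 15/5 split, the snippet below separates labeled samples into a training set and a zero-shot set and checks that the two class lists do not overlap. The class names, file names, and the helper split_samples are placeholders invented for illustration; they are not the classes or files actually used.

```python
# Placeholder class names (assumption); substitute the real 20 labels.
all_classes = [f"class_{i:02d}" for i in range(20)]
train_classes = all_classes[:15]
zero_shot_classes = all_classes[15:]

def split_samples(samples, train_classes, zero_shot_classes):
    """Split (image_path, label) pairs into training and zero-shot lists."""
    train_set, zero_set = set(train_classes), set(zero_shot_classes)
    # No class may appear on both sides, otherwise it is not zero-shot.
    if not train_set.isdisjoint(zero_set):
        raise ValueError("training and zero-shot classes overlap")
    train = [s for s in samples if s[1] in train_set]
    zero_shot = [s for s in samples if s[1] in zero_set]
    return train, zero_shot

# Hypothetical samples: two images per class.
samples = [(f"img_{i}.jpg", all_classes[i % 20]) for i in range(40)]
train_samples, zero_shot_samples = split_samples(
    samples, train_classes, zero_shot_classes
)
print(len(train_samples), len(zero_shot_samples))  # prints: 30 10
```

Zero-shot samples are kept aside entirely and used only for evaluation after training.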












