Applications of Zero-Shot Learning

As a member of a research group involved in computer vision, I wanted to write this short article to briefly present what we call “Zero-shot learning” (ZSL), an interesting variant of transfer learning, and the current research related to it.

Today, many machine learning methods focus on classifying instances whose classes have already been seen in training. In practice, however, many applications require classifying instances whose classes have not been seen before. Zero-shot learning is a promising learning method in which the classes covered by training instances and the classes we aim to classify are disjoint. In other words, zero-shot learning is about leveraging supervised learning with no additional training data.

Zero-shot learning refers to a specific use case of machine learning (and therefore deep learning) where you want the model to classify data with no labeled examples of the target classes (or, in the closely related few-shot setting, only a handful), which means classifying on the fly.

Let’s think about how Convolutional Neural Networks (CNNs) work: they break down a general task such as image recognition into a sequence of smaller tasks carried out by successive layers, where each layer works on increasingly complex features.

When we train a network to recognize a given subject in a picture, for instance a human, we have also implicitly trained it to recognize arms, legs, faces, etc. Thanks to this, we can reuse those feature detectors and rearrange them to perform some other task without additional training.

In other words, zero-shot learning is about taking deep networks that were already trained with supervised learning and using them in new ways, without additional supervised training.

Zero-shot learning could yield extremely interesting applications, especially where we lack proper datasets. As you may know, the lack of data is a huge issue in almost all computer vision projects. If I had to sum up ZSL in a few words, I’d say that it is:

Pattern recognition without training examples
Based on semantic transfer

Natural Scarcity of Data

Zero-shot learning is an ability that humans already have. Indeed, we can learn a lot of things from just a “minimal dataset”. For instance, you can tell apart different varieties of the same fruit (fine-grained classification), or distinguish a fruit from other similar-looking fruits (regular classification), after seeing only a few pictures of each type. The situation is different for machines: they need a lot of images to learn to adapt to the variance that occurs naturally.

This natural ability comes from our existing language knowledge base, which provides a high-level description of a new or unseen class and makes a connection between it and seen classes and visual concepts.

Why do we need Zero-Shot Learning?

As you may know, there is a large and growing number of categories in many domains. As a consequence, it is difficult to collect a lot of annotated data per category.

In some projects, the number of classes can be in the thousands, and obtaining sufficient training data for each class is complex. Zero-shot learning aims at predicting a large number of unseen classes using only labeled data from a small set of classes and external knowledge about class relations. Moreover, the number of categories keeps increasing, and so does the difficulty of collecting new data for each new category. This is especially true in deep learning, where you need a lot of data.

Different varieties of the same object can quickly become a nightmare and unsupervised learning can’t be applied to help in this situation.

Furthermore, in a normal object recognition process, we have to fix a set of object classes in advance and collect as many sample images as possible for each of them in order to reach good accuracy. Moreover, these sample images should show the objects from different angles and in various environments in order to enrich the dataset.

In some cases, labeling can only be achieved by an expert. Fine-grained object recognition tasks like recognition of specific species can be considered as examples of labeling under the supervision of an expert.

There is an increasing interest in machine ZSL for scaling up visual recognition.

How does it work?

Without getting too much into the details, zero-shot learning relies on a labeled training set of seen classes, together with knowledge about how each unseen class relates to the seen ones. Both seen and unseen classes are represented in a high-dimensional vector space, called the semantic space, where the knowledge from seen classes can be transferred to unseen classes.
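To make the idea of a semantic space more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the four attributes, the class vectors, and the image embedding are invented, and a real system would learn the projection from images into this space instead of hard-coding it.

```python
import math

# Hypothetical semantic space. Each class is a vector over four made-up
# attributes: (stripes, four legs, mane, lives in water).
CLASS_VECTORS = {
    "horse":   [0.0, 1.0, 1.0, 0.0],  # seen: labeled images exist
    "tiger":   [1.0, 1.0, 0.0, 0.0],  # seen: labeled images exist
    "zebra":   [1.0, 1.0, 1.0, 0.0],  # unseen: described, never pictured
    "dolphin": [0.0, 0.0, 0.0, 1.0],  # unseen: described, never pictured
}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_class(embedding, candidates):
    """Return the candidate class whose vector is most similar to the embedding."""
    return max(candidates, key=lambda name: cosine(embedding, CLASS_VECTORS[name]))

# Assumed output of a model, trained on horses and tigers only, that projects
# a photo of a zebra into the semantic space.
image_embedding = [0.9, 0.8, 0.7, 0.1]

print(nearest_class(image_embedding, ["zebra", "dolphin"]))  # prints: zebra
```

The point of the sketch is that "zebra" can be picked without a single zebra image in training, simply because its description lies close to where the image lands in the shared space.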

Zero-shot learning approaches are designed to learn an intermediate semantic layer (the attributes) and apply it at inference time to predict a new class of data.

Usually, zero-shot learning algorithms first map instances to intermediate attributes, which can be seen classes (those with labeled data), human-specified attributes, or data-dependent attributes. Then the predicted attributes are mapped to a large number of unseen classes through a knowledge base. In this way, the prediction of unseen classes becomes possible and no training data is required for those classes.
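The two mappings described above can be sketched as follows. This is a toy illustration, not a reference implementation: the two-dimensional features, the class names, the attribute tables, and the very simple threshold detectors are all assumptions chosen to keep the example short.

```python
ATTRIBUTE_NAMES = ("stripes", "mane")

# Knowledge base: attribute signature of every class (hypothetical values).
SEEN_ATTRIBUTES = {"horse": (0, 1), "tiger": (1, 0), "deer": (0, 0)}
UNSEEN_ATTRIBUTES = {"zebra": (1, 1), "puma": (0, 0)}

# Toy labeled data for the seen classes only: (feature vector, class name).
TRAIN = [
    ([0.1, 0.9], "horse"), ([0.2, 0.8], "horse"),
    ([0.9, 0.1], "tiger"), ([0.8, 0.2], "tiger"),
    ([0.1, 0.1], "deer"),  ([0.2, 0.2], "deer"),
]

def fit_attribute_detector(samples, labels):
    """Fit a one-feature threshold rule for a single binary attribute."""
    pos = [x for x, y in zip(samples, labels) if y == 1]
    neg = [x for x, y in zip(samples, labels) if y == 0]
    best = None
    for d in range(len(samples[0])):
        mean_pos = sum(x[d] for x in pos) / len(pos)
        mean_neg = sum(x[d] for x in neg) / len(neg)
        gap = mean_pos - mean_neg
        # Keep the feature that separates positives from negatives the most.
        if best is None or abs(gap) > abs(best[2]):
            best = (d, (mean_pos + mean_neg) / 2, gap)
    dim, threshold, gap = best
    sign = 1 if gap > 0 else -1
    return lambda x: 1 if sign * (x[dim] - threshold) > 0 else 0

# Stage 1: learn one detector per attribute, using seen classes only.
features = [x for x, _ in TRAIN]
detectors = [
    fit_attribute_detector(features, [SEEN_ATTRIBUTES[c][m] for _, c in TRAIN])
    for m in range(len(ATTRIBUTE_NAMES))
]

def predict_unseen(x, detectors, knowledge_base):
    """Stage 2: predict attributes, then pick the closest class signature."""
    predicted = tuple(det(x) for det in detectors)
    def mismatches(name):
        return sum(p != a for p, a in zip(predicted, knowledge_base[name]))
    return min(knowledge_base, key=mismatches)

# A striped animal with a mane: no seen class has this combination.
print(predict_unseen([0.9, 0.8], detectors, UNSEEN_ATTRIBUTES))  # prints: zebra
```

Note that the toy seen classes are chosen so that stripes and mane do not always change together in the training data; with only horses and tigers, the two attributes would be impossible to tell apart.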

Zero-shot learning is a two-stage process: training and inference. In the training stage, knowledge about the attributes is captured, and in the inference stage, this knowledge is used to categorise instances among a new set of classes. Much effort seems to have gone into improving the training stage, whereas the inference stage has received little attention. For example, many approaches are incapable of fully exploiting the discriminative capacity of attributes, and cannot harness the uncertainty of the attribute prediction obtained in the first stage.
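As an illustration of what "harnessing the uncertainty" could look like at inference time, the sketch below compares two ways of using the same attribute predictions. It assumes the first stage outputs a probability per attribute and, as a simplification, treats attributes as independent. The class names, signatures, and probabilities are invented for the example.

```python
import math

# Hypothetical unseen classes with signatures over (stripes, mane, spots).
UNSEEN_CLASSES = {"zebra": (1, 1, 0), "puma": (0, 0, 0)}

def hard_mismatches(probs, signature):
    """Threshold each probability at 0.5, then count disagreeing attributes."""
    hard = [1 if p >= 0.5 else 0 for p in probs]
    return sum(h != a for h, a in zip(hard, signature))

def log_score(probs, signature, eps=1e-9):
    """Sum of log-probabilities that each attribute takes the class's value.

    Equivalent to the log of a product of per-attribute probabilities,
    under the (simplifying) assumption that attributes are independent.
    """
    total = 0.0
    for p, a in zip(probs, signature):
        total += math.log(max(p if a == 1 else 1.0 - p, eps))
    return total

def predict(probs, classes):
    """Pick the class with the highest probabilistic score."""
    return max(classes, key=lambda name: log_score(probs, classes[name]))

# Stage-1 output for one image: unsure about stripes, confident about
# the mane, confident that there are no spots.
attribute_probs = [0.45, 0.90, 0.10]

# Hard decisions lose the confidence information: both classes are one
# attribute away from the thresholded prediction (0, 1, 0).
print({name: hard_mismatches(attribute_probs, sig)
       for name, sig in UNSEEN_CLASSES.items()})  # prints: {'zebra': 1, 'puma': 1}

# Keeping the probabilities breaks the tie in favour of the confident attribute.
print(predict(attribute_probs, UNSEEN_CLASSES))  # prints: zebra
```

With hard decisions the two candidates are indistinguishable, while the probabilistic score prefers the class that agrees with the attribute the model is most sure about.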

Research

From a research perspective, I have seen teams working on a more accurate ZSL model that uses neural network architectures called generative adversarial networks (GANs) to read and analyze text from the web, and then visually identify the objects it describes. This new approach enables systems to classify objects based on category, and then use that information to identify other similar objects.

Another important issue that this research addresses is bias. Indeed, the collection and labeling of training data can be very time-consuming, and because it remains difficult to gather enough statistically diverse training images, unlabeled target classes (i.e. images or objects that have not been seen before) are often misclassified as labeled source classes, which results in poor accuracy in generalized settings.
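To show the kind of bias at stake, here is a small sketch of one simple mitigation: subtracting a constant from the scores of seen classes before picking the winner. The scores, class names, and the value of the constant are hypothetical; in practice the constant would have to be tuned on held-out data, and this is only one possible way to handle the problem.

```python
SEEN = {"horse", "tiger"}

def predict_generalized(scores, seen_classes, calibration=0.0):
    """Pick the best class after penalizing seen classes by a constant."""
    adjusted = {
        name: (score - calibration if name in seen_classes else score)
        for name, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)

# Hypothetical compatibility scores for a photo of a zebra. The model has
# only ever been trained on seen classes, so it tends to over-score them.
scores = {"horse": 0.62, "tiger": 0.20, "zebra": 0.55}

print(predict_generalized(scores, SEEN))                   # prints: horse
print(predict_generalized(scores, SEEN, calibration=0.1))  # prints: zebra
```

A larger constant pushes more predictions toward unseen classes, so there is a trade-off between accuracy on seen and unseen classes.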

When there are few training images available, existing object recognition models struggle to make correct predictions, and ZSL was developed principally as a means to fight this problem.

Thanks to our research, we managed to build a prototype that can recognize species by analyzing related web articles. Looking at only those text descriptions (without seeing an image of the species) the system extracts key features, such as the shape of the animal’s head. The system can then somehow imagine what the species looks like, generating a synthetic visual model.

It is important to say that the result of image and text understanding doesn’t eliminate the need for training, but it’s an example of how ZSL can reduce training and help systems be accurate when confronted with unexpected data.

As ZSL continues to develop, I expect to see more applications such as better recommendations and more advanced solutions that automatically flag bad content within categories on social media. I also envision a strong development of ZSL in the robotics field.

The zero-shot learning method is similar to human vision in many ways, so it is a natural fit for robot vision. Instead of performing recognition on a limited set of objects, zero-shot learning makes it possible, in principle, to recognize any object that can be described.

I have no doubt that ZSL could help transition AI away from today’s limited applications and toward the kind of versatility that’s so natural for humans.

For more information, I recommend this video: https://www.youtube.com/watch?v=jBnCcr-3bXc&t=626s