Salvatore Raieli

Summary

Researchers from New York University and the University of Maryland have developed a method to visualize and understand the inner workings of Vision Transformers (ViTs), revealing their ability to effectively utilize background information and demonstrating similarities and differences with Convolutional Neural Networks (CNNs) in feature representation and learning.

Abstract

The article discusses the advancements in understanding Vision Transformers (ViTs), which have become prominent in computer vision. Unlike CNNs, ViTs are based on self-attention mechanisms and have been challenging to interpret due to their complexity. The researchers have introduced a novel approach to visualize what ViTs learn by focusing on the feed-forward layer rather than the self-attention layer. This method has unveiled that ViTs preserve spatial information and exhibit progressive specialization, where simpler structures are learned in earlier layers, and more sophisticated patterns are developed in deeper layers. The study also highlights that ViTs make efficient use of background information in images, which is a departure from CNNs. Furthermore, the research extends to models like CLIP, which are trained with language supervision, showing that ViTs can learn semantic and conceptual features. The findings suggest that ViTs' success may be partly attributed to their superior handling of background-related information and their ability to learn spatial relationships during training.

Opinions

  • The authors express surprise at how ViTs maintain local representations despite each patch being able to influence the representation of every other patch, indicating that ViTs learn to preserve spatial information without the inductive bias present in CNNs.
  • The last layer of ViTs is noted to have a uniform activation pattern and is primarily responsible for globalizing information and classifying the image, unlike earlier layers that focus on spatial relationships.
  • The authors suggest that the CLS token in ViTs plays a minor role throughout the network, becoming significant only in the last layer for globalization of information.
  • The research indicates that ViTs trained with language model supervision, such as CLIP, learn more semantic and conceptual features, moving beyond object-specific visual features.
  • The paper emphasizes the importance of interpretability in models, noting that while methods exist for CNNs, visualizing features of ViTs was previously not possible, which has now been addressed by the authors' approach.
  • The authors provide code and resources for the community to further explore and understand machine learning and artificial intelligence, demonstrating a commitment to open science and collaboration.

A Visual Journey in What Vision-Transformers See

How some of the largest models see the world

image from the original article: source

Visualizing CNNs taught us a great deal about how these models work. Now that Vision Transformers are taking the stage, a new article explains how we can see what these large models see.

Visualize the vision transformers

image from the original article: source

Since convolutional neural networks (CNNs) emerged as a winning model in computer vision, different research groups have focused on understanding what these models learn.

Neural networks have spread to several fields (from language analysis to computer vision) but are still considered “black boxes.” Compared with many other algorithms, they are much more difficult to interpret. In fact, the more capable the models become (as the number of parameters grows), the more difficult it becomes to understand what is going on inside.

Therefore, several methods have been developed to visualize what a convolutional neural network learns. Some of the most widely used:

  • Visualizing the filters (that is, the weights).
  • Visualizing layer activations.
  • Retrieving an image that maximally activates a neuron.
  • Embedding the feature vectors with t-SNE.
  • GradCAM and saliency maps.
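
As a rough illustration of the third technique in the list above, activation maximization can be sketched as gradient ascent on the input. This is a toy under stated assumptions: the "neuron" is a single linear unit with hypothetical random weights, so its gradient is known in closed form, whereas a real network would need automatic differentiation and usually extra regularization. The names `maximize_activation` and `top_k_examples` are illustrative and do not come from any library.

```python
import numpy as np

def maximize_activation(weight, steps=200, lr=0.1, seed=0):
    """Gradient ascent on the input of a toy linear neuron a(x) = weight . x.

    The input is re-projected onto the unit sphere after every step so the
    activation cannot grow simply by scaling the input up.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=weight.shape)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        grad = weight            # d(weight . x)/dx, known in closed form here
        x = x + lr * grad        # step uphill on the activation
        x /= np.linalg.norm(x)   # keep the input at unit norm
    return x

def top_k_examples(dataset, weight, k=3):
    """Indices of the k dataset rows with the highest activation, best first."""
    activations = dataset @ weight
    return np.argsort(activations)[::-1][:k]

rng = np.random.default_rng(1)
weight = rng.normal(size=64)           # hypothetical neuron weights
dataset = rng.normal(size=(100, 64))   # hypothetical flattened "images"

optimized = maximize_activation(weight)   # synthetic input that excites the neuron
best = top_k_examples(dataset, weight)    # dataset examples that excite it most
```

The two functions mirror the two views used throughout the figures in this article: an image optimized to activate a feature, and the dataset examples that activate it most.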

In 2017, transformers appeared on the scene. These large models based on self-attention achieved far better performance in NLP (machine translation, language classification, and so on). Soon, they became the standard for NLP, and with the introduction of vision transformers, they were also applied to computer vision.

from the original transformer article: here

Therefore different researchers have tried to visualize what vision transformers (ViTs) learn. ViTs have proven to be much more difficult to analyze, and so far, the methods used have shown limitations. Understanding the inner workings of these models could be helpful in explaining their success and potential corner cases.

Previous work had focused on observing the activations of keys, queries, and values from the self-attention layer, but the resulting visualizations were not interpretable.

Visualizing the self-attention weights does not lead to insightful visualizations. caption and image from the original article: source

Researchers at New York University and the University of Maryland have recently published a paper that provides a better understanding of what happens inside these models (whether plain vision transformers or models such as CLIP).

In the article, the researchers summarize their contribution:

  1. While standard methods lead to uninterpretable results (especially when applied to keys, queries, and values), it is possible to obtain informative visualizations by applying the same techniques to the next feed-forward layer of the same transformer block (and they demonstrated this using different models: ViTs, DeiT, CoaT, ConViT, PiT, Swin, and Twin transformers).
  2. Patch-wise image activation patterns for ViT features behave like saliency maps, demonstrating that the model preserves positional relationships between patches (and learns this during training).
  3. CNNs and ViTs both construct a complex and progressive representation (in CNNs, the first layers represent edges and textures, while later layers learn more complex patterns, and the authors show that the same happens in ViTs). ViTs, in contrast to CNNs, are better able to use background information.
  4. The authors also applied their method to models using language supervision (such as CLIP) and showed that features could be extracted from these models that are associable with caption text (such as prepositions, adjectives, and conceptual categories).
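
To make the first two points more concrete, here is a minimal sketch of how a patch-wise activation map could be read out of a feed-forward layer. It rests on several assumptions that are not spelled out in this article: a "feature" is taken to be one hidden channel after the first linear layer and GELU, the first token is the CLS token, and the patches form a square grid. The sizes, the random weights, and the name `patch_activation_map` are hypothetical.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def patch_activation_map(tokens, w1, b1, channel):
    """Patch-wise activation map for one feed-forward channel.

    tokens: (1 + num_patches, dim) array; row 0 is assumed to be the CLS token.
    w1, b1: weights and bias of the first feed-forward linear layer.
    """
    hidden = gelu(tokens @ w1 + b1)          # (1 + num_patches, hidden_dim)
    per_patch = hidden[1:, channel]          # drop CLS, keep one channel
    grid = int(np.sqrt(per_patch.shape[0]))  # assumes a square patch grid
    return per_patch.reshape(grid, grid)

# Toy sizes: a 224x224 image cut into 32x32 patches gives 7 x 7 = 49 patches.
rng = np.random.default_rng(0)
num_patches, dim, hidden_dim = 49, 16, 64
tokens = rng.normal(size=(1 + num_patches, dim))         # stand-in for real token embeddings
w1 = rng.normal(size=(dim, hidden_dim)) / np.sqrt(dim)   # stand-in for trained weights
b1 = np.zeros(hidden_dim)

act_map = patch_activation_map(tokens, w1, b1, channel=5)
objective = act_map.sum()   # one plausible scalar for an optimizer to maximize
```

Because the feed-forward layer is applied to each token independently, every entry of `act_map` corresponds to exactly one image patch, which is what lets the map be read like a saliency map.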

The authors compared ViTs to convolutional networks and noted that the representation increases in complexity with depth (earlier layers learn simpler structures, while more sophisticated patterns are learned by deeper layers). In practice, both CNNs and ViTs share what is called progressive specialization.

“The progression for visualized features of ViT B-32. Features from early layers capture general edges and textures. Moving into deeper layers, features evolve to capture more specialized image components and finally concrete objects.” caption and image from the original article: source
“Complexity of features vs depth in ViT B-32. Visualizations suggest that ViTs are similar to CNNs in that they show a feature progression from textures to parts to objects as we progress from shallow to deep features.” caption and image from the original article: source

There are also differences. The authors investigated the reliance of ViTs and CNNs on background and foreground image features (using bounding boxes on ImageNet). ViTs are able to detect background information present in the image (in the image, for example, grass and snow). In addition, by masking the background or foreground in the image the researchers showed that ViTs not only use the background information better but are also less affected by its removal.
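
The masking experiment described above can be sketched in a few lines. This is only an illustration of the idea, not the authors' code: the box format (x0, y0, x1, y1, end-exclusive pixel coordinates), the zero fill value, and the function names are assumptions.

```python
import numpy as np

def mask_foreground(image, box, fill=0):
    """Return a copy of image with the inside of the bounding box filled."""
    x0, y0, x1, y1 = box
    out = image.copy()
    out[y0:y1, x0:x1] = fill
    return out

def mask_background(image, box, fill=0):
    """Return a copy of image with everything outside the bounding box filled."""
    x0, y0, x1, y1 = box
    out = np.full_like(image, fill)
    out[y0:y1, x0:x1] = image[y0:y1, x0:x1]
    return out

# Tiny stand-in image with shape (height, width, channels); all values are nonzero.
image = np.arange(1, 6 * 8 * 3 + 1).reshape(6, 8, 3)
box = (2, 1, 5, 4)   # hypothetical bounding box around the object

background_only = mask_foreground(image, box)   # object removed
foreground_only = mask_background(image, box)   # only the object kept
```

Feeding `background_only` and `foreground_only` to a classifier and comparing accuracy against the unmasked image is one way to estimate how much a model relies on each region.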

“ ViT-B16 detects background features. Left: Image optimized to maximally activate a feature from layer 6. Center: Corresponding maximally activating example from ImageNet. Right: The image’s patch-wise activation map. (b): An example of an original image and masked-out foreground and background.” caption and image from the original article: source

We find it surprising that even though every patch can influence the representation of every other patch, these representations remain local, even for individual channels in deep layers in the network. While a similar finding for CNNs, whose neurons may have a limited receptive field, would be unsurprising, even neurons in the first layer of a ViT have a complete receptive field. In other words, ViTs learn to preserve spatial information, despite lacking the inductive bias of CNNs. -source: original article

In other words, during training, the model learns how to preserve spatial information. The last layer, by contrast, has a uniform activation pattern and learns how to classify the image (according to the authors, the last layer has the function of globalizing information).
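
The remark that even first-layer ViT neurons have a complete receptive field follows from how self-attention mixes tokens. The sketch below uses a single attention head with hypothetical random weights and toy sizes (a real ViT adds multiple heads, residual connections, and layer normalization) and counts how many output tokens move when one input patch is changed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over (num_tokens, dim) input."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
num_tokens, dim = 10, 8                        # hypothetical toy sizes
tokens = rng.normal(size=(num_tokens, dim))
wq, wk, wv = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

baseline = self_attention(tokens, wq, wk, wv)

perturbed = tokens.copy()
perturbed[3] += 1.0                            # change a single patch token
changed = self_attention(perturbed, wq, wk, wv)

# Number of output tokens whose representation moved after touching one input.
affected = int((np.abs(changed - baseline).sum(axis=1) > 1e-9).sum())
```

In this toy setup every output token is affected, which is what makes the locality of the learned features notable: nothing in the architecture forces it.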

Based on the preservation of spatial information in patches, we hypothesize that the CLS token plays a relatively minor role throughout the network and is not used for globalization until the last layer.

“ Example feature visualization from ViT feed-forward layer. Left: Image optimized to maximally activate a feature from layer 5. Center: Corresponding maximally activating ImageNet example. Right: The image’s patch-wise activation map. (b): A feature from the last layer most activated by shopping carts.” caption and image from the original article: source

In recent years, vision transformer models have been trained with language supervision and contrastive learning techniques. One prominent example is CLIP. Because these models are increasingly used and increasingly competitive, the authors also analyzed CLIP.
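
For readers unfamiliar with contrastive training under language supervision, the general idea can be sketched as a symmetric cross-entropy over an image-text similarity matrix. This is a generic illustration rather than CLIP's actual training code; the embeddings are random stand-ins, and the temperature value and the name `contrastive_loss` are assumptions.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # temperature-scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # each image picks its caption
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # each caption picks its image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 32))                    # hypothetical caption embeddings
aligned = text + 0.1 * rng.normal(size=(8, 32))    # image embeddings close to their captions
shuffled = aligned[::-1]                           # same embeddings, wrong pairing

loss_aligned = contrastive_loss(aligned, text)
loss_shuffled = contrastive_loss(shuffled, text)
```

The loss is low when each image embedding sits closest to its own caption and high when the pairing is wrong, which pushes the image encoder toward features that can be described in text.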

‘Left: Feature optimization shows sharp boundaries, and maximally activating ImageNet examples contain distinct, adjacent images. Middle: Feature optimization and maximally activating ImageNet photos all show images from an elevated vantage point. Right: Feature optimization shows a crowd of people, but maximally activating images indicate that the repetition of objects is more relevant than the type of object.’ caption and image from the original article: source

The model contains features related to abstract, caption-like concepts, such as “before and after” or “from above.” In other words, there are features that represent conceptual categories and are clearly discernible:

The corresponding seven highly activating images from the dataset include other distinct objects such as bloody weapons, zombies, and skeletons. From a strictly visual point of view, these classes have very dissimilar attributes, indicating this feature might be responsible for detecting components of an image relating broadly to morbidity.

“Features from ViT trained with CLIP that relates to the category of morbidity. Top-left image in each category: Image optimized to maximally activate a feature from layer 10. Rest: Seven of the ten ImageNet images that most activate the feature.” caption and image from the original article: source

Conclusions

To understand, seeing is always better. In recent years there has been an increasing emphasis on the need for the interpretability of models. While there are many working methods for CNNs, visualizing the features of ViTs had not been possible.

The authors not only identified a method for doing this (they showed that one has to use the feed-forward layer and not the self-attention layer) but also analyzed the properties of these features. They showed that the model learns spatial relationships during training and that the last layer, on the other hand, does not participate in this spatial representation.

Furthermore, although ViTs are similar to convolutional networks, according to the authors part of their success comes from how they make better use of background-related information. They also show that when ViTs are trained with language model supervision, they learn more semantic and conceptual features rather than object-specific visual features.

Code: here, article: here

