ViLBERT, a model for learning joint representations of image and text
Many real-world applications don’t involve only one data modality. Web pages, for example, contain text, images, videos, etc. Restricting oneself to using only one modality would involve losing all the information contained in the others. Multi-modal machine learning aims to build models that can process and relate information from multiple modalities. In this tutorial, we focus on two main modalities: written text as linguistic and image as visual signals.
Joint image-text representation is the bedrock for many Vision-and-Language tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. This enables a wide range of applications, such as visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval.
Figure 1. An example of different Vision-and-Language tasks [1].
In recent years, different approaches have been proposed to learn joint representations of image and language; however, they are mainly task-specific models rather than a single unified model. This means the model that understands questions cannot ground noun phrases, the grounding model cannot retrieve images based on a description, and so forth. While individual tasks present different challenges and diverse interfaces, the underlying associations between language and visual concepts are often common across tasks. For example, learning to ground the referring expression “small red vase” requires understanding the same concepts as answering the question “What color is the small vase?”. Training multiple tasks jointly can potentially pool these different sources of grounding supervision [1].
In this tutorial, I explain VILBERT (short for Vision-and-Language BERT) [1], a joint model for learning task-independent visual grounding from paired visiolinguistic data. I start with a brief theoretical explanation, then I go step by step through setting up the environment, preparing the data, and fine-tuning the pre-trained VILBERT model, and finally I discuss how we can use this model to extract visiolinguistic representations for an image (or text). This tutorial is based on the code released by the authors of ‘ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks’ [1], [2]: https://github.com/facebookresearch/vilbert-multi-task.
VILBERT
This model extends the recently developed BERT [3] language model to jointly reason about text and images. The key technical innovation, as shown in Figure 2, is the introduction of separate streams for vision and language processing that communicate through co-attentional transformer layers. This structure can accommodate the different processing needs of each modality and provides interaction between the modalities at varying representation depths.
Figure 2. VILBERT architecture [1]
Given an image represented as a set of region features v1, …, vT and a text input represented as a set of tokens w0, …, wT, VILBERT outputs final representations hv0, …, hvT and hw0, …, hwT, where hv0 and hw0 are the holistic representations of the image and text.
Figure 3. ViLBERT model consists of two parallel streams for visual (green) and linguistic (purple) processing that interact through novel co-attentional transformer layers (Co-TRM) [1]
Pre-trained models: Initially, ViLBERT was pre-trained on the Conceptual Captions dataset [4], and later it was further (jointly) fine-tuned on 6 different tasks with different datasets (the goal was to build a task-independent vision-and-language model).
This tutorial comprises four main parts:
setting up the environment
preparing the datasets and extracting the features
fine-tuning and evaluating the model
extracting the visiolinguistic embeddings from the image-text data
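Setting up the environment

Before installing the detection dependencies below, you need the VILBERT code itself and a Python environment for it. The exact steps are in the repo’s README; the following is only a rough sketch of what that usually looks like (the environment name and Python version are my assumptions, not taken from the repo):

# create and activate a conda environment for the project (name and Python version are assumptions)
conda create -n vilbert-mt python=3.6
conda activate vilbert-mt
# get the code and install its Python dependencies
git clone https://github.com/facebookresearch/vilbert-multi-task.git
cd vilbert-multi-task
pip install -r requirements.txt

The commands below then install maskrcnn-benchmark, which we will need later for extracting image features.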
# maskrcnn_benchmark and coco api dependencies
pip install ninja yacs cython matplotlib
# follow PyTorch installation in https://pytorch.org/get-started/locally/
# we give the instructions for CUDA 9.0
conda install pytorch-nightly -c pytorch
# install PyTorch Detection
cd ~/github
git clone https://github.com/facebookresearch/maskrcnn-benchmark.git
cd maskrcnn-benchmark
# the following will install the lib with
# symbolic links, so that you can modify
# the files if you want and won't need to
# re-build it
python setup.py build develop
If you have followed the above steps without problems, you have successfully created the perfect environment for the VILBERT project. I personally struggled a bit to set up my environment; it took me almost two days to resolve all the conflicts, but hopefully it won’t take that long for you.
Dataset Preparation
Downloading the prepared datasets
If you want to train and test the model on the existing VILBERT project datasets, there is no need to extract the features from the raw images. You can download the datasets used by the paper from:
# create a folder named data in your project directory
cd data
# create a folder named datasets1 in your project directory
cd datasets1
wget https://dl.fbaipublicfiles.com/vilbert-multi-task/datasets.tar.gz
tar xf datasets.tar.gz
In this tutorial, the task we are interested in is Caption-Based Image Retrieval, and the dataset we work with is flickr30k. After downloading the datasets, you will find several folders and files inside the flickr30k folder:
1) *.lmdb folders: these contain the image features extracted using a pre-trained object detection network. There are two lmdb folders inside the flickr30k folder, for the two types of features that are extracted from the images. We will explain this further in the next section, where we extract the features from raw images.
2) *.jsonlines file, which contains the captions along with the image ids. Each line in this file is a dictionary whose main items are the sentences (captions), the image id, and the image path.
3) *.pkl file, which contains the hard negative samples.
So, what are the hard negatives? For training (or fine-tuning) the model, we need negative examples in addition to the positive image-caption pairs. In the VILBERT paper, the authors train the model in a 4-way multiple-choice setting by randomly sampling three distractors for each image-caption pair: a random caption, a random image, or a hard negative from among the 100 nearest neighbors of the target image. The hard negatives are selected offline and are fixed [2].
Extracting the features from raw images
To see how the image features are generated, here is the explanation from the VILBERT paper [1]:
Image region features are generated by extracting bounding boxes and their visual features from a pre-trained object detection network (we use Faster R-CNN to extract region features). Unlike words in text, image regions lack a natural ordering. Therefore, we encode spatial location instead, constructing a 5-d vector from region position and the fraction of image area covered. This is then projected to match the dimension of the visual feature and they are summed. We mark the beginning of an image region sequence with a special IMG token representing the entire image.
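As a concrete illustration of that 5-d spatial vector: for a region box (x1, y1, x2, y2) in an image of width W and height H, it is just the normalized corner coordinates plus the fraction of the image the box covers. A minimal sketch (not code from the repo):

def spatial_feature(x1, y1, x2, y2, W, H):
    # normalized top-left corner, normalized bottom-right corner,
    # and the fraction of the image area covered by the region
    return [x1 / W, y1 / H, x2 / W, y2 / H, ((x2 - x1) * (y2 - y1)) / (W * H)]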
First, we download the original flickr30k dataset (not the extracted features as above) into the data/rawdatasets/flickr30k directory (you can download the dataset from https://www.kaggle.com/hsankesara/flickr-image-dataset). The flickr30k dataset includes a folder of images and a file which contains all the captions for the images:
Then, we download the trained object detection model and its config file:
cd data
wget https://dl.fbaipublicfiles.com/vilbert-multi-task/detectron_model.pth
wget https://dl.fbaipublicfiles.com/vilbert-multi-task/detectron_config.yaml
Now that we’ve got the images, the captions, and the object detection model, we can extract the image features by calling extract_features.py:
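The call looks roughly like the following, using the repo’s script/extract_features.py (the flag names are as I recall them from the repo, so double-check them against the script; the paths follow this tutorial’s directory layout):

python script/extract_features.py \
  --model_file data/detectron_model.pth \
  --config_file data/detectron_config.yaml \
  --image_dir data/rawdatasets/flickr30k/images \
  --output_folder data/rawdatasets/flickr30k/image_features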
The image directory argument points to where your images are, which for us is data/rawdatasets/flickr30k/images, and the output folder argument is where you want to save the image features, which we set to data/rawdatasets/flickr30k/image_features.
Then we have to convert the extracted features into an LMDB file:
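If the repo version you are using ships a conversion script, prefer it so that the key and serialization format match its feature reader. Purely to illustrate what this step does, here is a minimal sketch using the lmdb package (the per-image .npy layout and the key scheme are assumptions):

import glob
import os
import pickle

import lmdb
import numpy as np

features_dir = "data/rawdatasets/flickr30k/image_features"  # output of the extraction step
lmdb_path = "data/rawdatasets/flickr30k/flickr30k.lmdb"

env = lmdb.open(lmdb_path, map_size=1 << 40)  # large virtual map size; only real data is written
keys = []
with env.begin(write=True) as txn:
    for path in sorted(glob.glob(os.path.join(features_dir, "*.npy"))):
        image_id = os.path.splitext(os.path.basename(path))[0]
        item = np.load(path, allow_pickle=True).item()  # assumed: a dict with boxes, features, image size, ...
        txn.put(image_id.encode(), pickle.dumps(item))
        keys.append(image_id.encode())
    txn.put(b"keys", pickle.dumps(keys))  # index of all image ids
env.close()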
Now you should have a directory named flickr30k.lmdb which contains two files: data.mdb and lock.mdb.
At this point, we have extracted the image features. The next step is to prepare the jsonlines file with the captions and image paths. We already have the csv file of the captions; the only thing we need to do is convert it into the proper jsonlines format by running the following code:
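A minimal sketch of such a conversion, assuming the Kaggle captions file is pipe-separated with image_name / comment_number / comment columns and that the target records use the sentences / id / img_path fields described earlier (adjust both if your files differ):

import json
import os
from collections import defaultdict

import pandas as pd

csv_path = "data/rawdatasets/flickr30k/captions.csv"
out_path = "data/rawdatasets/flickr30k/flickr30k.jsonlines"

df = pd.read_csv(csv_path, sep="|", skipinitialspace=True)
df.columns = [c.strip() for c in df.columns]

# group the (typically five) captions of each image into one record
captions = defaultdict(list)
for _, row in df.iterrows():
    captions[row["image_name"]].append(str(row["comment"]).strip())

with open(out_path, "w") as out:
    for image_name, sentences in captions.items():
        image_id = int(os.path.splitext(image_name)[0])  # flickr30k file names are numeric ids
        record = {"sentences": sentences, "id": image_id, "img_path": image_name}
        out.write(json.dumps(record) + "\n")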
The flickr30k folder should now look like this: captions.csv, images (the original flickr30k images), image_features (the extracted image features), flickr30k.lmdb, and flickr30k.jsonlines. We further split the jsonlines file into train and test sets.
The files that we will use for training and testing the models are flickr30k.lmdb, flickr30k_train.jsonlines, and flickr30k_test.jsonlines.
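The split itself can be as simple as shuffling the jsonlines records and writing two files; a quick sketch (the 90/10 ratio is an arbitrary choice of mine):

import random

src = "data/rawdatasets/flickr30k/flickr30k.jsonlines"
with open(src) as f:
    lines = f.readlines()

random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(lines)
cut = int(0.9 * len(lines))

with open("data/rawdatasets/flickr30k/flickr30k_train.jsonlines", "w") as f:
    f.writelines(lines[:cut])
with open("data/rawdatasets/flickr30k/flickr30k_test.jsonlines", "w") as f:
    f.writelines(lines[cut:])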
How about the hard negatives (.pkl file)? As explained above, hard negatives can be chosen by mining the nearest neighbors of each image; however, in this tutorial we skip that and use random image-text pairs instead.
If you take a look at the VILBERT GitHub repository, you will see that there are lots of files and folders. Here, I will take you through the most important classes of this project that we need for fine-tuning on the Caption-Based Image Retrieval task.
In addition to these files, there are a few config files that need to be set:
- vilbert_tasks.yml
- vilbert/datasets/__init__.py
There is also a config folder which contains different configurations for the VILBERT model and for the BERT model that is responsible for extracting the text representation. We choose the default configurations for the VILBERT and BERT models (bert_base_6layer_6conect.json and bert-base-uncased_weight_name.json, respectively), but you can try out the other configurations. We also get to choose which model to fine-tune: the baseline model, which is a single-stream model, or VILBERT, which is a two-stream model (we will choose between them later).
Let’s start with the code. We fine-tune the model by calling train_tasks.py. The main arguments that need to be set are discussed below.
We set the path to the image features and jsonline files in vilbert_tasks.yml. Some other arguments can be set in this file, such as the maximum sequence length of the caption and the maximum number of region features per image.
In vilbert/datasets/__init__.py we set the classes that are responsible for loading the datasets for each task. Here, we have only one task (RetrievalFlickr30k), and accordingly we specify RetrievalDataset and RetrievalDatasetVal as the classes for loading the training and validation datasets:
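Roughly, that registration amounts to importing the dataset classes and mapping the task name to them. The dictionary and module names below follow my reading of the repo (its own spelling of the class and file names may differ slightly), so treat this as a sketch and verify it against the actual vilbert/datasets/__init__.py:

# in vilbert/datasets/__init__.py (sketch -- verify the names against the actual file)
from .retrieval_dataset import RetrievalDataset, RetrievalDatasetVal

DatasetMapTrain = {
    "RetrievalFlickr30k": RetrievalDataset,     # used when fine-tuning with train_tasks.py
}

DatasetMapEval = {
    "RetrievalFlickr30k": RetrievalDatasetVal,  # used for validation/evaluation
}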
What train_tasks.py does first is load the train and validation datasets through the LoadDatasets function in task_utils.py.
The LoadDatasets function calls the RetrievalDataset class from retrieval_datasets.py to prepare the train and eval datasets and dataloaders.
In the RetreivalDataset class, the image features and captions are prepared by loading the hard negative samples (if they exist) and tokenizing the caption sentences. The results are then saved in a cache directory.
After loading the datasets, train_tasks.py calls an intermediate function, ForwardModelsTrain. This function receives the features and calls the VILBERT model in vilbert.py.
There is an important argument here, ‘output_all_encoded_layers’. You can set it if you wish to get the output of all the hidden layers; by default, only the output of the last layer is returned.
The main part of the forward function is the call to the BERT model (BertModel), which returns five outputs: sequence_output_t, sequence_output_v, pooled_output_t, pooled_output_v, and all_attention_mask. The first two are the outputs of the hidden layer(s) (the last layer, or all the layers if you set output_all_encoded_layers=True) for the text tokens and the image features. pooled_output_t and pooled_output_v are the outputs of an extra dense linear layer on top of the last layer, and they are used to calculate the loss.
We will explain how to use the first two outputs, sequence_output_t and sequence_output_v, to get the visiolinguistic embedding of the image-text data in the last part of this tutorial.
Now that we have set up the arguments and explained how the code works, we can run train_tasks.py to fine-tune the VILBERT model on our dataset:
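A hedged example of the call (the flag names are as I recall them from the repo’s train_tasks.py, so confirm them with --help; the task id placeholder must match the RetrievalFlickr30k entry in vilbert_tasks.yml):

python train_tasks.py \
  --bert_model bert-base-uncased \
  --from_pretrained <path to the pre-trained VILBERT weights> \
  --config_file config/bert_base_6layer_6conect.json \
  --tasks <task id of RetrievalFlickr30k in vilbert_tasks.yml> \
  --save_name finetuned_flickr30k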
You can also set ‘frequency_iter’ to evaluate the model after a given number of iterations, so that you get evaluation results during an epoch rather than only at the end of each epoch.
The fine-tuned model is saved in the save directory that you specified, and we will evaluate it in the next section.
Evaluating the model
Similar to the fine-tuning process, some arguments should be set before calling eval_retrieval.py. Here, we set the ‘from_pretrained’ argument to the path where we saved our fine-tuned model:
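Again, a hedged sketch of the call (same caveat about the flag names; point --from_pretrained at the checkpoint written to your save directory during fine-tuning):

python eval_retrieval.py \
  --bert_model bert-base-uncased \
  --from_pretrained <path to your fine-tuned checkpoint> \
  --config_file config/bert_base_6layer_6conect.json \
  --tasks <task id of RetrievalFlickr30k> \
  --split test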
The last part of this tutorial is about using the VILBERT model to get the visiolinguistic embedding of the image-text data. It is very similar to how we get the BERT embedding from text data.
First, we modify the VILBertForVLTasks class by adding a function that returns the embeddings for the input data:
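A sketch of what such a method can look like, under the assumption that self.bert is called exactly as in the existing forward method (its arguments are abbreviated here as bert_inputs) and using the five outputs described earlier:

def get_embeddings(self, *bert_inputs, output_all_encoded_layers=True, layerno=-1):
    # bert_inputs: the same tensors that forward() passes to self.bert
    # (text tokens, image features, spatial locations, attention masks, ...)
    sequence_output_t, sequence_output_v, pooled_output_t, pooled_output_v, _ = self.bert(
        *bert_inputs, output_all_encoded_layers=output_all_encoded_layers
    )
    if output_all_encoded_layers:
        # when all layers are returned, pick the one requested by layerno
        sequence_output_t = sequence_output_t[layerno]
        sequence_output_v = sequence_output_v[layerno]
    # h_w0 (the first text token) and h_v0 (the IMG token) are the holistic
    # text and image representations
    return sequence_output_t[:, 0], sequence_output_v[:, 0]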
There are two important arguments: output_all_encoded_layers and layerno. We set the first one to True to get the outputs of all the hidden layers of the VILBERT model (there could be 2, 4, 6, or 8 layers depending on the chosen config; here we selected the 6-layer base config). layerno specifies which layer you want to get the embeddings from; the default is the last layer (layerno=-1).
As we explained at the very start of this tutorial, VILBERT outputs final representations hv0, …, hvT and hw0, …, hwT, where hv0 and hw0 are the holistic image and text representations. Accordingly, we return them as the visiolinguistic embeddings for the image and the caption.
So, to obtain the visiolinguistic embedding, we first set some arguments in identify_vilbert_emds.py:
As with fine-tuning and evaluating, we have to set the ‘bert_model’, ‘from_pretrained’, and ‘config_file’ arguments. You can get the embeddings from one of the existing pre-trained VILBERT models (multi_task_model.bin or pretrained_model.bin) or from your own fine-tuned model. We have also created a file, vilbert_transfer_tasks.yml, to specify some parameters:
A new class, ‘RetreivalDatasetTrans’, is also created to load the dataset, and vilbert/datasets/__init__.py is modified to include it:
That’s all from me, folks. I hope you enjoyed the post and now have a clearer picture of VILBERT: how you can extract the features, fine-tune and evaluate the model, and get the visiolinguistic embeddings. Feel free to post your feedback or questions in the comments section.
If you liked my article, please give it a clap :)
References:
[1] Lu, J., Batra, D., Parikh, D. and Lee, S., 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265.
[2] Lu, J., Goswami, V., Rohrbach, M., Parikh, D. and Lee, S., 2020. 12-in-1: Multi-task vision and language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10437–10446).
[3] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[4] Sharma, P., Ding, N., Goodman, S. and Soricut, R., 2018, July. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2556–2565).