Summary

DeBERTa (Decoding-enhanced BERT with disentangled attention) is an advanced language model that improves upon BERT and RoBERTa by incorporating a disentangled attention mechanism, an enhanced mask decoder, and a novel virtual adversarial training method called SiFT.

Abstract

DeBERTa introduces significant enhancements to the BERT and RoBERTa models through two key innovations: the disentangled attention mechanism and the enhanced mask decoder. The disentangled attention mechanism separately encodes the content and position of each word, allowing for more nuanced attention weight calculations based on both content and relative position. This approach is inspired by the observation that the attention weight between word pairs is influenced by both their semantic content and their positions relative to each other. Additionally, DeBERTa employs an improved mask decoder that incorporates absolute position embeddings at the decoding stage, enhancing the model's ability to predict masked tokens during pre-training. The model also benefits from a new virtual adversarial training method known as Scale Invariant Fine Tuning (SiFT), which improves generalization and stability during fine-tuning. The DeBERTa v2 iteration further refines the model with a larger vocabulary, an additional convolution layer for local dependency learning, shared position and content projection matrices, log bucket encoding for relative positions, and larger model sizes, resulting in improved performance on downstream tasks.

Opinions

The authors of DeBERTa recognize the importance of both content and relative position in word pair attention, which is a departure from previous models like BERT.
The use of an enhanced mask decoder in DeBERTa is seen as a superior method for incorporating absolute positions, as it does so at the decoding stage rather than in the input layer.
The introduction of SiFT suggests a belief that normalizing word embedding vectors during fine-tuning can significantly enhance model performance, especially for larger models.
The updates in DeBERTa v2, such as the use of a sentencepiece-based tokenizer and larger vocabulary, indicate a commitment to continuous improvement and adaptation based on empirical evidence and the latest research findings.
The addition of larger model sizes in DeBERTa v2 reflects an opinion that scaling up model parameters can lead to better results in downstream tasks.

Papers Explained 08: DeBERTa

DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques.

The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively.
Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training.
In addition, a new virtual adversarial training method is used for fine-tuning to improve models’ generalization.

Disentangled Attention

Unlike BERT, where each word in the input layer is represented by a single vector that is the sum of its word (content) embedding and position embedding, DeBERTa represents each word with two vectors that encode its content and position, respectively. The attention weights among words are then computed using disentangled matrices based on their contents and relative positions. This is motivated by the observation that the attention weight of a word pair depends not only on their contents but also on their relative positions. For example, the dependency between the words "deep" and "learning" is much stronger when they occur next to each other than when they occur in different sentences.

For a token at position i in a sequence, we represent it using two vectors, {H_i} and {P_{i|j}}, which represent its content and its relative position with respect to the token at position j, respectively. The cross-attention score between tokens i and j can be decomposed into four components as:
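In the notation of the DeBERTa paper, where the relative position vector is written P_{i|j}, the decomposition can be written as follows (reconstructed here from the paper's definitions):

```latex
\begin{aligned}
A_{i,j} &= \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^{\top} \\
        &= H_i H_j^{\top} + H_i P_{j|i}^{\top} + P_{i|j} H_j^{\top} + P_{i|j} P_{j|i}^{\top}
\end{aligned}
```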

That is, the attention weight of a word pair can be computed as a sum of four attention scores using disentangled matrices on their contents and positions as content-to-content, content-to-position, position-to-content, and position-to-position.

Taking single-head attention as an example, the standard self-attention operation can be formulated as:
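With H the matrix of input hidden vectors, H_o the output of self-attention, W_q, W_k, W_v the projection matrices, and d the dimension of the hidden states, the standard formulation is:

```latex
Q = H W_q, \quad K = H W_k, \quad V = H W_v, \quad
A = \frac{Q K^{\top}}{\sqrt{d}}, \quad
H_o = \operatorname{softmax}(A)\, V
```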

Denote k as the maximum relative distance and δ(i, j) ∈ [0, 2k) as the relative distance from token i to token j, which is defined as:
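Following the paper, the relative distance is the offset i - j shifted by k and clipped to the range [0, 2k):

```latex
\delta(i, j) =
\begin{cases}
0 & \text{for } i - j \le -k \\
2k - 1 & \text{for } i - j \ge k \\
i - j + k & \text{otherwise}
\end{cases}
```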

The disentangled self-attention with relative position bias can be represented as:
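In the paper's formulation, the content projections are Q_c = H W_{q,c}, K_c = H W_{k,c}, V_c = H W_{v,c}, and the position projections are Q_r = P W_{q,r}, K_r = P W_{k,r}, where P is a relative position embedding table with 2k rows. The score and output are

Ã_{i,j} = Q^c_i (K^c_j)^T + Q^c_i (K^r_{δ(i,j)})^T + K^c_j (Q^r_{δ(j,i)})^T,   H_o = softmax(Ã / sqrt(3d)) V_c

The position-to-position term is dropped in the paper because, with relative position embeddings, it adds little extra information. The following is a minimal NumPy sketch of this single-head computation. The function and argument names are illustrative assumptions, not the authors' code, and batching, multiple heads, and masking are left out.

```python
import numpy as np

def relative_distance(n, k):
    """delta[i, j]: the offset i - j shifted by k and clipped to [0, 2k)."""
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    return np.clip(diff + k, 0, 2 * k - 1)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def disentangled_attention(H, P, Wqc, Wkc, Wvc, Wqr, Wkr, k):
    """Single-head disentangled attention (illustrative sketch).

    H: (n, d_model) content vectors; P: (2k, d_model) relative position table.
    All projection matrices have shape (d_model, d).
    """
    n, d = H.shape[0], Wqc.shape[1]
    Qc, Kc, Vc = H @ Wqc, H @ Wkc, H @ Wvc   # content projections, (n, d)
    Qr, Kr = P @ Wqr, P @ Wkr                # position projections, (2k, d)
    delta = relative_distance(n, k)          # (n, n) integer indices

    # content-to-content: Qc[i] . Kc[j]
    c2c = Qc @ Kc.T
    # content-to-position: Qc[i] . Kr[delta(i, j)]
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)
    # position-to-content: Kc[j] . Qr[delta(j, i)], hence the transpose
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T

    scores = (c2c + c2p + p2c) / np.sqrt(3 * d)
    return softmax(scores) @ Vc
```

Computing the full (n, 2k) products first and then gathering with the δ indices avoids materializing a separate position vector for every (i, j) pair.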

Enhanced Mask Decoder

DeBERTa incorporates absolute word position embeddings right before the softmax layer where the model decodes the masked words based on the aggregated contextual embeddings of word contents and positions.

There are two methods of incorporating absolute positions. The BERT model incorporates absolute positions in the input layer. In DeBERTa, these are incorporated right after all the Transformer layers but before the softmax layer for masked token prediction. In this way, DeBERTa captures the relative positions in all the Transformer layers and only uses absolute positions as complementary information when decoding the masked words. Thus, we call DeBERTa’s decoding component an Enhanced Mask Decoder (EMD).
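The contrast between the two placements can be sketched in a few lines of NumPy. This is a deliberate simplification under stated assumptions: the paper's decoder is more elaborate than a single addition, and `emd_logits`, `hidden`, `abs_pos_emb`, and `word_emb` are hypothetical names chosen for illustration.

```python
import numpy as np

def emd_logits(hidden, abs_pos_emb, word_emb):
    """Add absolute positions after the encoder, just before vocabulary projection.

    hidden:      (n, d) output of the last Transformer layer, which has only
                 seen relative positions
    abs_pos_emb: (max_len, d) absolute position embeddings
    word_emb:    (vocab, d) output word embedding matrix
    """
    n = hidden.shape[0]
    decoder_in = hidden + abs_pos_emb[:n]   # absolute positions enter here
    return decoder_in @ word_emb.T          # (n, vocab) logits, fed to softmax

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 16))
abs_pos_emb = rng.normal(size=(32, 16))
word_emb = rng.normal(size=(100, 16))
logits = emd_logits(hidden, abs_pos_emb, word_emb)
```

In a BERT-style model the same `abs_pos_emb[:n]` term would instead be added to the token embeddings before the first Transformer layer.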

Scale Invariant Fine Tuning

SiFT is a new virtual adversarial training algorithm that improves the training stability by applying the perturbations to the normalized word embeddings.

Specifically, when fine-tuning DeBERTa to a downstream NLP task in the experiments, SiFT first normalizes the word embedding vectors into stochastic vectors, and then applies the perturbation to the normalized embedding vectors. It was found that the normalization substantially improves the performance of the fine-tuned models. The improvement is more prominent for larger DeBERTa models.
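The effect of normalizing before perturbing can be illustrated with a small NumPy sketch. It assumes a layer-norm-style normalization as a stand-in for the normalization step and uses a random direction in place of the adversarial direction that real virtual adversarial training would obtain from gradients. The helper names and the `epsilon` value are hypothetical.

```python
import numpy as np

def normalize_embeddings(emb, eps=1e-6):
    """Per-token normalization to zero mean and unit variance (assumed stand-in)."""
    mean = emb.mean(axis=-1, keepdims=True)
    std = emb.std(axis=-1, keepdims=True)
    return (emb - mean) / (std + eps)

def perturb_normalized(emb, direction, epsilon=1e-2):
    """Normalize the embeddings, then add a perturbation of L2 norm epsilon per token."""
    normed = normalize_embeddings(emb)
    norm = np.linalg.norm(direction, axis=-1, keepdims=True)
    step = epsilon * direction / np.maximum(norm, 1e-12)
    return normed + step

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))        # 4 tokens, 16-dim embeddings
direction = rng.normal(size=(4, 16))  # random stand-in for an adversarial direction

small = perturb_normalized(emb, direction)
large = perturb_normalized(100.0 * emb, direction)  # same embeddings, 100x the scale
```

Because the perturbation is applied after normalization, `small` and `large` are nearly identical: the relative size of the perturbation no longer depends on the scale of the embedding vectors, which can differ widely across words and models.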

DeBERTa v2

Vocabulary: In v2, the tokenizer uses a new vocabulary of size 128K built from the training data. Instead of a GPT2-based tokenizer, v2 uses a sentencepiece-based tokenizer.

nGiE (nGram Induced Input Encoding): The DeBERTa-v2 model uses an additional convolution layer alongside the first Transformer layer to better learn the local dependencies of input tokens.

Sharing the position projection matrix with the content projection matrix in the attention layer: Based on previous experiments, this saves parameters without affecting performance.

Applying buckets to encode relative positions: The DeBERTa-v2 model uses log buckets to encode relative positions, similar to T5.

900M model & 1.5B model: Two additional model sizes are available, 900M and 1.5B, which significantly improve performance on downstream tasks.
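The log bucket encoding of relative positions listed above can be sketched as follows: small offsets keep their exact value, while larger offsets are compressed logarithmically into the remaining buckets. This is a sketch of the general scheme, not the library's code. The helper name `log_bucket_position` and the parameters `bucket_size` and `max_position` are assumptions for illustration.

```python
import math

def log_bucket_position(rel_pos, bucket_size, max_position):
    """Map a signed relative position to a signed bucket index.

    Offsets with |rel_pos| <= bucket_size // 2 are kept exact; larger offsets
    are placed on a logarithmic scale up to max_position - 1.
    """
    mid = bucket_size // 2
    abs_pos = abs(rel_pos)
    if abs_pos <= mid:
        return rel_pos
    sign = 1 if rel_pos > 0 else -1
    # Fraction of the log-range [mid, max_position - 1] covered by abs_pos
    frac = math.log(abs_pos / mid) / math.log((max_position - 1) / mid)
    return sign * (math.ceil(frac * (mid - 1)) + mid)

# Nearby tokens keep exact offsets; distant tokens share buckets.
buckets = [log_bucket_position(r, bucket_size=256, max_position=512)
           for r in (-300, -5, 0, 5, 128, 300, 511)]
```

The trade-off is resolution for range: exact offsets are kept where they matter most (nearby tokens), and a fixed number of buckets still covers long sequences.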

Source: https://huggingface.co/docs/transformers/model_doc/deberta-v2

Paper

DeBERTa: Decoding-enhanced BERT with Disentangled Attention (arXiv:2006.03654)
