RoPE: A Detailed Guide to Rotary Position Embedding in Modern LLMs

Rotary Position Embedding (RoPE) has been widely applied in recent large language models (LLMs) to encode positional information, including Meta’s LLaMA and Google’s PaLM.

Position is crucial in sequential models, and position embedding plays a vital role in transformer-based architectures. RoPE uses a clever method to incorporate both relative and absolute positional information.

Rethink the Attention Product

Before introducing RoPE, let’s recap the basics of the attention mechanism. Attention focuses on pair-wise relationships: there’s a query vector q from one token and a key vector k from another. We obtain the attention score by taking the inner product of q and k, and this inner product is key to how position embeddings function.

For example, to get the attention score for the pair (1, 3), we get the query vector from token 1 and the key vector from token 3.

We obtain the query vector q1 by first extracting the token embedding of token 1 through the token encoder. Then, we feed this embedding and its positional information into the attention+position encoder, which integrates the position information and projects the result to produce the query vector.

We perform a similar process for the third token to obtain k3, the key vector corresponding to token 3.

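Finally, we take the inner product of q1 and k3 to determine the attention score for (1, 3). In the equation below, angle brackets ⟨·,·⟩ denote the inner product, x represents the token embedding, and f_q and f_k are the attention+position encoders for the query and the key:

$$a_{1,3} = \langle q_1, k_3 \rangle = \langle f_q(x_1, 1),\; f_k(x_3, 3) \rangle$$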

The authors then reflect on this formulation and realize that in this setup, the relative positional information is encoded before the inner product — meaning it’s inherently tied to the token embedding.

They ask themselves: “Is there another way to encode relative positional information only when we need the attention score, i.e., at the moment we perform the q,k inner product?” Or, equivalently: can the q,k inner product be expressed as another function g that takes only the token embeddings and their relative position as input?

This is where the RoPE position embedding comes into play.

Intuition of RoPE: A 2D Simple Case

The authors begin by considering a simple 2D case, where token embeddings and attention vectors (query, key) all reside in 2D space. For convenience, these 2D vectors can also be represented as complex numbers, with a vector (x₁, x₂) written as x₁ + i·x₂.

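A counterclockwise rotation by an angle θ can be expressed both in matrix form and in exponential form:

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \qquad\Longleftrightarrow\qquad z \mapsto z\,e^{i\theta}, \quad e^{i\theta} = \cos\theta + i\sin\theta$$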

Counterclockwise rotation in the 2D condition

Similarly, we can represent the projection from token embeddings to key or query vectors using 2D matrices.

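The authors discover that one possible solution (i.e., a choice of the transformations f and g) that satisfies the following condition:

$$\langle f_q(x_m, m),\; f_k(x_n, n) \rangle = g(x_m, x_n, m - n)$$

has the following form:

$$f_q(x_m, m) = (W_q x_m)\, e^{i m \theta}$$

$$f_k(x_n, n) = (W_k x_n)\, e^{i n \theta}$$

$$g(x_m, x_n, m - n) = \mathrm{Re}\left[ (W_q x_m)\,(W_k x_n)^{*}\, e^{i (m - n) \theta} \right]$$

Here m and n are the positions of the two tokens, W_q and W_k are the 2D projection matrices introduced above, θ is a preset non-zero constant angle, Re[·] takes the real part of a complex number, and * denotes the complex conjugate.

or graphically: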

A solution in the 2D case, represented graphically.

In simple terms, this means that after the transformation, we can either rotate first and then perform the inner product, or we can perform the inner product first and then rotate, and take the real part. In the second approach, we only need (m–n) for the rotation, which signifies that this is a type of relative position embedding.

This is the intuition behind Rotary Position Embedding (RoPE): simply rotate the affine-transformed word embedding vector by an angle proportional to its position index.
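The equivalence of the two routes can be checked numerically. The snippet below is a minimal sketch using only Python's standard library; the names `rope_2d`, `inner`, `score_rotate_first` and `score_after_inner_product`, the angle `theta = 0.5`, and the sample vectors are illustrative assumptions rather than anything taken from the paper. It stores each 2D vector as a complex number and compares "rotate first, then take the inner product" with "take the product first, then rotate by (m − n)·θ and keep the real part".

```python
import cmath


def rope_2d(v: complex, pos: int, theta: float) -> complex:
    """Rotate a 2D vector (stored as a complex number) by the angle pos * theta."""
    return v * cmath.exp(1j * pos * theta)


def inner(a: complex, b: complex) -> float:
    """Real 2D inner product a1*b1 + a2*b2 of two vectors stored as complex numbers."""
    return (a * b.conjugate()).real


# Illustrative values (assumptions): q and k stand in for W_q x_m and W_k x_n.
theta = 0.5
q = complex(1.0, 2.0)
k = complex(-0.5, 3.0)


def score_rotate_first(m: int, n: int) -> float:
    """Route 1: rotate q by m*theta and k by n*theta, then take the inner product."""
    return inner(rope_2d(q, m, theta), rope_2d(k, n, theta))


def score_after_inner_product(m: int, n: int) -> float:
    """Route 2: multiply q by conj(k), rotate by (m - n)*theta, keep the real part."""
    return (q * k.conjugate() * cmath.exp(1j * (m - n) * theta)).real


if __name__ == "__main__":
    # Both routes agree, and only the offset m - n matters.
    print(score_rotate_first(5, 2), score_after_inner_product(5, 2))
    print(score_rotate_first(105, 102))
```

Shifting both positions by the same amount (here from (5, 2) to (105, 102)) leaves the score unchanged up to floating-point error, which is the relative-position property described above.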

The General Form of RoPE

To generalize this to the d-dimensional case, consider what the rotation matrix would look like. The authors first assume that d is an even number, so that the d dimensions can be divided into d/2 blocks. Each block performs a 2D rotation independently:

Now, the question is: how much do we rotate in each block? Recalling how sinusoidal position embeddings are applied in transformers, the angle parameter they use for position pos and dimension pair i is:

$$\frac{pos}{10000^{2i/d}}$$

Following this implementation, the authors adopt a similar parameter:

$$\theta_i = 10000^{-2(i-1)/d}, \qquad i \in \{1, 2, \ldots, d/2\}$$

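Here, i indexes the i-th sub-block, and m·θ_i is the rotation angle of that sub-block for a token at position m. Thus, the general form of the rotation matrix is:

$$R^{d}_{\Theta,m} = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}$$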

And the overall transformation applied to a token embedding x_m at position m is:

$$f_{\{q,k\}}(x_m, m) = R^{d}_{\Theta,m}\, W_{\{q,k\}}\, x_m$$

where W is the d-dimensional affine transformation for either the query or the key vector, and R is the rotation matrix mentioned above. Below is a graphical explanation from the original paper.
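To make the block-diagonal rotation concrete, here is a minimal sketch in plain Python (standard library only). The helper names `rope_frequencies`, `rotation_matrix`, `matvec` and `dot` are hypothetical, and the vectors `q` and `k` are arbitrary stand-ins for already-projected query and key vectors. The code uses 0-based block indices, so block `i` in the code corresponds to θ with index i + 1 in the 1-based notation of the text.

```python
import math


def rope_frequencies(d: int, base: float = 10000.0) -> list:
    """Return the d/2 angles theta = base^(-2*i/d) for 0-based block index i."""
    if d % 2 != 0:
        raise ValueError("d must be even")
    return [base ** (-2.0 * i / d) for i in range(d // 2)]


def rotation_matrix(m: int, d: int) -> list:
    """Build the d x d block-diagonal rotation matrix for position m."""
    R = [[0.0] * d for _ in range(d)]
    for i, theta in enumerate(rope_frequencies(d)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        # Each 2x2 block is a counterclockwise rotation by m * theta.
        R[2 * i][2 * i], R[2 * i][2 * i + 1] = c, -s
        R[2 * i + 1][2 * i], R[2 * i + 1][2 * i + 1] = s, c
    return R


def matvec(M: list, v: list) -> list:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def dot(u: list, v: list) -> float:
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    d = 4
    q = [1.0, 2.0, 3.0, 4.0]   # stand-in for W_q x_m (assumption)
    k = [0.5, -1.0, 2.0, 0.0]  # stand-in for W_k x_n (assumption)
    # Same offset m - n = 4 at two different absolute positions.
    near = dot(matvec(rotation_matrix(7, d), q), matvec(rotation_matrix(3, d), k))
    far = dot(matvec(rotation_matrix(14, d), q), matvec(rotation_matrix(10, d), k))
    print(near, far)
```

The two printed scores should agree up to floating-point error, since each block is a rotation and the inner product of two rotated vectors depends only on the difference of their angles.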

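Note that the rotation matrix R is quite sparse, so a direct matrix multiplication is not efficient. Instead, a more computationally efficient realization of the multiplication by R looks like this:

$$R^{d}_{\Theta,m}\, x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix} \otimes \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix}$$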

Here, the operator with a circle and a cross (⊗) denotes the element-wise (Hadamard) product.
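The element-wise realization can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' reference code: the function name `apply_rope` and its `base` parameter are hypothetical, the input `x` is assumed to be an already-projected query or key vector of even length, and adjacent dimensions (x₁, x₂), (x₃, x₄), ... are paired as in the matrix form.

```python
import math


def apply_rope(x: list, m: int, base: float = 10000.0) -> list:
    """Rotate vector x for position m using only element-wise products.

    Computes x * cos + swapped(x) * sin, where swapped(x) is
    (-x2, x1, -x4, x3, ...), so no d x d matrix is ever built.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("the vector length d must be even")
    thetas = [base ** (-2.0 * i / d) for i in range(d // 2)]
    # Each angle is used twice, once per coordinate of its 2D block.
    cos = [math.cos(m * t) for t in thetas for _ in range(2)]
    sin = [math.sin(m * t) for t in thetas for _ in range(2)]
    swapped = []
    for i in range(0, d, 2):
        swapped += [-x[i + 1], x[i]]
    return [xi * c + si * s for xi, c, si, s in zip(x, cos, swapped, sin)]


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]   # stand-in for a projected query (assumption)
    k = [0.5, -1.0, 2.0, 0.0]  # stand-in for a projected key (assumption)
    score = sum(a * b for a, b in zip(apply_rope(q, 7), apply_rope(k, 3)))
    print(score)
```

This costs O(d) work per vector instead of the O(d²) of a dense matrix product. Practical implementations typically apply the same idea to whole batches of queries and keys with vectorized tensor operations.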

How RoPE Improves Language Models

In the original RoPE paper, the authors validate its performance by replacing BERT’s original sinusoidal position encoding with RoPE during pre-training, resulting in a model they call RoFormer. During pre-training, the masked language modeling (MLM) loss shows that BERT with RoPE (RoFormer) converges more quickly than the BERT baseline.

After pre-training, the authors fine-tune the pre-trained RoFormer weights on various GLUE tasks to assess its capabilities on downstream NLP tasks, and RoFormer outperforms BERT on 3 out of the 6 datasets.