This context provides a step-by-step guide to understanding Graph Attention Networks (GATs), a type of Graph Neural Network (GNN) that enhances learning capacity through the attention mechanism, assigning different importance to each neighbor's contribution.
Abstract
The context begins by introducing GNNs and their ability to improve high-impact problems in various fields. It then delves into the concept of recursive neighborhood diffusion, a principle that allows each graph node to receive and aggregate features from its neighbors to represent the local graph structure. The focus of the context is on GATs, which expand the basic aggregation function of the GCN layer by assigning different importance to each edge through the attention coefficients. The context also provides a NumPy implementation of the GAT layer equations, breaking down the process into linear transformation, edge representation, attention coefficients, and neighborhood aggregation.
Opinions
GNNs are a powerful toolbox for learning from graph data, driving improvements for high-impact problems in different fields.
Unlike other types of data, learning from graph data requires specific methods, such as recursive neighborhood diffusion.
The attention mechanism in GATs improves learning capacity by assigning different importance to each neighbor's contribution.
The NumPy implementation of the GAT layer equations provides a clear understanding of the process, from linear transformation to neighborhood aggregation.
The context suggests that the mechanisms behind the Multi-Head GAT Layer and its applications for the link prediction task will be discussed in a future article.
The original paper on Graph Attention Networks and a video from Aleksa Gordić are recommended for a deep explanation of the topic.
The context encourages further readings on Graph Representation Learning through a series of articles.
Graph Neural Networks (GNNs) have emerged as the standard toolbox to learn from graph data. GNNs are able to drive improvements for high-impact problems in different fields, such as content recommendation or drug discovery. Unlike other types of data such as images, learning from graph data requires specific methods. As defined by Michael Bronstein:
[..] these methods are based on some form of message passing on the graph allowing different nodes to exchange information.
For accomplishing specific tasks on graphs (node classification, link prediction, etc.), a GNN layer computes the node and the edge representations through the so-called recursive neighborhood diffusion (or message passing). According to this principle, each graph node receives and aggregates features from its neighbors in order to represent the local graph structure: different types of GNN layers perform diverse aggregation strategies.
The simplest formulations of the GNN layer, such as Graph Convolutional Networks (GCNs) or GraphSage, execute an isotropic aggregation, where each neighbor contributes equally to update the representation of the central node. This blog post is dedicated to the analysis of Graph Attention Networks (GATs), which define an anisotropy operation in the recursive neighborhood diffusion. Exploiting the anisotropy paradigm, the learning capacity is improved by the attention mechanism, which assigns different importance to each neighbor's contribution.
If you are completely new to GNNs and related concepts, I invite you to read the following introductory article:
This warm-up is based on the GAT details reported by the Deep Graph Library website.
Before understanding the GAT layer's behavior, let’s recap the math behind the aggregation performed by the GCN layer.
GCN Layer — Aggregation Function
N isthe set of the one-hop neighbors of node i. This node could also be included among the neighbors by adding a self-loop.
c is a normalization constant based on the graph structure, which defines an isotropic average computation.
σ is an activation function, which introduces non-linearity in the transformation.
W is the weight matrix of learnable parameters adopted for feature transformation.
The GAT layer expands the basic aggregation function of the GCN layer, assigning different importance to each edge through the attention coefficients.
GAT Layer Equations
Equation (1) is a linear transformation of the lower layer embedding h_i, and W is its learnable weight matrix. This transformation helps achieve a sufficient expressive power to transform the input features into high-level and dense features.
Equation (2) computes a pair-wise un-normalized attention score between two neighbors. Here, it first concatenates the z embeddings of the two nodes, where || denotes concatenation. Then, it takes a dot product of such concatenation and a learnable weight vector a. In the end, a LeakyReLU is applied to the result of the dot product. The attention score indicates the importance of a neighbor node in the message passing framework.
Equation (3) applies a softmax to normalize the attention scores on each node’s incoming edges. The softmax encodes the output of the previous step in a probability distribution. As a consequence, the attention scores are much more comparable across different nodes.
Equation (4) is similar to the GCN aggregation (see the equation at the beginning of the section). The embeddings from neighbors are aggregated together, scaled by the attention scores. The main consequence of this scaling process is to learn a different contribution from each neighborhood node.
NumPy Implementation
The first step is to prepare the ingredients (matrices) to represent a simple graph and perform the linear transformation.
NumPy code
Output
----- One-hot vector representation of nodes. Shape(n,n)
The first matrix defines a one-hot encoded representation of the nodes (node 1 is shown in bold). Then, we define a weight matrix, exploiting the defined embedding dimension. I have highlighted the 3rd column vector of W because, as you will see shortly, this vector defines the updated representation of node 1 (a 1-value is initialized in the 3rd position). We can perform the linear transformation to achieve sufficient expressive power for node features starting from these ingredients. This step aims to transform the (one-hot encoded) input features into a low and dense representation.
The next operation is to introduce the self-attention coefficients for each edge. We concatenate the representation of the source node and the destination node's representation for representing edges. This concatenation process is enabled by the adjacency matrix A, which defines the relations between all the nodes in the graph.
In the previous block, I have highlighted the 4 rows representing the 4 in-edges connected to node 1. The first 3 elements of each row define the embedding representation of node 1 neighbors, while the other 3 elements of each row define the embeddings of node 1 itself (as you can notice, the first row encodes a self-loop).
After this operation, we can introduce the attention coefficients and multiply them with the edge representation, resulting from the concatenation process. Finally, the Leaky Relu function is applied to the output of this product.
At the end of this process, we achieved a different score for each edge of the graph. In the upper block, I have highlighted the evolution of the coefficient associated with the first edge. Then, to make the coefficient easily comparable across different nodes, a softmax function is applied to all neighbors' contributions for every destination node.
As you can see, instead of having 1 values to define edges, we rescaled the contribution of each neighbor. The final step is to compute the neighborhood aggregation: the embeddings from neighbors are incorporated into the destination node, scaled by the attention scores.
In the future article, I will describe the mechanisms behind the Multi-Head GAT Layer, and we will see some applications for the link prediction task.
References
The running version of the code is available in the following notebook. You will also find a DGL implementation, which is useful to check the correctness of the implementation.
The original paper on Graph Attention Networks from Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengiois available on arXiv.
For a deep explanation of the topic, I also suggest the video from Aleksa Gordić.