Detailed Explanation of the Principles of Graph Attention Networks (GAT) in Graph Neural Networks (GNN)

Knowledge Point Description
Graph Attention Network (GAT) is a graph neural network model built on the attention mechanism. It employs self-attention so that each node in the graph can adaptively and differentially aggregate information from its neighbors, without relying on fixed structural properties (such as node degree) to set the weights. The core idea of GAT is to compute a learned attention weight for each neighbor, thereby capturing the important relationships between nodes more effectively.

Problem-Solving Process/Principle Explanation

1. The Fundamental Problem in Graph Neural Networks
In traditional Graph Convolutional Networks (GCN), the aggregation of node features is essentially uniform: the neighbors of a central node either contribute equally, or their contributions are fixed by a normalization that depends only on node degree. In real-world graphs, however, different neighbors usually matter to different degrees for a given central node. GAT is designed to model this heterogeneity and thereby achieve more refined information aggregation.

2. Calculation of Attention Coefficients
The first step in GAT is to compute an attention coefficient for each pair of adjacent nodes, indicating the importance of neighbor node j to central node i.

  • Input: Each node i starts with a feature vector, denoted as \(\mathbf{h}_i\).
  • Linear Transformation: To enhance the model's expressive power, we first apply a shared weight matrix \(\mathbf{W}\) to linearly transform the features of all nodes:
    \(\mathbf{z}_i = \mathbf{W} \mathbf{h}_i\)
    This maps the node features into a new, potentially higher-dimensional feature space.
  • Attention Mechanism: Next, we compute the attention score \(e_{ij}\) between node i and each of its neighboring nodes j (including node i itself). GAT uses a single-layer feedforward neural network \(a\), parameterized by a weight vector \(\overrightarrow{\mathbf{a}}\), to compute this score:
    \(e_{ij} = \text{LeakyReLU}\left( \overrightarrow{\mathbf{a}}^T [\mathbf{z}_i \, \Vert \, \mathbf{z}_j] \right)\)
    Here, \(\Vert\) denotes vector concatenation. LeakyReLU is a nonlinear activation function (the original GAT paper uses a negative slope of 0.2); unlike ReLU, it does not zero out negative inputs, so negative scores remain informative. The score \(e_{ij}\) is unnormalized and directly reflects the "raw" importance of node j to node i; see the sketch after this list.
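
To make the two steps above concrete, here is a minimal PyTorch sketch (the dimensions, the random features, and the use of `nn.Linear` to hold \(\mathbf{W}\) and \(\overrightarrow{\mathbf{a}}\) are illustrative choices, not fixed by GAT):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
in_dim, out_dim, n_nodes = 8, 16, 4
h = torch.randn(n_nodes, in_dim)               # initial node features h_i

W = nn.Linear(in_dim, out_dim, bias=False)     # shared weight matrix W
a = nn.Linear(2 * out_dim, 1, bias=False)      # attention vector a, as a 1-output layer
leaky_relu = nn.LeakyReLU(negative_slope=0.2)  # slope 0.2 as in the GAT paper

z = W(h)                                       # z_i = W h_i, shape (N, out_dim)

# e_ij = LeakyReLU(a^T [z_i || z_j]) for every ordered pair (i, j)
z_i = z.unsqueeze(1).expand(-1, n_nodes, -1)   # (N, N, out_dim), varies over i
z_j = z.unsqueeze(0).expand(n_nodes, -1, -1)   # (N, N, out_dim), varies over j
e = leaky_relu(a(torch.cat([z_i, z_j], dim=-1))).squeeze(-1)  # (N, N)
```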

3. Normalization of Attention Weights
The raw attention scores \(e_{ij}\) can vary widely in scale across different nodes and neighborhoods. To make them comparable and to stabilize training, we normalize them so that the attention weights over node i's neighborhood (including i itself) sum to 1. GAT uses the softmax function for this normalization.

  • Softmax Normalization: The final attention weight \(\alpha_{ij}\) of node j for node i is obtained via softmax:
    \(\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}\)
    where \(\mathcal{N}(i)\) denotes the set of neighboring nodes for node i, typically including node i itself (ensuring the model considers its own features during aggregation). This formula ensures \(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} = 1\) and \(0 \le \alpha_{ij} \le 1\).
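
Continuing the sketch above, restricting the softmax to each node's neighborhood amounts to masking out non-neighbors before normalizing; the adjacency matrix `adj` below (with self-loops, so that \(\mathcal{N}(i)\) contains i) is purely illustrative:

```python
# adj[i, j] = True means j is in N(i); the diagonal adds the self-loops.
adj = torch.tensor([[1, 1, 0, 1],
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [1, 0, 1, 1]], dtype=torch.bool)

# Non-neighbors get -inf so their softmax weight is exactly zero.
e_masked = e.masked_fill(~adj, float("-inf"))
alpha = torch.softmax(e_masked, dim=1)  # row i now sums to 1 over N(i)
```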

4. Information Aggregation and Output
After obtaining the normalized attention weights, we can perform a weighted sum of the transformed features of the neighboring nodes to obtain the new feature representation for node i.

  • Weighted Sum: The new feature \(\mathbf{h}_i'\) of node i is the weighted sum of the transformed features \(\mathbf{z}_j\) of all its neighboring nodes (including itself), with the weights being the computed \(\alpha_{ij}\):
    \(\mathbf{h}_i' = \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \mathbf{z}_j \right)\)
    Here, \(\sigma\) is a nonlinear activation function, such as ELU or ReLU.
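
Continuing the same sketch, the weighted aggregation reduces to a single matrix product (using ELU as one possible choice for \(\sigma\)):

```python
import torch.nn.functional as F

# h'_i = ELU(sum_j alpha_ij * z_j): one matrix product handles all nodes at once.
h_new = F.elu(alpha @ z)  # shape (N, out_dim)
```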

5. Multi-head Attention
To stabilize the learning process of the attention mechanism and enhance the model's expressive power (in the spirit of multi-head attention in the Transformer), GAT typically employs multi-head attention.

  • Principle: We independently perform the above attention calculation K times (i.e., K heads). Each computation yields a different set of attention weights and a new node feature representation.
  • Aggregation Methods: For the final output node features, two common aggregation methods are used:
    1. Concatenation (often used in intermediate layers): Concatenate the K feature vectors obtained from the K heads.
      \(\mathbf{h}_i' = \Vert_{k=1}^{K} \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^k \mathbf{W}^k \mathbf{h}_j \right)\)
    2. Averaging (often used in the output layer): Average the K feature vectors obtained from the K heads.
      \(\mathbf{h}_i' = \sigma\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^k \mathbf{W}^k \mathbf{h}_j \right)\)

Multi-head attention allows the model to jointly attend to information from different representation subspaces, thereby capturing richer relationships between nodes.
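
The pieces above can be combined into a complete multi-head layer. The following is a self-contained sketch in plain PyTorch; the class name `GATLayer`, the dense adjacency representation, and the default of K = 4 heads are illustrative assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Multi-head GAT layer over a dense adjacency matrix (illustrative sketch)."""

    def __init__(self, in_dim, out_dim, n_heads=4, concat=True):
        super().__init__()
        self.n_heads, self.out_dim, self.concat = n_heads, out_dim, concat
        # One W^k per head, stored as a single (in_dim -> K * out_dim) map.
        self.W = nn.Linear(in_dim, n_heads * out_dim, bias=False)
        # One attention vector a^k per head, split into its z_i / z_j halves.
        self.a = nn.Parameter(torch.empty(n_heads, 2 * out_dim))
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        N = h.size(0)
        z = self.W(h).view(N, self.n_heads, self.out_dim)           # (N, K, F')
        # a^T [z_i || z_j] = a_src . z_i + a_dst . z_j, computed per head.
        src = (z * self.a[:, :self.out_dim]).sum(-1)                # (N, K)
        dst = (z * self.a[:, self.out_dim:]).sum(-1)                # (N, K)
        e = F.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0), 0.2)  # (N, N, K)
        e = e.masked_fill(~adj.unsqueeze(-1), float("-inf"))        # keep N(i) only
        alpha = torch.softmax(e, dim=1)                             # normalize over j
        out = torch.einsum("ijk,jkf->ikf", alpha, z)                # per-head weighted sums
        if self.concat:                    # intermediate layer: concatenate the K heads
            return F.elu(out.reshape(N, self.n_heads * self.out_dim))
        return out.mean(dim=1)             # output layer: average the K heads

# Illustrative usage on a 4-node graph with self-loops:
adj = torch.eye(4, dtype=torch.bool) | torch.rand(4, 4).gt(0.5)
layer = GATLayer(in_dim=8, out_dim=16, n_heads=4)
h_out = layer(torch.randn(4, 8), adj)      # -> shape (4, 64) since concat=True
```

Note that the score \(\overrightarrow{\mathbf{a}}^T [\mathbf{z}_i \, \Vert \, \mathbf{z}_j]\) splits into two per-half dot products, which is why the sketch never materializes the concatenated pairs; this is a common efficiency trick and does not change the math.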

Summary
Graph Attention Networks (GAT) achieve differentiated aggregation of neighboring nodes' information by introducing a self-attention mechanism. The core steps are: 1) enhance feature expressiveness through a shared linear transformation; 2) compute raw attention scores between adjacent nodes with a simple single-layer network; 3) normalize the scores with softmax to obtain attention weights; 4) update each node's representation via a weighted sum of its neighbors' transformed features; 5) (optionally) use multi-head attention to improve robustness and performance. Because the attention coefficients depend only on node features and local connectivity, GAT does not require the complete graph structure in advance, making it well suited to inductive learning tasks; it implicitly assigns different importance to different neighbors, representing a significant advancement in the field of graph neural networks.