Basic Principles of Recurrent Neural Networks (RNN) and the Vanishing Gradient Problem
Problem Description
A Recurrent Neural Network (RNN) is a specialized neural network architecture designed for processing sequential data. Unlike standard feedforward neural networks, RNNs possess "memory" capability, enabling them to utilize information from previous steps to process the current input. However, standard RNNs suffer from a well-known challenge in practice—the vanishing or exploding gradient problem—which makes it difficult for them to learn long-term dependencies within long sequences. Please explain the basic working principles of RNNs and provide an in-depth analysis of the causes and impacts of the vanishing gradient problem.
Knowledge Explanation
Step 1: Basic Structure and Working Principle of RNN
- Core Idea: Traditional neural networks assume all inputs are independent of one another. However, in many tasks (such as natural language processing and speech recognition), the input is a sequence whose elements are interrelated. The core idea of an RNN is to capture this dependency between sequence elements by introducing "recurrence" through a hidden state.
- Unrolled Structure: For easier understanding, we can "unroll" an RNN along the time dimension. Suppose we have an input sequence (x_0, x_1, ..., x_t, ...).
  - At each time step t, the RNN receives two inputs:
    - The input data for the current time step, x_t.
    - The hidden state h_{t-1} from the previous time step. This hidden state can be seen as the network's "memory" up to the current moment.
  - At each time step t, the RNN computes two outputs:
    - The hidden state for the current time step, h_t.
    - An optional actual output o_t (e.g., the predicted next word).
- Forward Propagation Process: The RNN reuses the same set of parameters (W, U, V) at every time step. A minimal forward-pass sketch follows this list.
  - Hidden State Calculation: h_t = \tanh(W \cdot h_{t-1} + U \cdot x_t + b)
    - W is the weight matrix connecting the previous hidden state h_{t-1} to the current hidden state h_t.
    - U is the weight matrix connecting the current input x_t to the current hidden state h_t.
    - b is the bias term.
    - tanh is the activation function (tanh and ReLU are common choices); it compresses the output values into the range (-1, 1), which keeps the hidden state bounded.
  - Output Calculation: o_t = V \cdot h_t + c
    - V is the weight matrix connecting the current hidden state h_t to the output o_t.
    - c is the bias term for the output layer.
  - Through this structure, h_t accumulates information from the start of the sequence (x_0) up to the current moment (x_t). Theoretically, an RNN can utilize arbitrarily long historical information.
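To make the recurrence above concrete, here is a minimal NumPy sketch of the unrolled forward pass. It is a sketch only: the dimensions, the random initialization, and the helper name rnn_forward are illustrative assumptions rather than anything prescribed in the text.

```python
import numpy as np

# Illustrative sizes and initialization (assumptions, not from the text).
input_size, hidden_size, output_size = 8, 16, 4
rng = np.random.default_rng(0)

# The same parameters (W, U, V, b, c) are reused at every time step.
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # h_{t-1} -> h_t
U = rng.normal(scale=0.1, size=(hidden_size, input_size))   # x_t     -> h_t
V = rng.normal(scale=0.1, size=(output_size, hidden_size))  # h_t     -> o_t
b = np.zeros(hidden_size)
c = np.zeros(output_size)

def rnn_forward(xs):
    """Unroll the RNN over a sequence xs (a list of x_t vectors)."""
    h = np.zeros(hidden_size)              # initial hidden state ("empty memory")
    hidden_states, outputs = [], []
    for x_t in xs:
        h = np.tanh(W @ h + U @ x_t + b)   # h_t = tanh(W·h_{t-1} + U·x_t + b)
        o = V @ h + c                      # o_t = V·h_t + c
        hidden_states.append(h)
        outputs.append(o)
    return hidden_states, outputs

# Example: a random 5-step input sequence.
sequence = [rng.normal(size=input_size) for _ in range(5)]
hs, os = rnn_forward(sequence)
print(len(hs), os[-1].shape)               # 5 (4,)
```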
Step 2: Causes of the Vanishing/Exploding Gradient Problem
RNNs learn parameters via the Backpropagation Through Time (BPTT) algorithm. The problem arises during the BPTT process.
- Loss Function: For a sequence task, the total loss L is typically the sum of the losses at all time steps, i.e., L = Σ_t L_t.
- Key to BPTT: The Chain Rule: To update a parameter (e.g., W), we need the gradient of the loss L with respect to that parameter, ∂L/∂W. By the chain rule, this gradient decomposes into a sum of contributions from each time step: ∂L/∂W = Σ_t ∂L_t/∂W.
  - Each per-step gradient ∂L_t/∂W must itself be backpropagated from time step t all the way back to time step 0. For example, the gradient of L_t with respect to W depends on h_t, which in turn depends on h_{t-1} and W, and so on back to h_0.
- Multiplication of Gradients: This backpropagation causes the gradient ∂L_t/∂W to contain a product of Jacobian matrices, namely the product of partial derivatives of the hidden state with respect to earlier hidden states: ∂h_t/∂h_k = Π_{i=k+1}^{t} (∂h_i/∂h_{i-1}), where k < t.
  - This multiplicative term is a key factor in computing ∂L_t/∂W.
- The Problem Emerges: Eigenvalues of the Jacobian: The magnitude of each Jacobian matrix ∂h_i/∂h_{i-1} depends on the recurrent weight matrix W and on the derivative of the activation function (e.g., tanh); the derivative of tanh lies in the range (0, 1].
  - When the eigenvalues of these Jacobian matrices are consistently less than 1, the product shrinks toward zero at an exponential rate. This is the Vanishing Gradient problem: gradients from distant time steps (small k) become almost zero, so the parameter W is hardly updated by information from those early time steps, and the RNN cannot learn long-term dependencies.
  - Conversely, if the eigenvalues of the Jacobian matrices are consistently greater than 1, the product grows at an exponential rate. This is the Exploding Gradient problem: parameter updates become excessively large, and the model fails to converge. A small numerical sketch of both effects follows this list.
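The multiplicative effect can be seen numerically. The sketch below assumes the tanh recurrence from Step 1 with small, randomly initialized weights (the sizes and weight scales are illustrative assumptions): it accumulates the product of the actual Jacobians ∂h_i/∂h_{i-1} = diag(1 - h_i^2) · W and prints its norm, which decays exponentially, and then shows the isolated multiplicative effect for a per-step factor of 0.9 versus 1.1.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, seq_len = 16, 50

# Vanishing case: a tanh RNN with modest recurrent weights.
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
h = np.zeros(hidden_size)
prod = np.eye(hidden_size)            # accumulates Π_i ∂h_i/∂h_{i-1}
for i in range(seq_len):
    x = rng.normal(size=hidden_size)
    h = np.tanh(W @ h + U @ x)
    jac = np.diag(1.0 - h**2) @ W     # ∂h_i/∂h_{i-1} for h_i = tanh(W·h_{i-1} + U·x_i)
    prod = jac @ prod
    if (i + 1) % 10 == 0:
        print(f"t = {i + 1:2d}   ||∂h_t/∂h_0|| ≈ {np.linalg.norm(prod, 2):.2e}")

# The multiplicative effect in isolation: if every Jacobian contributed a
# factor a, the product over 50 steps scales like a**50.
for a in (0.9, 1.1):
    print(f"a = {a}: a**50 ≈ {a**50:.2e}")  # 0.9**50 ≈ 5.2e-03, 1.1**50 ≈ 1.2e+02
```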
Step 3: Impacts of the Vanishing/Exploding Gradient Problem
- Impact of Vanishing Gradients: The model becomes "forgetful" or "short-sighted." It can only learn short-term patterns in the sequence and fails to utilize important information from the beginning. For example, to predict the last word "sky" in the sentence "The clouds are in the sky," the model needs to remember the long-term dependency on the opening word "clouds." Vanishing gradients make it difficult for the model to establish such connections.
- Impact of Exploding Gradients: The training process becomes extremely unstable. The loss value may oscillate wildly or even become NaN (Not a Number), causing training to fail completely.
Summary
RNNs cleverly handle sequential data through recurrent connections and hidden states. However, the multiplicative effect of gradients inherent in their training process (BPTT) makes them highly susceptible to vanishing or exploding gradient problems, thereby limiting their ability to process long sequences. To address this issue, more complex gated RNN architectures were subsequently developed, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU). These models effectively mitigate the vanishing gradient problem through ingeniously designed "gates" that control the flow and forgetting of information.
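As a rough illustration of what those gates look like, here is a minimal sketch of a single LSTM step using the standard gating equations; the parameter names, shapes, and initialization are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: gates decide what to forget, what to write, and what to expose."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])   # forget gate: how much of c_{t-1} to keep
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])   # input gate: how much new content to write
    g_t = np.tanh(p["W_g"] @ z + p["b_g"])   # candidate cell content
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])   # output gate: how much of the cell to expose
    c_t = f_t * c_prev + i_t * g_t           # additive cell-state update
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

# Illustrative shapes and random parameters (assumptions).
rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16
dim = hidden_size + input_size
p = {name: rng.normal(scale=0.1, size=(hidden_size, dim)) for name in ("W_f", "W_i", "W_g", "W_o")}
p.update({f"b_{g}": np.zeros(hidden_size) for g in "figo"})

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c, p)
```

The key design point is the additive update of the cell state c_t: when the forget gate stays close to 1, gradients can flow across many time steps without being repeatedly squashed by the tanh(W·h) recurrence of a vanilla RNN.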