Principles and Computational Steps of Backpropagation Algorithm
Problem Description
Backpropagation is the core algorithm for training neural networks: it efficiently computes the gradients of the loss function with respect to the network parameters. Using the chain rule, the algorithm propagates error signals layer by layer from the output layer back toward the input layer, computing the gradient of every weight and bias; these gradients are then used by gradient descent to update the parameters.
1. Forward Propagation Process
Assume a neural network has L layers. The weighted input of layer l is z^(l) = W^(l)a^(l-1) + b^(l), and the activated output is a^(l) = σ(z^(l)), where σ is the activation function applied elementwise. The input is a^(0) = x and the network output is a^(L) = ŷ.
- Calculation Steps:
- Input layer: a^(0) = x
- Hidden layers: For l=1 to L-1, compute z^(l)=W^(l)a^(l-1)+b^(l), a^(l)=σ(z^(l))
- Output layer: z^(L)=W^(L)a^(L-1)+b^(L), a^(L)=ŷ (e.g., Softmax output)
- Loss Function: For example, cross-entropy loss J(ŷ, y) = -Σy_i log(ŷ_i)
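The forward pass above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not part of the original text: the helper names (forward, sigmoid, softmax, cross_entropy) are assumptions, and it assumes a single input vector, sigmoid hidden layers, and a Softmax output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))              # shift by max(z) for numerical stability
    return e / np.sum(e)

def forward(x, weights, biases):
    """Forward propagation: cache z^(l) and a^(l) for every layer.

    weights[l] and biases[l] hold W^(l+1) and b^(l+1) of the text (0-indexed lists).
    """
    a = x
    zs, activations = [], [a]              # activations[0] is a^(0) = x
    L = len(weights)
    for l in range(L):
        z = weights[l] @ a + biases[l]     # z^(l) = W^(l) a^(l-1) + b^(l)
        zs.append(z)
        a = softmax(z) if l == L - 1 else sigmoid(z)   # Softmax at the output layer
        activations.append(a)
    return zs, activations

def cross_entropy(y_hat, y):
    return -np.sum(y * np.log(y_hat + 1e-12))          # J(ŷ, y) = -Σ y_i log ŷ_i
```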
2. Definition of Error Term and Output Layer Calculation
Define the error term for layer l as δ^(l) = ∂J/∂z^(l), representing the sensitivity of the loss to the weighted input.
- Output Layer Error (Chain Rule):
δ^(L) = ∂J/∂z^(L) = (∂J/∂a^(L)) ⊙ (∂a^(L)/∂z^(L))
Taking Softmax + Cross-Entropy as an example: ∂J/∂a^(L) = -y ⊘ ŷ (elementwise division), and ∂a^(L)/∂z^(L) is the Softmax Jacobian matrix; multiplying the two and simplifying (derivation omitted) gives the compact result δ^(L) = ŷ - y.
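A quick numerical check of this simplification (the values below are made up for illustration): the explicit product of the cross-entropy gradient and the Softmax Jacobian matches ŷ - y.

```python
import numpy as np

y   = np.array([0.0, 1.0, 0.0])            # one-hot target
z_L = np.array([0.3, 1.2, -0.5])           # illustrative output-layer pre-activation z^(L)

e = np.exp(z_L - np.max(z_L))
y_hat = e / e.sum()                        # a^(L) = Softmax(z^(L))

delta_L = y_hat - y                        # simplified form: δ^(L) = ŷ - y

jacobian = np.diag(y_hat) - np.outer(y_hat, y_hat)   # Softmax Jacobian ∂a^(L)/∂z^(L)
dJ_da = -y / y_hat                                   # ∂J/∂a^(L) = -y ⊘ ŷ
print(np.allclose(delta_L, jacobian.T @ dJ_da))      # True
```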
3. Backpropagating the Error
Propagate the error from layer l+1 back to layer l using the chain rule:
δ^(l) = ( (W^(l+1))^T δ^(l+1) ) ⊙ σ'(z^(l))
- Derivation Process:
z^(l+1) = W^(l+1)a^(l) + b^(l+1), and a^(l)=σ(z^(l))
→ ∂J/∂z^(l) = (∂J/∂z^(l+1)) · (∂z^(l+1)/∂a^(l)) · (∂a^(l)/∂z^(l))
→ ∂z^(l+1)/∂a^(l) = W^(l+1), ∂a^(l)/∂z^(l) = diag(σ'(z^(l)))
→ Writing the chain rule for column-vector gradients (each Jacobian is transposed and applied from the left): δ^(l) = diag(σ'(z^(l))) (W^(l+1))^T δ^(l+1) = (W^(l+1))^T δ^(l+1) ⊙ σ'(z^(l))
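A sketch of this recursion in NumPy, assuming the zs/activations caches from the earlier forward-pass snippet and sigmoid hidden layers (backward_deltas and sigmoid_prime are illustrative names):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)                   # σ'(z) = σ(z)(1 - σ(z))

def backward_deltas(zs, activations, y, weights):
    """Compute δ^(l) for every layer, from the output layer back to layer 1."""
    L = len(weights)
    deltas = [None] * L
    deltas[L - 1] = activations[-1] - y    # δ^(L) = ŷ - y (Softmax + cross-entropy)
    for l in range(L - 2, -1, -1):         # l = L-1 down to 1 in the text's numbering
        deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * sigmoid_prime(zs[l])
    return deltas
```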
4. Parameter Gradient Calculation
Use the error term δ^(l) to calculate the gradients for weights and biases:
- Weight Gradient: ∂J/∂W^(l) = δ^(l) (a^(l-1))^T
Derivation: z^(l) = W^(l)a^(l-1) + b^(l) → ∂J/∂W^(l) = ∂J/∂z^(l) · ∂z^(l)/∂W^(l) = δ^(l) (a^(l-1))^T
- Bias Gradient: ∂J/∂b^(l) = δ^(l)
Derivation: ∂z^(l)/∂b^(l) = I → ∂J/∂b^(l) = δ^(l)
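With the per-layer errors in hand, both gradient formulas reduce to an outer product and a copy; a short sketch following the same conventions as the snippets above (parameter_gradients is an illustrative name):

```python
import numpy as np

def parameter_gradients(deltas, activations):
    """∂J/∂W^(l) = δ^(l) (a^(l-1))^T and ∂J/∂b^(l) = δ^(l), layer by layer."""
    grad_W = [np.outer(d, a_prev) for d, a_prev in zip(deltas, activations[:-1])]
    grad_b = [d.copy() for d in deltas]
    return grad_W, grad_b
```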
5. Algorithm Steps Summary
- Forward Propagation: Compute z^(l) and a^(l) for each layer.
- Calculate Output Error: δ^(L) = ŷ - y (using the Softmax + cross-entropy classification task as an example)
- Backpropagate Error: For l=L-1 down to 1, compute δ^(l) = (W^(l+1))^T δ^(l+1) ⊙ σ'(z^(l))
- Calculate Gradients: ∂J/∂W^(l) = δ^(l)(a^(l-1))^T, ∂J/∂b^(l) = δ^(l)
- Update Parameters: Adjust W and b by gradient descent, W^(l) ← W^(l) - η ∂J/∂W^(l), b^(l) ← b^(l) - η ∂J/∂b^(l), where η is the learning rate.
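Putting the five steps together into a single gradient-descent update. This sketch reuses the forward, backward_deltas, parameter_gradients, and cross_entropy helpers defined in the previous snippets; the 4-3-2 network, random initialization, and learning rate η = 0.1 are purely illustrative.

```python
import numpy as np

def train_step(x, y, weights, biases, eta=0.1):
    zs, activations = forward(x, weights, biases)               # 1. forward propagation
    deltas = backward_deltas(zs, activations, y, weights)       # 2-3. output error + backprop
    grad_W, grad_b = parameter_gradients(deltas, activations)   # 4. gradients
    for l in range(len(weights)):                               # 5. gradient-descent update
        weights[l] -= eta * grad_W[l]
        biases[l]  -= eta * grad_b[l]
    return cross_entropy(activations[-1], y)                    # loss before the update

# Tiny usage example: one sample pushed through a 4-3-2 network
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases  = [np.zeros(3), np.zeros(2)]
x, y = rng.normal(size=4), np.array([1.0, 0.0])
for _ in range(100):
    loss = train_step(x, y, weights, biases)
print(loss)                                                     # should approach 0
```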
Key Points
- The chain rule enables efficient gradient computation, avoiding redundant calculations.
- Intermediate results from forward propagation (z^(l) and a^(l)) must be saved for use in backpropagation.
- The derivative of the activation function σ' should be cheap to compute (e.g., the derivative of ReLU is 0 for negative inputs and 1 for positive inputs).
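For reference, a small illustrative sketch of two such derivatives (ReLU and sigmoid), showing that each costs little more than a comparison or a multiply:

```python
import numpy as np

def relu_prime(z):
    return (z > 0).astype(z.dtype)         # 0 for negative inputs, 1 for positive inputs

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)                   # σ'(z) = σ(z)(1 - σ(z))
```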