Derivation and Gradient Calculation of the Softmax Function
Problem Description
The Softmax function is a core activation function in deep learning and machine learning for multi-class classification tasks: it transforms a K-dimensional real-valued vector into a probability distribution. Job interviews often ask candidates to derive the partial derivatives of the Softmax function with respect to its input variables and to explain how the result is used in backpropagation. Understanding this process is essential for mastering neural network training.
1. Definition of the Softmax Function
For a K-class classification problem, given an input vector z = [z₁, z₂, ..., zₖ]ᵀ, the Softmax function computes the probability for the i-th class as:
\[p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]
Here, the denominator is the sum of all the exponentials, ensuring ∑pᵢ = 1 and pᵢ ∈ (0,1). For example, if z = [2, 1, 0.1], then the denominator is e² + e¹ + e⁰·¹ ≈ 7.39 + 2.72 + 1.11 = 11.22, resulting in probabilities p ≈ [0.66, 0.24, 0.10].
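As a quick numerical check of this example, here is a minimal NumPy sketch (the `softmax` helper and the max-subtraction trick for numerical stability are illustrative choices, not something required by the definition):

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum does not change the result (the shift cancels
    # in the numerator and denominator) but avoids overflow for large logits.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)        # ~[0.659, 0.242, 0.099]
print(p.sum())  # 1.0 (up to floating-point error)
```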
2. Derivative Scenario Analysis
We need to find ∂pᵢ/∂zⱼ, i.e., the partial derivative of the i-th output with respect to the j-th input. There are two cases:
- When i = j: Find the partial derivative of pᵢ with respect to its own input zᵢ.
- When i ≠ j: Find the partial derivative of pᵢ with respect to another input zⱼ.
3. Derivative Process for i = j
Treating pᵢ as the quotient of the numerator e^{zᵢ} and the denominator S = ∑ₖ e^{zₖ}, and applying the quotient rule:
\[\frac{\partial p_i}{\partial z_i} = \frac{e^{z_i} \cdot S - e^{z_i} \cdot e^{z_i}}{S^2} \]
Factoring out e^{zᵢ}:
\[= \frac{e^{z_i}(S - e^{z_i})}{S^2} = \frac{e^{z_i}}{S} \cdot \frac{S - e^{z_i}}{S} \]
Substituting pᵢ = e^{zᵢ}/S:
\[= p_i \cdot (1 - p_i) \]
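This result can be sanity-checked numerically. The self-contained sketch below (the `softmax` helper and step size `eps` are illustrative) compares a finite-difference estimate of ∂p₁/∂z₁ with p₁(1 - p₁):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
eps = 1e-6

# z_1 and p_1 in the text correspond to index 0 here.
z_plus = z.copy()
z_plus[0] += eps
numeric = (softmax(z_plus)[0] - p[0]) / eps
analytic = p[0] * (1 - p[0])
print(numeric, analytic)  # both ~0.225
```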
4. Derivative Process for i ≠ j
In this case, the numerator e^{zᵢ} does not depend on zⱼ; only the denominator S contains e^{zⱼ}. Applying the quotient rule:
\[\frac{\partial p_i}{\partial z_j} = \frac{0 \cdot S - e^{z_i} \cdot e^{z_j}}{S^2} \]
Simplifying:
\[= -\frac{e^{z_i} e^{z_j}}{S^2} = -\frac{e^{z_i}}{S} \cdot \frac{e^{z_j}}{S} \]
Substituting the definitions of pᵢ and pⱼ:
\[= -p_i p_j \]
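The same finite-difference idea checks the off-diagonal case. The sketch below (again with an illustrative `softmax` helper) compares a numerical estimate of ∂p₁/∂z₂ with -p₁p₂:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
eps = 1e-6

# Perturb z_2 (index 1) and watch p_1 (index 0) change.
z_plus = z.copy()
z_plus[1] += eps
numeric = (softmax(z_plus)[0] - p[0]) / eps
analytic = -p[0] * p[1]
print(numeric, analytic)  # both ~-0.16
```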
5. Unified Gradient Formula
Combining both cases and introducing the Kronecker delta δᵢⱼ (equal to 1 when i = j and 0 otherwise):
\[\frac{\partial p_i}{\partial z_j} = p_i (\delta_{ij} - p_j) \]
For example, for the Softmax output p of z = [2, 1, 0.1], ∂p₁/∂z₂ = -p₁p₂ ≈ -0.66 × 0.24 ≈ -0.16, indicating that p₁ decreases when z₂ increases.
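In matrix form, the unified formula says the Jacobian of Softmax is J = diag(p) - p pᵀ. The short sketch below (variable names are illustrative) builds this matrix and recovers the entry discussed above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))

# Jacobian: J[i, j] = p_i * (delta_ij - p_j), i.e. diag(p) - outer(p, p)
J = np.diag(p) - np.outer(p, p)
print(J[0, 1])        # ~-0.16, i.e. dp_1/dz_2
print(J.sum(axis=0))  # ~[0, 0, 0]: each column sums to 0 because sum(p) stays 1
```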
6. Backpropagation in Cross-Entropy Loss
Let the true label y be a one-hot vector. The cross-entropy loss is L = -∑yₖ log pₖ. During backpropagation, we need to compute ∂L/∂zⱼ:
\[\frac{\partial L}{\partial z_j} = -\sum_{k=1}^K y_k \frac{\partial \log p_k}{\partial z_j} \]
Substituting ∂log pₖ/∂zⱼ = (1/pₖ)·∂pₖ/∂zⱼ and using the Softmax derivative formula:
\[= -\sum_k y_k (\delta_{kj} - p_j) = p_j \sum_k y_k - y_j \]
Since ∑yₖ = 1, we arrive at the key conclusion:
\[\frac{\partial L}{\partial z_j} = p_j - y_j \]
This gradient directly represents the difference between the predicted probability and the true label, enabling efficient parameter updates.
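A minimal sketch of this result (the `cross_entropy` helper and the choice of the first class as the true class are illustrative assumptions) confirms that p - y matches a finite-difference estimate of the loss gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    # L = -sum_k y_k * log(p_k) with p = softmax(z)
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])  # one-hot: the first class is the true class

analytic = softmax(z) - y      # the p - y formula
numeric = np.zeros_like(z)
eps = 1e-6
for j in range(len(z)):
    z_plus = z.copy()
    z_plus[j] += eps
    numeric[j] = (cross_entropy(z_plus, y) - cross_entropy(z, y)) / eps

print(analytic)  # ~[-0.341, 0.242, 0.099]
print(numeric)   # matches the analytic gradient to ~1e-6
```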
Summary
The core of the Softmax derivation lies in case-by-case analysis. The final result, pᵢ(δᵢⱼ - pⱼ), reflects the competitive relationship among the output probabilities: raising one logit lowers every other probability. Combined with the cross-entropy loss, the derivation yields the exceptionally concise gradient pⱼ - yⱼ, which is a major reason why Softmax is the preferred choice for multi-class classification.