Joint Gradient Derivation of Softmax Function and Cross-Entropy Loss
Problem Description
In multi-class classification tasks for neural networks, the Softmax function is often used in combination with cross-entropy loss. Interviews frequently require the derivation of the gradient for this combination, i.e., the partial derivative of the loss function L with respect to the model weights W, ∂L/∂W. Understanding this derivation process is crucial for mastering backpropagation.
Knowledge Explanation
- Scenario Setting and Symbol Definitions
- Problem: We have a K-class classification problem.
- Model: For an input sample x (feature vector), the model first calculates its "score" (logit) for each class k:
z_k = w_k^T x + b_k
Here, w_k is the weight vector for class k, and b_k is the bias term. All scores form the vector z = [z_1, z_2, ..., z_K]^T.
- Softmax Layer: Transforms the score vector z into a probability distribution vector ŷ.
ŷ_i = Softmax(z_i) = e^{z_i} / (∑_{j=1}^{K} e^{z_j})
Here, ŷ_i represents the predicted probability that sample x belongs to class i, and it satisfies ∑_{i=1}^{K} ŷ_i = 1.
- True Label: The true label y is usually represented by a one-hot vector. For example, if the true class is c, then y_c = 1, and y_j = 0 for all j ≠ c.
- Cross-Entropy Loss Function: Measures the difference between the predicted probability distribution ŷ and the true distribution y.
L = - ∑_{i=1}^{K} y_i log(ŷ_i)
Since y is a one-hot vector (only 1 at the true class c, 0 elsewhere), the loss can be simplified to:
L = - y_c log(ŷ_c) = - log(ŷ_c)
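As a concrete reference point, here is a minimal NumPy sketch of the forward pass just described, computing z, ŷ, and L for a single sample. (The shapes and the names W, b, x, y are illustrative choices, not fixed by the text above.)

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def cross_entropy(y_hat, y):
    # y is one-hot, so only the true-class term survives: L = -log(ŷ_c)
    return -np.sum(y * np.log(y_hat))

# Toy example: K = 3 classes, 4-dimensional input
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))    # row k is w_k
b = rng.normal(size=3)         # bias terms b_k
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])  # one-hot label, true class c = 1

z = W @ x + b                  # scores z_k = w_k^T x + b_k
y_hat = softmax(z)             # predicted probabilities ŷ
L = cross_entropy(y_hat, y)    # equals -log(y_hat[1])
```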
- Objective and Derivation Route
- Objective: Calculate the gradient of the loss L with respect to a specific weight vector w_k (or a specific score z_k), i.e., ∂L / ∂w_k (or ∂L / ∂z_k). This is the core step of backpropagation.
- Derivation Route: Apply the chain rule. To compute ∂L / ∂w_k, we decompose it as:
∂L / ∂w_k = (∂L / ∂z_k) * (∂z_k / ∂w_k)
Here, ∂z_k / ∂w_k = x (because z_k = w_k^T x + b_k). Therefore, the key is to find the "upstream" gradient ∂L / ∂z_k, the partial derivative of the loss L with respect to the score z_k. We will derive this for all k (from 1 to K).
- Key Step: Calculate ∂L / ∂z_k
- Application of the Chain Rule: The loss L is a function of ŷ, and each ŷ_j is a function of all z. Therefore, according to the multivariate chain rule:
∂L / ∂z_k = ∑_{j=1}^{K} (∂L / ∂ŷ_j) * (∂ŷ_j / ∂z_k)
This summation is necessary because changing one score z_k affects all predicted probabilities ŷ_j.
- Calculate the First Term: ∂L / ∂ŷ_j
From the loss function L = - ∑_{i=1}^{K} y_i log(ŷ_i), taking the derivative with respect to a specific ŷ_j:
∂L / ∂ŷ_j = - y_j / ŷ_j
Note that y_j is a constant (the true label).
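If desired, this partial derivative can be sanity-checked numerically. The sketch below (a hypothetical check, treating ŷ as a free variable) compares -y_j / ŷ_j against a central finite difference of the loss:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])      # one-hot label
y_hat = np.array([0.2, 0.7, 0.1])  # some valid probability vector

analytic = -y / y_hat              # ∂L/∂ŷ_j = -y_j / ŷ_j

eps = 1e-6
numeric = np.zeros_like(y_hat)
for j in range(len(y_hat)):
    plus, minus = y_hat.copy(), y_hat.copy()
    plus[j] += eps
    minus[j] -= eps
    # L(ŷ) = -∑ y_i log(ŷ_i)
    numeric[j] = (-np.sum(y * np.log(plus)) + np.sum(y * np.log(minus))) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # expected: True
```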
- Calculate the Second Term: ∂ŷ_j / ∂z_k (This is the most crucial and error-prone step.)
This is the partial derivative of the Softmax function. Two cases need to be discussed because the derivative of ŷ_j with respect to z_k yields different results for j = k and j ≠ k.
- Case 1: When j = k
ŷ_j = e^{z_j} / (∑_{m=1}^{K} e^{z_m})
Let S = ∑_{m=1}^{K} e^{z_m}. Find ∂ŷ_j / ∂z_k (here j = k).
This uses the quotient rule: ∂ŷ_j / ∂z_k = (e^{z_j} * S - e^{z_j} * e^{z_k}) / S^2
Since j = k, this simplifies to: (e^{z_j} * S - e^{z_j} * e^{z_j}) / S^2 = (e^{z_j} / S) * (1 - e^{z_j} / S) = ŷ_j (1 - ŷ_j)
- Case 2: When j ≠ k
ŷ_j = e^{z_j} / S
Now take the derivative with respect to z_k (k ≠ j). Here e^{z_j} is a constant, and S contains e^{z_k}.
∂ŷ_j / ∂z_k = (0 * S - e^{z_j} * e^{z_k}) / S^2 = - (e^{z_j} / S) * (e^{z_k} / S) = - ŷ_j ŷ_k
- Unified Representation:
∂ŷ_j / ∂z_k = ŷ_k (1 - ŷ_k), if j = k
∂ŷ_j / ∂z_k = -ŷ_j ŷ_k, if j ≠ k
This result can be concisely written as: ∂ŷ_j / ∂z_k = ŷ_j (δ_{jk} - ŷ_k), where δ_{jk} is the Kronecker delta function (1 when j = k, 0 otherwise).
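One way to convince yourself of this Jacobian formula is to build the matrix ŷ_j (δ_{jk} - ŷ_k), which equals diag(ŷ) - ŷ ŷ^T, and compare it against finite differences of the softmax itself. A minimal sketch (illustrative only):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([1.0, -0.5, 2.0])
y_hat = softmax(z)

# Analytic Jacobian: J[j, k] = ŷ_j (δ_jk - ŷ_k) = diag(ŷ) - ŷ ŷ^T
J_analytic = np.diag(y_hat) - np.outer(y_hat, y_hat)

# Finite-difference Jacobian: column k approximates ∂ŷ/∂z_k
eps = 1e-6
K = len(z)
J_numeric = np.zeros((K, K))
for k in range(K):
    dz = np.zeros(K)
    dz[k] = eps
    J_numeric[:, k] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-6))  # expected: True
```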
- Combine the Two Terms, Calculate the Final Gradient
Now substitute both terms into the sum:
∂L / ∂z_k = ∑_{j=1}^{K} (∂L / ∂ŷ_j) * (∂ŷ_j / ∂z_k) = ∑_{j=1}^{K} (- y_j / ŷ_j) * (∂ŷ_j / ∂z_k)
Substituting the unified expression ∂ŷ_j / ∂z_k = ŷ_j (δ_{jk} - ŷ_k) into the summation:
∂L / ∂z_k = ∑_{j=1}^{K} (- y_j / ŷ_j) * [ŷ_j (δ_{jk} - ŷ_k)]
Note that ŷ_j conveniently cancels out between the numerator and the denominator!
∂L / ∂z_k = - ∑_{j=1}^{K} y_j (δ_{jk} - ŷ_k)
Expanding the summation:
∂L / ∂z_k = - [ ∑_{j=1}^{K} y_j δ_{jk} - ∑_{j=1}^{K} y_j ŷ_k ]
- First term ∑_{j=1}^{K} y_j δ_{jk}: δ_{jk} = 1 only when j = k, so this term is simply y_k.
- Second term ∑_{j=1}^{K} y_j ŷ_k: ŷ_k is independent of j, so it can be factored out, becoming ŷ_k ∑_{j=1}^{K} y_j. The true label y is a one-hot vector, so the sum of all its elements is 1, i.e., ∑_{j=1}^{K} y_j = 1. Therefore, the second term is ŷ_k * 1 = ŷ_k.
Thus:
∂L / ∂z_k = - (y_k - ŷ_k) = ŷ_k - y_k
- The Final Elegant Result
The gradient of the loss L with respect to the k-th class score is:
∂L / ∂z_k = ŷ_k - y_k
This is an extremely concise and important result. It states that the gradient is simply the model's predicted probability minus the true one-hot label.
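This result is easy to verify numerically: compute ŷ - y analytically and compare it with a finite-difference gradient of L with respect to z. A minimal sketch (the numbers are arbitrary examples):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def loss(z, y):
    # Cross-entropy of softmax(z) against a one-hot label y
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 0.5, -1.0])
y = np.array([0.0, 1.0, 0.0])    # true class c = 1

grad_analytic = softmax(z) - y   # the result derived above: ŷ - y

eps = 1e-6
grad_numeric = np.zeros_like(z)
for k in range(len(z)):
    dz = np.zeros_like(z)
    dz[k] = eps
    grad_numeric[k] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # expected: True
```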
- Completing the Full Gradient Calculation
Now we return to the initial objective, calculating the gradient with respect to the weight w_k:
∂L / ∂w_k = (∂L / ∂z_k) * (∂z_k / ∂w_k) = (ŷ_k - y_k) * x
Similarly, for the bias term b_k:
∂L / ∂b_k = ŷ_k - y_k
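Putting all the pieces together, a minimal single-sample sketch of the forward and backward pass might look like the following. In vectorized form the weight gradient is the outer product (ŷ - y) x^T; the names W, b, x, y and the update comment are illustrative assumptions, not from the text above.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

# Toy setup: K = 3 classes, 4-dimensional input
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # row k is w_k
b = rng.normal(size=3)
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])    # one-hot label

# Forward pass
z = W @ x + b
y_hat = softmax(z)
L = -np.sum(y * np.log(y_hat))

# Backward pass, using the result ∂L/∂z = ŷ - y
dz = y_hat - y                   # shape (K,)
dW = np.outer(dz, x)             # ∂L/∂w_k = (ŷ_k - y_k) * x, stacked row-wise
db = dz                          # ∂L/∂b_k = ŷ_k - y_k

# A gradient-descent step would then be, e.g.:
# W -= learning_rate * dW; b -= learning_rate * db
```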
Summary
The key to this derivation process lies in proficiently applying the chain rule and the derivative of the Softmax function. The final gradient formula ∂L / ∂z_k = ŷ_k - y_k is remarkably simple in form, making its computation in backpropagation very efficient. This result also intuitively tells us that when the prediction ŷ_k is far from the true value y_k, the gradient is large, requiring a significant update to the model parameters; when the prediction is close to the truth, the gradient becomes smaller, reducing the magnitude of the update.