Differentiation and Gradient Calculation of the Softmax Function
Description:
The Softmax function is a core function in machine learning and deep learning for multi-class classification problems. It maps a K-dimensional vector of arbitrary real numbers (often called "logits") into a K-dimensional probability distribution, where each element's value lies in the interval (0,1), and the sum of all elements is 1. Understanding how to differentiate the Softmax function, i.e., how to calculate the gradient of its output with respect to its input, is crucial for training neural networks with gradient descent, as it is a key step in the backpropagation algorithm.
Detailed Explanation:
We will explain the definition of the Softmax function and then derive its gradient step by step.
Step 1: Understanding the Definition of the Softmax Function
Assume we have a classification problem with K categories. For an input sample, the final layer of the neural network outputs a K-dimensional vector z = [z1, z2, ..., zK], where zj is the score (logit) for the j-th category.
The Softmax function transforms this score vector into a probability distribution vector a = [a1, a2, ..., aK]. Each element aj in vector a (representing the probability that the sample belongs to the j-th class) is calculated as follows:
aj = e^{zj} / (e^{z1} + e^{z2} + ... + e^{zK})
To simplify notation, let S = Σ_{k=1}^{K} e^{zk} denote the sum in the denominator. The formula can then be written as:
aj = e^{zj} / S
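As a concrete illustration of this definition, here is a minimal NumPy sketch. The function name softmax and the subtraction of max(z) (a standard trick for numerical stability that cancels out in the ratio) are implementation choices, not part of the formula itself:

```python
import numpy as np

def softmax(z):
    """Map a K-dimensional logit vector z to a probability vector a."""
    e = np.exp(z - np.max(z))  # e^{zk}, shifted by max(z) to avoid overflow
    return e / e.sum()         # aj = e^{zj} / S

z = np.array([2.0, 1.0, 0.1])  # arbitrary example logits
a = softmax(z)
print(a, a.sum())              # each aj lies in (0, 1) and they sum to 1
```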
Step 2: Clarifying the Objective of the Derivation
In backpropagation, we need to compute the gradient of the loss function L with respect to the parameters of each layer of the network (including the input z to the Softmax layer). Specifically for the Softmax layer, we need to compute the partial derivative of each output probability ai with respect to each input score zj, i.e., ∂ai / ∂zj.
The key point here is that ai is a function of all elements in the vector z, not just zi. Therefore, when we take the derivative with respect to a specific zj, two cases arise:
- When i = j: We compute the partial derivative of ai with respect to zi (∂ai / ∂zi).
- When i ≠ j: We compute the partial derivative of ai with respect to zj (∂ai / ∂zj).
Step 3: Derivation for Case One (When i = j)
We want to compute ∂ai / ∂zi.
Given ai = e^{zi} / S, and S = Σ_{k=1}^{K} e^{zk}. Here, e^{zi} is the numerator, S is the denominator, and S also contains the term e^{zi}. Therefore, we need to use the quotient rule: (u/v)' = (u'v - uv') / v².
- Let u = e^{zi}, then u' = ∂u/∂zi = e^{zi}.
- Let v = S, then v' = ∂S/∂zi = ∂(e^{z1} + ... + e^{zi} + ... + e^{zK})/∂zi = e^{zi}, because when differentiating with respect to zi, only the derivative of the term e^{zi} is non-zero.
Substituting into the quotient rule:
∂ai / ∂zi = (u'v - uv') / v² = (e^{zi} * S - e^{zi} * e^{zi}) / S²
Factor out e^{zi}:
= e^{zi} (S - e^{zi}) / S²
Split S² into S · S and rewrite the expression as a product of two factors:
= (e^{zi} / S) * ((S - e^{zi}) / S)
By definition ai = e^{zi} / S, and (S - e^{zi}) / S = 1 - e^{zi}/S = 1 - ai.
Therefore, we obtain the final result:
∂ai / ∂zi = ai (1 - ai)
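This result is easy to sanity-check numerically. The sketch below (with arbitrary example logits) compares the formula ai (1 - ai) against a central finite-difference estimate of ∂ai / ∂zi:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
i, eps = 0, 1e-6

# Central finite difference: perturb zi and observe the change in ai
z_plus, z_minus = z.copy(), z.copy()
z_plus[i] += eps
z_minus[i] -= eps
numeric = (softmax(z_plus)[i] - softmax(z_minus)[i]) / (2 * eps)

analytic = a[i] * (1 - a[i])   # ∂ai / ∂zi = ai (1 - ai)
print(numeric, analytic)       # the two values agree closely
```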
Step 4: Derivation for Case Two (When i ≠ j)
We want to compute ∂ai / ∂zj.
Given ai = e^{zi} / S. Now, we differentiate with respect to zj, where j ≠ i.
- The numerator u = e^{zi} does not contain zj, so it is treated as a constant: u' = ∂(e^{zi})/∂zj = 0.
- The denominator v = S contains e^{zj}, so v' = ∂S/∂zj = e^{zj}.
Again using the quotient rule:
∂ai / ∂zj = (u'v - uv') / v² = (0 * S - e^{zi} * e^{zj}) / S² = (- e^{zi} e^{zj}) / S²
Regrouping into a product of two factors:
= - (e^{zi} / S) * (e^{zj} / S)
By definition ai = e^{zi} / S and aj = e^{zj} / S.
Therefore, we obtain the final result:
∂ai / ∂zj = -ai * aj
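The same kind of finite-difference check works for the off-diagonal case: this time we perturb zj and observe the change in ai (again with arbitrary example logits):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
i, j, eps = 0, 1, 1e-6         # i ≠ j

z_plus, z_minus = z.copy(), z.copy()
z_plus[j] += eps
z_minus[j] -= eps
numeric = (softmax(z_plus)[i] - softmax(z_minus)[i]) / (2 * eps)

analytic = -a[i] * a[j]        # ∂ai / ∂zj = -ai * aj
print(numeric, analytic)       # again the two values agree closely
```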
Step 5: Summarizing the Gradient Formula
Now, we can write the gradient of the Softmax function as a complete, concise formula:
∂ai / ∂zj = ai (δij - aj)
Where δij is the Kronecker delta function, defined as:
δij = 1 when i = j, and δij = 0 when i ≠ j.
This formula perfectly summarizes the two cases derived in Steps 3 and 4:
- When i = j, δij = 1, and the formula becomes ai (1 - aj), i.e., ai (1 - ai).
- When i ≠ j, δij = 0, and the formula becomes ai (0 - aj) = -ai aj.
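In matrix form, this says that the K×K Jacobian of the Softmax function is diag(a) minus the outer product of a with itself, since the (i, j) entry of that matrix is exactly ai (δij - aj). Here is a minimal sketch that builds this Jacobian and compares it against a finite-difference approximation (the helper names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(a):
    """J[i, j] = ∂ai / ∂zj = ai (δij - aj), built from the output a."""
    return np.diag(a) - np.outer(a, a)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
J = softmax_jacobian(a)

# Finite-difference Jacobian for comparison: column j perturbs zj
eps = 1e-6
J_num = np.zeros((len(z), len(z)))
for j in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    J_num[:, j] = (softmax(z_plus) - softmax(z_minus)) / (2 * eps)

print(np.max(np.abs(J - J_num)))   # largest discrepancy is negligible
```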
Step 6: Combining with Cross-Entropy Loss (Practical Application)
In the vast majority of classification tasks, the Softmax function is used together with the cross-entropy loss function. Assume the true label is one-hot encoded: the entry for the true class t is 1, and all other entries are 0. The cross-entropy loss is:
L = - Σ_{k=1}^{K} tk * log(ak)
Since only the entry for the true class is non-zero (tt = 1) and all other tk are 0, the loss simplifies to:
L = - log(at)
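As a quick illustration of this simplification, the following sketch (with arbitrary example logits and an arbitrary choice of true class) shows that the full sum and the single -log(at) term give the same value:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)

t_onehot = np.array([0.0, 1.0, 0.0])       # true class t = 1 (illustrative)
loss_full = -np.sum(t_onehot * np.log(a))  # L = -Σ tk * log(ak)
loss_simple = -np.log(a[1])                # L = -log(at)
print(loss_full, loss_simple)              # identical values
```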
Now, let's compute the gradient of the loss L with respect to the Softmax input zj. This is the gradient actually needed for backpropagation.
∂L / ∂zj = ∂(-log(at)) / ∂zj = - (1/at) * (∂at / ∂zj)
Substituting the Softmax gradient formula we just derived, ∂at / ∂zj = at (δtj - aj), into the above:
∂L / ∂zj = - (1/at) * [at (δtj - aj)] = - (δtj - aj) = aj - δtj
This result is remarkably simple and elegant:
- If j is the true class t (i.e., j = t), then δtj = 1, and the gradient is aj - 1.
- If j is not the true class t (i.e., j ≠ t), then δtj = 0, and the gradient is aj - 0 = aj.
This means that when calculating the gradient of the loss function with respect to the Softmax input, we do not need to trace back through the complex internal derivation of Softmax step by step. We only need to subtract the true label distribution t (one-hot vector) from the model's predicted probability distribution a:
∂L / ∂z = a - t
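As a final sanity check, the sketch below compares a - t against a finite-difference gradient of the combined Softmax + cross-entropy loss (the example logits and one-hot label are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy_loss(z, t_onehot):
    return -np.sum(t_onehot * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
t_onehot = np.array([0.0, 1.0, 0.0])

grad_analytic = softmax(z) - t_onehot      # ∂L/∂z = a - t

# Finite-difference gradient of the loss, one component at a time
eps = 1e-6
grad_numeric = np.zeros_like(z)
for j in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    grad_numeric[j] = (cross_entropy_loss(z_plus, t_onehot)
                       - cross_entropy_loss(z_minus, t_onehot)) / (2 * eps)

print(grad_analytic)
print(grad_numeric)   # matches a - t up to finite-difference error
```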
This concise gradient form is a great convenience brought by using Softmax together with cross-entropy loss and is one of the main reasons for its popularity in multi-class classification problems.