Multi-class Classification in Logistic Regression: A Detailed Explanation of Softmax Regression

Description
Logistic Regression is inherently designed for binary classification problems. However, real-world classification tasks often involve multiple categories (e.g., handwritten digit recognition, object classification). Softmax Regression (also known as Multinomial Logistic Regression) is an extension of logistic regression to multi-class problems. It uses the Softmax function to transform multiple linear outputs into a probability distribution, thereby enabling classification into multiple categories.

Solution Process

1. Mathematical Formulation of Multi-class Classification
Assume there are \(K\) classes (\(K \geq 3\)). For each sample, the feature vector is \(\mathbf{x} \in \mathbb{R}^n\), and the label \(y\) takes a value from \(\{1, 2, ..., K\}\). Softmax Regression learns a parameter vector \(\mathbf{w}_k \in \mathbb{R}^n\) for each class \(k\) and computes a score for the sample belonging to each class:

\[z_k = \mathbf{w}_k^\top \mathbf{x} \quad (\text{Note: The bias term is usually omitted or already incorporated into } \mathbf{w}_k) \]
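
Because there is one parameter vector per class, all \(K\) scores can be computed at once as a single matrix-vector product. A minimal NumPy sketch (the dimensions and random parameters below are illustrative assumptions, not values from the text):

```python
import numpy as np

# Illustrative sizes: K = 3 classes, n = 4 features (assumptions for this sketch).
K, n = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(K, n))   # row k holds the parameter vector w_k
x = rng.normal(size=n)        # one feature vector

z = W @ x                     # z[k] = w_k^T x, one raw score per class
print(z.shape)                # (3,)
```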

2. Softmax Function: Converting Scores to Probabilities
The Softmax function maps the \(K\) scores \(z_1, z_2, ..., z_K\) to a probability distribution:

\[P(y=k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} = \frac{e^{\mathbf{w}_k^\top \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}}} \]

Properties:

  • The sum of all probabilities is 1 (\(\sum_{k=1}^{K} P(y=k \mid \mathbf{x}) = 1\)).
  • Each probability is strictly positive and depends only on the relative magnitudes of the scores (the class with the higher score receives the higher probability).
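
A small Python sketch of the Softmax function. Subtracting the maximum score first is a standard numerical-stability trick: shifting all scores by the same constant leaves the output unchanged because the common factor cancels, but it prevents overflow for large scores.

```python
import numpy as np

def softmax(z):
    """Map a score vector z to a probability distribution.

    Subtracting max(z) leaves the result unchanged (the common factor
    cancels) but prevents overflow when the scores are large.
    """
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])   # example scores z_1, z_2, z_3
p = softmax(z)
print(p)         # approximately [0.659, 0.242, 0.099]
print(p.sum())   # 1.0
```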

3. Loss Function: Cross-Entropy Loss
For the true label \(y=i\), we want the model's predicted probability \(P(y=i \mid \mathbf{x})\) to be as close to 1 as possible. Cross-entropy loss measures the discrepancy between the predicted probability and the true distribution:

\[L(\mathbf{W}) = -\sum_{k=1}^{K} \mathbb{I}(y=k) \log P(y=k \mid \mathbf{x}) \]

where \(\mathbb{I}(y=k)\) is the indicator function (1 if \(y=k\), 0 otherwise). Since only the term for the true class \(i\) is nonzero, the loss reduces to:

\[L(\mathbf{W}) = -\log P(y=i \mid \mathbf{x}) \]

Example: If the true class is \(i=2\) and the model predicts probabilities \([0.1, 0.7, 0.2]\), the loss is \(-\log(0.7) \approx 0.357\).
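
A short sketch that reproduces this computation (note the 0-based index: class \(i=2\) in the text corresponds to index 1 in the array):

```python
import numpy as np

def cross_entropy(probs, true_class):
    """Cross-entropy loss for one sample: -log P(y = true_class | x)."""
    return -np.log(probs[true_class])

# The example from the text: true class i = 2 (index 1 with 0-based indexing),
# predicted probabilities [0.1, 0.7, 0.2].
probs = np.array([0.1, 0.7, 0.2])
print(round(cross_entropy(probs, true_class=1), 3))   # 0.357
```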

4. Optimization via Gradient Descent
The goal is to minimize the sum of losses over all training samples. For a single sample, the gradient with respect to the parameter vector \(\mathbf{w}_k\) is (derivation omitted):

\[\frac{\partial L}{\partial \mathbf{w}_k} = \left( P(y=k \mid \mathbf{x}) - \mathbb{I}(y=k) \right) \mathbf{x} \]

Physical Interpretation:

  • If the sample belongs to class \(k\) (i.e., \(y=k\)), the gradient is \((P(y=k \mid \mathbf{x}) - 1)\mathbf{x}\). Since \(P(y=k \mid \mathbf{x}) - 1 \leq 0\), the gradient-descent step moves \(\mathbf{w}_k\) toward \(\mathbf{x}\), which increases \(P(y=k \mid \mathbf{x})\).
  • If the sample does not belong to class \(k\) (i.e., \(y \neq k\)), the gradient is \(P(y=k \mid \mathbf{x})\,\mathbf{x}\). The step moves \(\mathbf{w}_k\) away from \(\mathbf{x}\), which decreases \(P(y=k \mid \mathbf{x})\).

Parameters are updated iteratively (\(\mathbf{w}_k \leftarrow \mathbf{w}_k - \eta \frac{\partial L}{\partial \mathbf{w}_k}\), where \(\eta\) is the learning rate) until the loss converges; a minimal sketch of one such update is given below.
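
A minimal sketch of one stochastic gradient step on a single sample, directly implementing the gradient formula above (the toy sizes, number of iterations, and learning rate are assumptions for illustration):

```python
import numpy as np

def grad_step(W, x, y, lr=0.1):
    """One gradient-descent step of softmax regression on a single sample (x, y).

    W has one row per class; y is the 0-based index of the true class.
    The gradient w.r.t. row k is (P(y=k|x) - 1[y=k]) * x, as in the text.
    """
    z = W @ x
    z = z - z.max()                   # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()   # softmax probabilities
    one_hot = np.zeros_like(p)
    one_hot[y] = 1.0
    grad = np.outer(p - one_hot, x)   # shape (K, n); row k is dL/dw_k
    return W - lr * grad

# Toy run (illustrative numbers): repeated updates push P(y | x) toward 1.
rng = np.random.default_rng(0)
W = np.zeros((3, 4))
x = rng.normal(size=4)
y = 2
for _ in range(100):
    W = grad_step(W, x, y)
z = W @ x
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
print(p[y])   # close to 1 after training on this single sample
```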

5. Relationship to Binary Logistic Regression
When \(K=2\), Softmax Regression is equivalent to binary logistic regression: because the Softmax output is unchanged when a common vector is subtracted from all \(\mathbf{w}_k\), one of the two parameter vectors can be fixed at zero, and the remaining probability reduces to the Sigmoid \(\sigma(\mathbf{w}^\top \mathbf{x}) = 1/(1 + e^{-\mathbf{w}^\top \mathbf{x}})\). In practice, binary classification problems therefore use the Sigmoid function directly, avoiding the redundant second parameter vector.
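
A quick numerical check of this equivalence, assuming the second parameter vector is fixed at zero (the specific vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=4)   # parameter vector of class 1; class 2's vector is fixed at 0
x = rng.normal(size=4)

z1, z2 = w1 @ x, 0.0
p_softmax = np.exp(z1) / (np.exp(z1) + np.exp(z2))   # Softmax probability of class 1
p_sigmoid = 1.0 / (1.0 + np.exp(-(w1 @ x)))          # Sigmoid of w_1^T x
print(np.isclose(p_softmax, p_sigmoid))              # True
```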

Summary
Softmax Regression extends the probability output mechanism of logistic regression to effectively solve multi-class classification problems using cross-entropy loss and gradient descent. Its core lies in the Softmax function, which converts scores to probabilities, and the gradient updates that reinforce the correct class and suppress incorrect ones.