Application and Derivation of Cross-Entropy Loss Function in Classification Tasks

Description
The cross-entropy loss function is one of the core loss functions for classification tasks, used widely in models such as logistic regression and neural networks. It measures the difference between the model's predicted probability distribution and the true probability distribution. The clearest way to understand it is to start from the foundations of information theory and work step by step toward its concrete applications.

1. Conceptual Basis of Information Content and Entropy

  • Information Content: Measures the amount of information brought by the occurrence of an event. The smaller the probability of the event, the greater the information content. The formula is: I(x) = -log(P(x)), where P(x) is the probability of the event occurring.
  • Entropy: Measures the uncertainty of a probability distribution. The higher the entropy, the greater the uncertainty. For a discrete distribution: H(P) = -Σ P(x) * log(P(x)), representing the expected amount of information required to identify a sample according to the true distribution P.
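As a quick check of these definitions, here is a minimal Python sketch (the function names are ours; natural logarithms are used, matching the numeric examples later in this section):

```python
import numpy as np

def information(p):
    """Information content of an event with probability p: I(x) = -log(P(x))."""
    return -np.log(p)

def entropy(P):
    """Entropy of a discrete distribution P: H(P) = -sum P(x) * log(P(x))."""
    P = np.asarray(P, dtype=float)
    return -np.sum(P * np.log(P))

# A rarer event carries more information.
print(information(0.5))                    # ~0.693
print(information(0.01))                   # ~4.605

# A uniform distribution is more uncertain (higher entropy) than a peaked one.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # ~1.386
print(entropy([0.9, 0.05, 0.03, 0.02]))    # ~0.428
```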

2. Definition and Intuitive Understanding of Cross-Entropy

  • Cross-Entropy: H(P, Q) = -Σ P(x) * log(Q(x)), where P is the true distribution and Q is the predicted distribution.
  • Intuitive Understanding:
    • If Q is exactly the same as P, the cross-entropy equals the entropy (minimum value).
    • If Q differs from P, the cross-entropy will be greater than the entropy, and the extra part is called the KL divergence (relative entropy).
  • Significance in Classification Tasks: Treating the true label as a probability distribution (e.g., in classification problems, the probability of the true class is 1, others are 0), cross-entropy measures the difference between the model's predicted distribution and the true distribution.
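The relationship between entropy, cross-entropy, and KL divergence can be verified numerically; the sketch below assumes only the formulas above and uses illustrative distributions:

```python
import numpy as np

def entropy(P):
    return -np.sum(P * np.log(P))

def cross_entropy(P, Q):
    """H(P, Q) = -sum P(x) * log(Q(x)); P is the true distribution, Q the prediction."""
    return -np.sum(P * np.log(Q))

P = np.array([0.7, 0.2, 0.1])   # "true" distribution
Q = np.array([0.5, 0.3, 0.2])   # predicted distribution

print(entropy(P))                        # ~0.802 : H(P)
print(cross_entropy(P, Q))               # ~0.887 : H(P, Q) >= H(P)
print(cross_entropy(P, Q) - entropy(P))  # ~0.085 : the gap is the KL divergence
print(cross_entropy(P, P))               # ~0.802 : equals H(P) when Q = P (the minimum)
```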

3. Derivation of Cross-Entropy Loss Function in Binary Classification

  • True Label: y ∈ {0, 1}, where 1 denotes the positive class and 0 the negative class.
  • Model Prediction: ŷ = σ(z) (sigmoid function output, representing the predicted probability of the positive class).
  • Loss Function Derivation:
    • Substitute the true label and prediction into the cross-entropy formula:
      • If y=1, the ideal prediction is ŷ=1, and the loss is -log(ŷ).
      • If y=0, the ideal prediction is ŷ=0, and the loss is -log(1-ŷ).
    • Combined formula: L = -[y * log(ŷ) + (1-y) * log(1-ŷ)].
  • Example: If y=1, ŷ=0.8, the loss is -log(0.8)≈0.223; if ŷ=0.1, the loss is -log(0.1)≈2.302, indicating a greater penalty when the prediction is wrong.
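A minimal NumPy sketch of the binary cross-entropy loss (the function names are ours) reproduces these numbers:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)): predicted probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat):
    """L = -[y * log(y_hat) + (1 - y) * log(1 - y_hat)], with y in {0, 1}."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(1, 0.8))   # ~0.223 : confident and correct -> small loss
print(binary_cross_entropy(1, 0.1))   # ~2.303 : confident and wrong  -> large loss
print(binary_cross_entropy(0, 0.1))   # ~0.105 : correct prediction of the negative class
```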

4. Cross-Entropy Loss Function in Multi-class Classification (Softmax Cross-Entropy)

  • True Label: One-hot encoded, e.g., y = [0, 1, 0].
  • Model Prediction: Softmax output probability distribution, e.g., ŷ = [0.2, 0.7, 0.1].
  • Loss Function: L = -Σ y_i * log(ŷ_i). Since only one position is 1 in the one-hot vector, it effectively only requires calculating the negative logarithm of the predicted probability corresponding to the true class.
  • Example: If the true class is the 2nd class and the predicted probability is 0.7, the loss is -log(0.7)≈0.357; if the predicted probability is 0.1, the loss is -log(0.1)≈2.302.
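The multi-class case follows the same pattern; here is a short sketch (illustrative names, one-hot labels) that reproduces the example:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max logit before exponentiating."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def categorical_cross_entropy(y, y_hat):
    """L = -sum y_i * log(y_hat_i), with y one-hot encoded."""
    return -np.sum(y * np.log(y_hat))

z = np.array([1.0, 2.25, 0.3])               # illustrative logits
y_hat = softmax(z)                           # ~[0.20, 0.70, 0.10]
y = np.array([0, 1, 0])                      # true class: the 2nd class
print(categorical_cross_entropy(y, y_hat))   # ~0.357, i.e. -log(0.70)
```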

5. Comparison between Cross-Entropy and Mean Squared Error (MSE)

  • Issues with MSE: When MSE is paired with a sigmoid (or softmax) output, the loss is non-convex in the model parameters, and its gradient is damped in the saturation regions (predictions close to 0 or 1), so learning is slow precisely when the prediction is confidently wrong.
  • Advantages of Cross-Entropy:
    • The gradient with respect to the logit has a simple form (ŷ - y in binary classification), proportional to the error, so learning does not stall in saturated regions (see the sketch after this list).
    • Combined with a sigmoid or softmax output, it is convex in the parameters of a linear model (as in logistic regression), making optimization easier.
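The contrast in gradients is easy to see numerically. With a sigmoid output, the gradient of cross-entropy with respect to the logit z is ŷ - y, while the gradient of MSE (taken here as ½(ŷ - y)²) is (ŷ - y)·ŷ·(1 - ŷ), damped by the extra ŷ(1 - ŷ) factor in saturated regions. A small sketch under these assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y, z = 1.0, -5.0                  # true label 1, but a strongly negative logit
y_hat = sigmoid(z)                # ~0.0067: a confident, wrong prediction

grad_ce  = y_hat - y                          # d(cross-entropy)/dz
grad_mse = (y_hat - y) * y_hat * (1 - y_hat)  # d(0.5 * (y_hat - y)**2)/dz

print(grad_ce)    # ~ -0.993 : large gradient, the wrong prediction is corrected quickly
print(grad_mse)   # ~ -0.0066: tiny gradient, learning stalls in the saturated region
```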

6. Practical Considerations in Application

  • Numerical Stability: When computing log(ŷ), a prediction at or near 0 drives the logarithm toward negative infinity, producing infinities or NaNs in the loss. In practice, predictions are clipped to [ε, 1-ε] for a small ε (e.g., 1e-7).
  • Combination with Softmax: In neural networks, the Softmax layer and the cross-entropy loss are often fused into a single operation computed directly from the logits (via log-sum-exp), which both simplifies the gradient and improves numerical stability (see the sketch below).
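Both points can be sketched in a few lines of NumPy; the clipping value ε and the fused implementation below are illustrative choices, not any particular library's API:

```python
import numpy as np

def stable_binary_cross_entropy(y, y_hat, eps=1e-7):
    """Clip predictions to [eps, 1 - eps] so log() never receives exactly 0 or 1."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def softmax_cross_entropy_from_logits(y, z):
    """Fused softmax + cross-entropy computed from the logits z via log-sum-exp,
    avoiding an explicit softmax followed by log()."""
    z = z - np.max(z)                          # shift for numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))  # log(softmax(z))
    return -np.sum(y * log_probs)

print(stable_binary_cross_entropy(1, 0.0))                           # ~16.1 instead of inf
print(softmax_cross_entropy_from_logits(np.array([0, 1, 0]),
                                        np.array([2.0, 5.0, 1.0])))  # ~0.066
```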

The steps above trace the cross-entropy loss function from theory to practice. Its core idea is to quantify the gap between the predicted and true probability distributions, producing gradients that drive the model to approximate the true distribution quickly.