Loss Function of Logistic Regression and Gradient Descent Optimization

Description
Logistic regression is a widely used machine learning algorithm for classification, particularly binary classification. Its core idea is to map input features to a probability between 0 and 1 by passing a linear combination of the features through an activation function (typically the Sigmoid). Training the model requires a loss function that measures the discrepancy between predicted values and true labels; an optimization algorithm such as gradient descent then minimizes this loss to learn the model parameters. Understanding how the loss function is derived and optimized is key to mastering logistic regression.

Solution Process

  1. Logistic Regression Model and Sigmoid Function
    The hypothesis function of the logistic regression model is:
    \(h_{\theta}(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}\)
    where \(\theta\) is the parameter vector, \(x\) is the input feature vector, and \(\sigma(z)\) is the Sigmoid function. The Sigmoid function compresses the linear output \(z = \theta^T x\) to the interval (0,1), representing the probability that a sample belongs to the positive class:
    \(P(y=1|x; \theta) = h_{\theta}(x)\), \(P(y=0|x; \theta) = 1 - h_{\theta}(x)\).
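The hypothesis function above can be sketched in a few lines of NumPy (a minimal illustration; the function names `sigmoid` and `hypothesis` are ours, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Compress any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigma(theta^T x): probability that the sample is positive."""
    return sigmoid(np.dot(theta, x))
```

Note that `sigmoid(0.0)` is exactly 0.5: a zero linear score means the model is maximally uncertain about the class.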

  2. Derivation of the Loss Function: Cross-Entropy Loss
    For binary classification problems (label \(y \in \{0,1\}\)), we want the loss function to penalize the difference between the predicted probability and the true label. Derivation using maximum likelihood estimation (MLE):

    • Likelihood function: \(L(\theta) = \prod_{i=1}^{m} [h_{\theta}(x^{(i)})]^{y^{(i)}} [1 - h_{\theta}(x^{(i)})]^{1 - y^{(i)}}\)
    • Taking the negative log-likelihood (for easier minimization) yields the cross-entropy loss:
      \(J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]\)
      This loss is convex in \(\theta\), so gradient descent converges to the global minimum, and the penalty grows rapidly as the predicted probability moves away from the true label (e.g. predicting \(h_\theta(x) \to 0\) when \(y = 1\) drives the loss toward infinity).
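The cross-entropy loss above can be computed for a whole dataset in vectorized form. A minimal NumPy sketch (the small `eps` guard against `log(0)` is a standard numerical precaution, not part of the derivation):

```python
import numpy as np

def cross_entropy_loss(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ].

    X: (m, n) feature matrix; y: (m,) labels in {0, 1}.
    """
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x) for all m samples
    eps = 1e-12                            # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```

As a sanity check, with \(\theta = 0\) every prediction is 0.5, so the loss equals \(\log 2 \approx 0.693\) regardless of the labels.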
  3. Gradient Calculation
    Gradient descent requires calculating the partial derivative of the loss function with respect to the parameter \(\theta_j\). Key steps:

    • First, compute the derivative of the Sigmoid function: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\)
    • Derive the loss for a single sample:
      \(\frac{\partial}{\partial \theta_j} \left[ -y \log(h_{\theta}(x)) - (1-y) \log(1 - h_{\theta}(x)) \right] = (h_{\theta}(x) - y) x_j\)
    • Overall gradient (m samples):
      \(\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}\)
      The gradient takes the same concise form as linear regression's, except that \(h_{\theta}(x)\) is the Sigmoid of the linear score rather than the score itself.
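The overall gradient \(\frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\) vectorizes naturally: the matrix product `X.T @ (h - y)` sums the per-sample contributions for every \(\theta_j\) at once. A minimal sketch:

```python
import numpy as np

def gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij, for all j at once."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # predictions for all samples
    return (X.T @ (h - y)) / m             # (n,) gradient vector
```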
  4. Gradient Descent Optimization
    Use gradient descent to iteratively update the parameters (learning rate \(\alpha\)):
    \(\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}\)
    Repeat until convergence. In practice, stochastic gradient descent (SGD) or mini-batch gradient descent is often used to accelerate training.
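Putting the update rule into a loop gives batch gradient descent. The sketch below uses a fixed learning rate and iteration count for simplicity (in practice one would also monitor the loss for convergence):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Fit logistic regression by batch gradient descent on cross-entropy loss."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # current predictions
        grad = (X.T @ (h - y)) / m             # gradient of J(theta)
        theta -= alpha * grad                  # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta
```

On a small linearly separable dataset (with a constant-1 bias feature in the first column), the learned \(\theta\) puts every prediction on the correct side of 0.5.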

  5. Regularization to Prevent Overfitting
    To avoid overfitting, an L2 regularization term can be added to the loss function:
    \(J(\theta) = \text{original loss} + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2\)
    During the gradient update, each step gains an extra term \(\frac{\lambda}{m} \theta_j\), which shrinks the parameters toward zero and improves generalization. By convention, the bias term \(\theta_0\) is left unregularized.
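The regularized gradient can be sketched by adding the \(\frac{\lambda}{m}\theta_j\) term to the plain gradient, skipping the bias component, which is assumed here to sit at index 0:

```python
import numpy as np

def l2_regularized_gradient(theta, X, y, lam):
    """Gradient of cross-entropy loss plus (lam / (2m)) * sum_j theta_j^2."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = (X.T @ (h - y)) / m
    reg = (lam / m) * theta        # derivative of the L2 penalty
    reg[0] = 0.0                   # conventionally, do not shrink the bias theta_0
    return grad + reg
```

Setting `lam=0` recovers the unregularized gradient exactly; a positive `lam` only changes the non-bias components, by \(\frac{\lambda}{m}\theta_j\) each.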