Training Instability Issues and Improvement Methods for Generative Adversarial Networks (GANs)
Problem Description:
Generative Adversarial Networks (GANs) often exhibit instability during training, manifesting as mode collapse, vanishing gradients, or exploding gradients, frequently because one network overpowers the other. How can these problems be addressed?
Background Knowledge:
GANs consist of a Generator and a Discriminator. The Discriminator tries to distinguish real data from generated data, while the Generator tries to fool the Discriminator. Ideally, adversarial training drives them to a Nash equilibrium. In practice, however, the two objectives rarely stay balanced, and the following issues commonly occur:
- Discriminator too strong: Generator gradients vanish, preventing learning.
- Generator too strong: the Discriminator's feedback becomes uninformative, and the Generator collapses to a few modes of the data distribution (mode collapse).
- Unstable gradients: Loss function oscillates, making convergence difficult.
Solution Ideas and Steps:
1. Improving the Loss Function
The original GAN objective is, at the optimal Discriminator, equivalent to minimizing the Jensen-Shannon (JS) divergence. When the real and generated data distributions have little or no overlap, the JS divergence saturates at a constant (log 2), so the Generator's gradient vanishes.
- Solution: Use Wasserstein distance (WGAN) instead of JS divergence.
- Principle: Wasserstein distance provides effective gradients even when distributions do not overlap.
- Implementation: Replace the Discriminator with a Critic by removing the final Sigmoid layer (the Critic outputs an unbounded score rather than a probability), and train the Critic to maximize
\[ L = \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] \]
while the Generator maximizes \(\mathbb{E}_{z \sim p_z}[D(G(z))]\).
- Constraint: The Critic must be 1-Lipschitz (the norm of its gradient with respect to the input must not exceed 1), enforced via Weight Clipping (original WGAN) or a Gradient Penalty (WGAN-GP).
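The Critic and Generator objectives above can be sketched in a few lines of numpy. This is a minimal sketch with a made-up linear critic and random batches standing in for real and generated data; in a real model the critic is a neural network and these losses are backpropagated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear critic D(x) = w . x (note: no final sigmoid).
w = np.array([0.5, -0.25])

def critic(x):
    """The Critic outputs an unbounded score, not a probability."""
    return x @ w

x_real = rng.normal(loc=1.0, size=(64, 2))   # stand-in batch of real samples
x_fake = rng.normal(loc=-1.0, size=(64, 2))  # stand-in batch of generated samples

# The Critic is trained to MAXIMIZE E[D(x)] - E[D(G(z))],
# i.e. to minimize the negative of that quantity.
critic_objective = critic(x_real).mean() - critic(x_fake).mean()
critic_loss = -critic_objective

# The Generator is trained to MAXIMIZE E[D(G(z))],
# i.e. to minimize -E[D(G(z))].
gen_loss = -critic(x_fake).mean()
```

Because the scores are unbounded, these losses keep providing useful gradients even when the two distributions do not overlap, which is the point of the Wasserstein formulation.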
2. Adding Gradient Penalty (WGAN-GP)
Weight clipping can lead to unstable gradients or wasted capacity. Gradient penalty directly constrains the Critic's gradient norm:
- Steps:
- Randomly sample interpolation points \(\hat{x} = \epsilon x + (1 - \epsilon) G(z)\), with \(\epsilon \sim U[0, 1]\), on the line segments connecting real and generated samples.
- Compute the norm \(\|\nabla_{\hat{x}} D(\hat{x})\|_2\) of the Critic's gradient with respect to \(\hat{x}\).
- Add the penalty term \(\lambda (\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\) to the Critic's loss, forcing the gradient norm to stay near 1 (the WGAN-GP paper uses \(\lambda = 10\)).
- Advantage: More stable training and helps avoid mode collapse.
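The interpolation-and-penalty steps above can be sketched in numpy. To keep the sketch self-contained it assumes a toy linear critic \(D(x) = w \cdot x\), whose input gradient is simply \(w\) and can be written down analytically; in a real model an autograd framework computes the gradient of the Critic at \(\hat{x}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic D(x) = w . x; its gradient w.r.t. the input is
# just w, so the penalty can be computed by hand for illustration.
w = np.array([0.6, 0.8])   # ||w||_2 = 1.0, i.e. exactly 1-Lipschitz
lam = 10.0                 # penalty coefficient lambda

x_real = rng.normal(size=(8, 2))
x_fake = rng.normal(size=(8, 2))

# 1. Sample interpolation points on the lines between real and fake.
eps = rng.uniform(size=(8, 1))
x_hat = eps * x_real + (1 - eps) * x_fake

# 2. Gradient of D at x_hat; for D(x) = w . x it equals w everywhere.
grads = np.tile(w, (x_hat.shape[0], 1))
grad_norms = np.linalg.norm(grads, axis=1)

# 3. Penalty pushes each sample's gradient norm toward 1.
penalty = lam * np.mean((grad_norms - 1.0) ** 2)
# Here ||w|| = 1, so the penalty is (numerically) zero: this critic
# already satisfies the Lipschitz constraint and is not punished.
```

A critic with, say, \(\|w\| = 2\) would instead receive a penalty of \(\lambda (2 - 1)^2 = 10\), which is what steers training back toward the constraint.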
3. Using More Stable Network Architectures
- Deep Convolutional GAN (DCGAN):
- Replace fully connected layers with convolutional layers.
- Generator uses transposed convolutions for upsampling, Discriminator uses strided convolutions for downsampling.
- Use Batch Normalization to stabilize training (except in the Generator's output layer and the Discriminator's input layer).
- Generator uses ReLU activations, output layer uses Tanh; Discriminator uses LeakyReLU.
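The up/downsampling in these architectures follows simple shape arithmetic. The pure-Python sketch below checks a typical DCGAN-style pipeline (4×4 feature map to 64×64 image and back) using the standard output-size formulas for transposed and strided convolutions; the kernel/stride/padding values (4, 2, 1) are the common DCGAN block settings, assumed here for illustration.

```python
def tconv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of a transposed convolution
    (the usual DCGAN Generator upsampling block)."""
    return (size - 1) * stride - 2 * pad + kernel

def conv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of a strided convolution
    (the usual DCGAN Discriminator downsampling block)."""
    return (size - kernel + 2 * pad) // stride + 1

# Generator: a 4x4 feature map upsampled to a 64x64 image in 4 blocks.
g_size = 4
for _ in range(4):
    g_size = tconv_out(g_size)   # 4 -> 8 -> 16 -> 32 -> 64

# Discriminator mirrors it: 64x64 image back down to a 4x4 map.
d_size = 64
for _ in range(4):
    d_size = conv_out(d_size)    # 64 -> 32 -> 16 -> 8 -> 4

print(g_size, d_size)  # 64 4
```

Each block doubles (or halves) the spatial size, which is why the two networks are usually built as mirror images of each other.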
4. Improving Optimization Strategies
- Alternating training frequency: balance the two networks (e.g., train the Generator twice per Discriminator update if the Discriminator dominates; WGAN instead trains the Critic several times per Generator update).
- Using different optimizers: vanilla GANs often use Adam, but the original WGAN recommends RMSProp, because momentum can interfere with the weight-clipping constraint; WGAN-GP works well with Adam (\(\beta_1 = 0\), \(\beta_2 = 0.9\)).
- One-sided label smoothing: replace the Discriminator's real label 1.0 with 0.9 to curb overconfidence.
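Label smoothing changes only the target in the Discriminator's loss. The numpy sketch below shows the effect on binary cross-entropy; the Discriminator outputs are made-up values for illustration.

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy for Discriminator outputs in (0, 1)."""
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# Hypothetical Discriminator outputs on a batch of real samples.
d_real = np.array([0.95, 0.90, 0.99])

hard_loss = bce(d_real, np.full(3, 1.0))    # hard targets: 1.0
smooth_loss = bce(d_real, np.full(3, 0.9))  # one-sided smoothing: 0.9

# With smoothed targets the loss is minimized at D(x) = 0.9, so a
# Discriminator pushing its outputs toward 1.0 is penalized rather
# than rewarded, which discourages overconfident predictions.
```

Note the smoothing is one-sided: fake labels stay at 0, since smoothing them would reward the Generator for reproducing Discriminator artifacts.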
5. Specialized Handling for Mode Collapse
- Mini-batch discrimination: Allows the Discriminator to compare diversity within a batch; penalizes the Generator if generated samples are too similar.
- Unrolled GAN: The Generator optimizes against the Discriminator's anticipated state after several future update steps, preventing it from exploiting the current Discriminator's short-term weaknesses.
- Diversity loss: Add terms encouraging diversity to the Generator's loss (e.g., feature matching loss).
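Of these, feature matching is the simplest to write down: the Generator matches batch statistics of intermediate Discriminator features instead of fooling the Discriminator sample by sample. A minimal numpy sketch, with random arrays standing in for the activations of some chosen Discriminator layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical intermediate Discriminator features for a batch of
# real and a batch of generated samples (in practice, activations
# taken from a chosen Discriminator layer).
f_real = rng.normal(loc=0.5, size=(64, 128))
f_fake = rng.normal(loc=0.0, size=(64, 128))

# Feature matching loss: squared distance between the mean feature
# activations of the two batches. Matching batch-level statistics
# rewards covering the data distribution, so collapsing to a single
# mode cannot drive this loss to zero.
fm_loss = np.sum((f_real.mean(axis=0) - f_fake.mean(axis=0)) ** 2)
```

This term is typically added to (or substituted for) the Generator's adversarial loss.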
Summary:
Addressing GAN training instability requires a comprehensive approach:
- Replace the original loss with WGAN-GP loss.
- Adopt DCGAN architecture.
- Control training pace and optimizer selection.
- Add diversity constraints specifically for mode collapse.
These methods significantly improve GAN convergence and generation quality.