Activation Functions: Comparison and Selection of Sigmoid, Tanh, and ReLU
Topic Description
In neural networks, activation functions introduce nonlinearity into neurons, enabling the network to learn complex patterns. Common activation functions include Sigmoid, Tanh, and ReLU. This topic requires understanding their mathematical forms, advantages and disadvantages, and applicable scenarios, as well as mastering selection strategies.
1. Role of Activation Functions
- Core Function: Map a neuron's input signal to its output, and the mapping must be nonlinear (if only linear functions are used, a multi-layer network degenerates into an equivalent single-layer linear model; see the sketch after this list).
- Example: Assuming a neuron input is \(z = w_1x_1 + w_2x_2 + b\), the activation function is \(f(z)\), then the output is \(a = f(z)\).
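As a quick check on the "degenerates into a single layer" claim, here is a minimal NumPy sketch (layer sizes and variable names are illustrative, not from the original): two stacked linear layers without an activation reduce to a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

y_two_layers = W2 @ (W1 @ x + b1) + b2

# ...which is exactly the single linear layer W @ x + b
W, b = W2 @ W1, W2 @ b1 + b2
y_one_layer = W @ x + b

print(np.allclose(y_two_layers, y_one_layer))  # True: no extra expressive power gained
```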
2. Sigmoid Function
- Mathematical Form:
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]
- Characteristics:
- Output range is (0, 1), suitable for representing probability (e.g., output layer for binary classification).
- Derivative: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), maximum value is 0.25 (when \(z=0\)).
- Disadvantages:
- Vanishing Gradient: When \(|z|\) is large, the derivative approaches 0, causing gradients to decay exponentially during backpropagation (see the numerical check after this list).
- Non-zero Centered: Outputs are always positive, so the gradients of a layer's incoming weights all share the same sign, producing a zig-zag update path that slows convergence.
- Computational Cost: Involves exponential operations.
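A small numerical check of the derivative values quoted above (a NumPy sketch; the function names are illustrative): the gradient peaks at 0.25 at \(z = 0\) and is nearly zero once \(|z|\) is large.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma'(z) = sigma(z) * (1 - sigma(z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))          # values squashed into (0, 1)
print(sigmoid_grad(z))     # at most 0.25 (at z = 0); about 4.5e-5 at |z| = 10
```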
3. Tanh Function (Hyperbolic Tangent)
- Mathematical Form:
\[ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \]
- Characteristics:
- Output range is (-1, 1), a zero-centered function (mean of 0).
- Derivative: \(\tanh'(z) = 1 - \tanh^2(z)\), maximum value is 1 (when \(z=0\)).
- Comparison with Sigmoid:
- The vanishing gradient problem still exists, though it is milder than for Sigmoid since the maximum gradient is 1 rather than 0.25 (see the numerical comparison after this list).
- Zero-centered property enables faster convergence, making it more suitable for hidden layers.
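The gradient comparison can be checked with the short NumPy sketch below (function names illustrative): tanh's derivative peaks at 1 versus Sigmoid's 0.25, but both still shrink toward zero for large \(|z|\).

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2   # tanh'(z) = 1 - tanh^2(z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tanh_grad(z))     # peaks at 1.0 at z = 0, but only ~1.8e-4 at |z| = 5
print(sigmoid_grad(z))  # peaks at just 0.25 at z = 0
```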
4. ReLU Function (Rectified Linear Unit)
- Mathematical Form:
\[ \text{ReLU}(z) = \max(0, z) \]
- Advantages:
- Derivative is 1 when \(z > 0\), effectively alleviating the vanishing gradient problem.
- Simple computation (only needs to determine positive or negative).
- Disadvantages:
- Dying ReLU Problem: If a neuron's pre-activation stays negative for all inputs (e.g., after a large weight update), its output and gradient are both 0, so its weights stop updating and the neuron "dies" without recovering (illustrated in the sketch after this list).
- Output is non-zero centered.
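A minimal NumPy sketch of ReLU and its gradient (names illustrative), including the dying-ReLU situation where a neuron that only sees negative pre-activations receives no gradient at all:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # 1 for z > 0, 0 otherwise

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))                     # negative inputs clipped to 0, positive inputs passed through
print(relu_grad(z))                # gradient is exactly 1 for positive z -- no shrinking

# Dying ReLU: a neuron whose pre-activation is negative for every example
# outputs 0 and receives 0 gradient, so its weights never update again.
z_dead = np.array([-4.0, -1.2, -0.3])
print(relu_grad(z_dead))           # all zeros -> no learning signal for this neuron
```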
5. Improved ReLU Variants
- Leaky ReLU:
\[ f(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases} \quad (\alpha \text{ is a small positive number, e.g., } 0.01) \]
- Solves the Dying ReLU problem by providing a weak gradient in the negative region.
- Parametric ReLU (PReLU): Treats \(\alpha\) as a learnable parameter, adaptively adjusting the slope in the negative region (both variants are sketched below).
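A short NumPy sketch of Leaky ReLU (the names and the default \(\alpha = 0.01\) are illustrative); PReLU uses the same forward formula but treats \(\alpha\) as a parameter updated by backpropagation:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # small fixed slope alpha in the negative region instead of a hard 0
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)   # gradient never becomes exactly 0

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(z))       # negative inputs are scaled by alpha rather than clipped to 0
print(leaky_relu_grad(z))  # alpha = 0.01 in the negative region keeps the neuron trainable
```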
6. Selection Strategy and Practical Advice
- Output Layer:
- Binary classification problem: Sigmoid (output represents probability).
- Multi-class classification problem: Softmax (covered in a separate topic, not repeated here).
- Hidden Layers:
- Preferred choice is ReLU: computationally efficient and converges quickly (especially in deep networks).
- If concerned about Dying ReLU, consider Leaky ReLU or PReLU.
- Tanh can still be effective in certain scenarios (e.g., RNNs), but be mindful of vanishing gradients.
- Important Considerations:
- Avoid Sigmoid/Tanh as hidden-layer activations in deep networks (severe vanishing gradient problem).
- The ReLU family pairs best with He initialization, which keeps activation variance stable across layers during forward propagation (see the sketch below).
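The He-initialization point can be illustrated with the NumPy sketch below (the layer width, depth, and function names are assumptions for demonstration): drawing weights with standard deviation \(\sqrt{2/\text{fan\_in}}\) keeps the scale of ReLU activations roughly constant across many layers, whereas a smaller scale makes them shrink layer by layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(depth, std_fn, width=256):
    """Push a random input through `depth` ReLU layers whose weight std is std_fn(fan_in)."""
    h = rng.normal(size=(width, 1))
    for _ in range(depth):
        W = rng.normal(0.0, std_fn(width), size=(width, width))
        h = np.maximum(0.0, W @ h)
    return float(np.sqrt((h ** 2).mean()))   # RMS of the activations

# He initialization: std = sqrt(2 / fan_in) -> activation scale stays of order 1
print(forward(10, lambda fan_in: np.sqrt(2.0 / fan_in)))

# Smaller std = sqrt(1 / fan_in) -> scale shrinks by roughly 1/sqrt(2) per ReLU layer
print(forward(10, lambda fan_in: np.sqrt(1.0 / fan_in)))
```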
Summary
Selecting an activation function requires balancing nonlinear capability, gradient characteristics, and computational efficiency. In modern neural networks, ReLU and its variants are the default choice for hidden layers, while Sigmoid/Tanh are more often used for specific output layers or traditional models.