Receptive Field Calculation and Significance in Convolutional Neural Networks
1. Definition and Importance of Receptive Field
The receptive field is the region of the input image that influences a single element of an output feature map in a Convolutional Neural Network. For example, if a feature point is computed from a 5×5 area of the input image, its receptive field is 5×5. The size of the receptive field determines how much context the network can capture: a larger receptive field helps in recognizing large objects, while a smaller one is better suited to fine-grained features.
2. Methods for Calculating Receptive Field
Step 1: Basic Formula
The receptive field is computed recursively, one layer at a time (this is unrelated to gradient backpropagation). Let:
- \(l\) be the index of the current layer (input layer is \(l=0\)).
- \(RF_l\) be the receptive field size of layer \(l\).
- \(k_l\) be the kernel size of layer \(l\).
- \(s_l\) be the stride of layer \(l\).
With \(RF_0 = 1\) (a single input pixel), the forward recurrence is:
\[RF_l = RF_{l-1} + (k_l - 1) \times \prod_{i=1}^{l-1} s_i \]
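The same total can also be accumulated backward, from the last layer to the input. Start with a value of 1 at the output (the point itself) and add each layer's contribution in turn:
\[RF_{l-1} = RF_l + (k_l - 1) \times J_{l-1} \]
where \(J_{l-1} = \prod_{i=1}^{l-1} s_i\) is the cumulative stride of all layers before layer \(l\) (so \(J_0 = 1\)). In this backward form the subscripts run in the opposite direction: the intermediate values are running totals measured in input pixels, and only the final \(RF_0\) is the receptive field on the input image.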
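The forward recurrence can be sketched in a few lines of Python. This is a minimal illustration, assuming each layer is described only by a `(kernel_size, stride)` pair and ignoring padding and dilation; the function name `receptive_field` is hypothetical and not taken from any library.

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf = 1    # RF_0: a single input pixel
    jump = 1  # cumulative stride J of the layers seen so far
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # RF_l = RF_{l-1} + (k_l - 1) * J_{l-1}
        jump *= stride                  # J_l = J_{l-1} * s_l
    return rf


print(receptive_field([(3, 1), (3, 1)]))  # two stacked 3x3 convolutions -> 5
```

Updating `jump` after the receptive field reflects that a layer's own stride only affects the layers that follow it.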
Step 2: Calculation Example
Assume a simple network structure as follows:
- Input image: \(224\times224\)
- Convolution layer 1: \(k_1=3, s_1=1\)
- Pooling layer 1: \(k_2=2, s_2=2\)
- Convolution layer 2: \(k_3=3, s_3=1\)
Backward derivation starting from the last layer:
- Convolution layer 2 (layer 3): Initial receptive field \(RF_3 = 1\) (the point itself).
- Derive to the previous layer (Pooling layer 1):
\[ RF_2 = RF_3 + (k_3 - 1) \times J_2 \]
where \(J_2 = s_1 \times s_2 = 1 \times 2 = 2\), substituting:
\[ RF_2 = 1 + (3-1) \times 2 = 5 \]
- Pooling layer 1 (layer 2): Continue deriving to Convolution layer 1:
\[ RF_1 = RF_2 + (k_2 - 1) \times J_1 \]
where \(J_1 = s_1 = 1\), substituting:
\[ RF_1 = 5 + (2-1) \times 1 = 6 \]
- Convolution layer 1 (layer 1): Derive to the input image:
\[ RF_0 = RF_1 + (k_1 - 1) \times J_0 \]
where \(J_0 = 1\) (no layer precedes layer 1, so there is no stride to accumulate), substituting:
\[ RF_0 = 6 + (3-1) \times 1 = 8 \]
Thus, each feature point in the output of Convolution layer 2 corresponds to an 8×8 region of the input image.
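As a cross-check on this result, the sketch below uses a different but equivalent backward recurrence, \(r \leftarrow (r - 1) \times s_l + k_l\), applied from the last layer to the first. It is not the formula used in the derivation above; here each intermediate `r` is the span needed at the input of the current layer. The function name and the `(kernel_size, stride)` layer encoding are assumptions made for illustration.

```python
def receptive_field_backward(layers):
    """Receptive field via the backward recurrence r = (r - 1) * s + k.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    r = 1  # one point on the final feature map
    for kernel_size, stride in reversed(layers):
        # r points spaced `stride` apart, each looking at `kernel_size` inputs
        r = (r - 1) * stride + kernel_size
    return r


# Conv (k=3, s=1) -> Pool (k=2, s=2) -> Conv (k=3, s=1), as in the example
example = [(3, 1), (2, 2), (3, 1)]
print(receptive_field_backward(example))  # 8
```

Both approaches agree on 8 for this network, which is a useful sanity check when working through such derivations by hand.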
3. Factors Influencing the Receptive Field
- Kernel size: Larger \(k\) leads to faster receptive field growth.
- Stride: A larger stride makes the receptive field grow faster in all subsequent layers, because it multiplies the cumulative stride \(J\).
- Dilated convolution: A dilation rate \(d\) spreads the kernel taps apart, so the effective kernel size becomes \(k' = k + (k-1)(d-1)\); substituting \(k'\) for \(k\) in the recurrence gives the receptive field.
- Network depth: More layers typically result in a larger receptive field.
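To see how these factors combine, the following sketch extends the recurrence with dilation by substituting the effective kernel size \(k'\) for \(k\). The `(kernel_size, stride, dilation)` layer encoding and both function names are assumptions chosen for illustration.

```python
def effective_kernel(kernel_size, dilation):
    """Effective kernel size k' = k + (k - 1)(d - 1) of a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field_dilated(layers):
    """layers: list of (kernel_size, stride, dilation), ordered input to output."""
    rf, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        rf += (effective_kernel(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Three stride-1 3x3 convolutions with dilation rates 1, 2, 4
print(receptive_field_dilated([(3, 1, 1), (3, 1, 2), (3, 1, 4)]))  # 15
# The same three layers without dilation
print(receptive_field_dilated([(3, 1, 1), (3, 1, 1), (3, 1, 1)]))  # 7
```

With the same depth and the same number of weights, the dilated stack more than doubles the receptive field in this example.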
4. Considerations in Practical Applications
- Matching receptive field with object size: If the receptive field is much smaller than the target object, the network struggles to understand the global context; if it is much larger, it may include too much irrelevant background.
- Effective receptive field: Studies show that the contribution of input pixels to a feature point is approximately Gaussian: pixels near the center contribute most and peripheral pixels contribute little, so the effective receptive field is smaller than the theoretical value.
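- Design of modern networks: Architectures like Inception and ResNet favor stacks of small kernels over single large kernels, maintaining the receptive field while reducing computational cost. For example, two stacked 3×3 convolutions cover the same 5×5 region as one 5×5 convolution with 18 instead of 25 weights per input-output channel pair.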
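The gap between the theoretical and the effective receptive field can be illustrated with a toy model. The sketch below assumes a 1-D stack of stride-1 convolutions with uniform (all-ones) weights and no nonlinearities, which is a strong simplification of a real network; the function name `influence_profile` is hypothetical. It computes how much each input position contributes to a single output unit.

```python
def influence_profile(num_layers, kernel_size=3):
    """Normalized influence of each input position on one output unit.

    Toy assumption: 1-D, stride 1, uniform weights, no nonlinearities.
    """
    profile = [1.0]  # the output unit itself
    for _ in range(num_layers):
        widened = [0.0] * (len(profile) + kernel_size - 1)
        for i, value in enumerate(profile):
            for j in range(kernel_size):
                widened[i + j] += value  # each tap has weight 1
        profile = widened
    total = sum(profile)
    return [value / total for value in profile]


profile = influence_profile(num_layers=5)
center = len(profile) // 2
print(len(profile))                              # theoretical RF: 1 + 5 * (3 - 1) = 11
print(sum(profile[center - 2:center + 3]))       # share of the central 5 positions
```

In this toy setting the central five of the eleven positions carry roughly 83% of the total influence, and the profile becomes increasingly bell-shaped as layers are added, which matches the qualitative picture described above.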
5. Summary
The receptive field is a key metric for assessing how much of the input a network's features can see, and it can be tuned through the combination of kernel size, stride, dilation, and depth. Understanding it helps in designing sound network architectures and avoiding mismatches between feature scales and task requirements.