Detailed Implementation and Application of Reservoir Sampling Algorithm
1. Problem Description
Reservoir Sampling is an online algorithm used to solve a classic problem:
Given a data stream whose length is unknown in advance and may be very large, possibly too large to fit in memory, how can we select \(k\) elements from it uniformly at random?
In other words, each element must be selected with probability \(k / n\), where \(n\) is the final total length of the data stream, a value the algorithm does not know while processing.
2. Core Idea
The core of reservoir sampling is dynamic probability adjustment.
Assume we maintain a "reservoir" (i.e., result array) of size \(k\). The algorithm reads the data stream sequentially. For the \(i\)-th element (let \(i\) start counting from 1):
- If the number of elements read so far satisfies \(i \leq k\), place the element directly into the reservoir.
- If \(i > k\), keep the current element with probability \(k / i\); if it is kept, it replaces a position of the reservoir chosen uniformly at random.
This ensures that after processing the first \(n\) elements, the probability of each element remaining in the reservoir is exactly \(k / n\).
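For example, with \(k = 2\): the first two elements fill the reservoir directly; when the 5th element arrives it is kept with probability \(2/5\), and if kept it overwrites one of the two reservoir slots with probability \(1/2\) each, so any specific slot is overwritten with probability \(1/5\).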
3. Algorithm Steps (Taking k=1 as an example, then generalizing to arbitrary k)
3.1 Simplest Case: k=1 (Selecting One Element Randomly)
- Initialize the result variable reservoir to null and the counter count = 0.
- Traverse the data stream. For the \(i\)-th element (\(i\) starts from 1):
  - With probability \(1 / i\), update reservoir to the current element.
  - With probability \(1 - 1/i\), keep reservoir unchanged.
- After the traversal, reservoir holds the randomly selected element.
Correctness Proof (Mathematical Induction):
- Base case \(n = 1\): the single element is selected with probability \(1/1 = 1\), so the claim holds.
- Inductive hypothesis: after the first \(n-1\) elements have been processed, each of them is the currently selected element with probability \(1/(n-1)\).
- When the \(n\)-th element arrives:
  - It becomes the selected element with probability \(1/n\).
  - Any one of the first \(n-1\) elements stays selected only if it was already selected (probability \(1/(n-1)\)) and the \(n\)-th element does not displace it (probability \(1 - 1/n\)):
\[ \frac{1}{n-1} \times \left(1 - \frac{1}{n}\right) = \frac{1}{n-1} \times \frac{n-1}{n} = \frac{1}{n} \]
- Therefore, every element ends up selected with probability \(1/n\).
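To make the \(k = 1\) case concrete, here is a minimal Python sketch (the function name sample_one is purely illustrative):

import random

def sample_one(stream):
    reservoir = None
    for i, item in enumerate(stream, start=1):
        # The i-th element replaces the current choice with probability 1/i.
        if random.randrange(i) == 0:
            reservoir = item
    return reservoir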
3.2 General Case: Arbitrary k (Selecting k Elements Randomly)
- Initialize an array reservoir of size \(k\).
- Read the first \(k\) elements and place them directly into reservoir.
- Starting from the \((k+1)\)-th element (let the current element be the \(i\)-th, where \(i > k\)):
  - Generate an integer \(r \in [0, i-1]\) uniformly at random.
  - If \(r < k\), replace reservoir[r] with the current element.
- After the traversal, reservoir contains the \(k\) randomly selected elements.
Proof Idea:
- For each of the first \(k\) elements, consider the probability that it remains in the final reservoir:
  - When the \((k+1)\)-th element is processed, a given one of the first \(k\) elements is replaced only if the new element is kept (probability \(k/(k+1)\)) and it lands on that element's slot (probability \(1/k\)):
\[ \frac{k}{k+1} \times \frac{1}{k} = \frac{1}{k+1} \]
  Thus, the retention probability at this step is \(1 - 1/(k+1) = k/(k+1)\).
  - Repeating this argument up to the \(n\)-th element, the overall retention probability for each of the first \(k\) elements is the telescoping product:
\[ \frac{k}{k+1} \times \frac{k+1}{k+2} \times \dots \times \frac{n-1}{n} = \frac{k}{n} \]
- For the \(i\)-th element (where \(i > k\)):
  - It is selected into the reservoir with probability \(k/i\).
  - At each later step \(j > i\), it is replaced with probability \(\frac{k}{j} \times \frac{1}{k} = \frac{1}{j}\), so the probability that it is never replaced afterward is:
\[ \frac{i}{i+1} \times \frac{i+1}{i+2} \times \dots \times \frac{n-1}{n} = \frac{i}{n} \]
  - Therefore, its final probability of being in the reservoir is \((k/i) \times (i/n) = k/n\).
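As a concrete arithmetic check, take \(n = 5\), \(k = 2\), and the element arriving at position \(i = 3\): it enters the reservoir with probability \(2/3\) and then survives the remaining two steps with probability \(\frac{3}{4} \times \frac{4}{5}\), so its final probability is
\[ \frac{2}{3} \times \frac{3}{4} \times \frac{4}{5} = \frac{2}{5} = \frac{k}{n} \]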
4. Implementation Example (Python)
import random

def reservoir_sampling(stream, k):
    reservoir = []
    for i, item in enumerate(stream):  # i is 0-based, so item is the (i + 1)-th element
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k elements
        else:
            r = random.randint(0, i)  # generate a random integer in [0, i], i.e. i + 1 choices
            if r < k:
                reservoir[r] = item  # replace slot r with probability k / (i + 1)
    return reservoir
Note: random.randint(0, i) includes both endpoints, so it draws uniformly from the \(i+1\) values in \([0, i]\), giving a replacement probability of \(k/(i+1)\). Because enumerate counts from 0, the current element is the \((i+1)\)-th element of the stream, so \(k/(i+1)\) is exactly the \(k/i\) of the 1-based description above; the code is an exact implementation, not an approximation. An equivalent way to draw the index is:
r = random.randrange(0, i + 1)  # uniform over [0, i]
if r < k:
    reservoir[r] = item
Or, using a floating-point comparison instead of an integer draw:
if random.random() < k / (i + 1):  # keep the current element with probability k / (i + 1)
    j = random.randrange(0, k)     # choose a reservoir slot uniformly at random
    reservoir[j] = item
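As a quick empirical check of the \(k/n\) claim, the sampler can be run many times on a short stream and the selection frequency of each element compared with \(k/n\). The snippet below is a minimal sketch that reuses the reservoir_sampling function defined above; with \(n = 10\) and \(k = 3\), every element should be selected in roughly 30% of the trials.

from collections import Counter

trials = 100_000
counts = Counter()
for _ in range(trials):
    for x in reservoir_sampling(range(10), 3):  # n = 10, k = 3
        counts[x] += 1

for x in range(10):
    # Each empirical frequency should be close to k / n = 0.3.
    print(x, round(counts[x] / trials, 3))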
5. Application Scenarios
- Large Data Stream Random Sampling: Log sampling, network traffic analysis.
- Online Advertising Placement: Sampling from real-time clickstreams for effect analysis.
- Database Query Optimization: Quickly obtaining random samples from massive data for query plan evaluation.
- Machine Learning: Uniform sampling from data streams in distributed training.
6. Algorithm Variants and Extensions
- Weighted Reservoir Sampling: Each element carries a weight, and elements must be sampled with probability proportional to their weights (see the sketch after this list).
- Distributed Reservoir Sampling: Data is distributed across multiple machines, each maintaining a local reservoir, which are merged later.
- Sliding Window Reservoir Sampling: Performing equal-probability sampling only on the most recent data in the stream.
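For the weighted variant, one well-known approach is the A-Res algorithm of Efraimidis and Spirakis: assign each incoming element the key \(u^{1/w}\), where \(u\) is drawn uniformly from \((0, 1)\) and \(w\) is the element's weight, and keep the \(k\) elements with the largest keys. The sketch below is a minimal illustration under the assumption that the stream yields (item, weight) pairs with positive weights; the function name weighted_reservoir_sampling is ours, not from any library.

import heapq
import random

def weighted_reservoir_sampling(stream, k):
    # Min-heap of (key, tiebreaker, item); the smallest key sits at heap[0].
    heap = []
    for i, (item, weight) in enumerate(stream):
        key = random.random() ** (1.0 / weight)  # key = u^(1/w)
        if len(heap) < k:
            heapq.heappush(heap, (key, i, item))
        elif key > heap[0][0]:
            # The new key beats the current minimum, so it takes its place.
            heapq.heapreplace(heap, (key, i, item))
    return [item for _, _, item in heap]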
7. Complexity Analysis
- Time Complexity: \(O(n)\), requiring only a single traversal.
- Space Complexity: \(O(k)\), only maintaining a reservoir of size \(k\).
- Random Number Generation Count: One random number is generated per element (except the first \(k\)), totaling \(O(n)\) times.
Summary: Reservoir sampling is a concise yet powerful random sampling algorithm. Even when the total data volume is unknown in advance, it performs exactly uniform random sampling in a single pass using only \(O(k)\) memory. The key lies in dynamically adjusting the replacement probability so that each element's final retention probability is exactly \(k/n\).