Reservoir Sampling Algorithm
Problem Description
The reservoir sampling algorithm is used to address the following scenario: when the length N of a data stream (or linked list) is very large and cannot be loaded into memory all at once, how can we uniformly randomly select k elements (k ≥ 1). This algorithm guarantees that each element has an equal probability of being selected, specifically k/N, requires only a single pass through the data, and has a space complexity of only O(k).
Core Idea and Steps (for k=1)
When only one element needs to be randomly selected, the algorithm steps are as follows:
- Initialization: Initialize the result variable
reswith the first element. - Traversal Processing: For the i-th element (counting i from 1):
- Decide whether to replace
reswith the current element with a probability of 1/i. - Specific operation: Generate a random number
rin the range [0,1). Ifr < 1/i, then updateresto the current element.
- Decide whether to replace
- Termination: After the traversal ends,
resis the randomly selected result.
Why is the probability 1/N?
- First element: Probability of being selected = 1 (selected during initialization) × [Probability of not being replaced by the 2nd element] × ... × [Probability of not being replaced by the Nth element]
= 1 × (1 - 1/2) × (1 - 1/3) × ... × (1 - 1/N)
= 1 × (1/2) × (2/3) × ... × ((N-1)/N) = 1/N. - i-th element: Probability of being selected = 1/i (selected at step i) × [Probability of not being replaced by subsequent elements]
= 1/i × (1 - 1/(i+1)) × ... × (1 - 1/N) = 1/i × (i/(i+1)) × ... × ((N-1)/N) = 1/N.
Generalization for k ≥ 1
When k elements need to be selected, the algorithm steps are as follows:
- Initialization: Directly place the first k elements into the reservoir (result array
res). - Traversal Processing: For the i-th element (starting from i = k+1):
- Decide whether to add the current element to the reservoir with a probability of k/i.
- If the decision is to add, then randomly replace one element in the reservoir (choosing an index from 1 to k with equal probability).
- Termination: After the traversal ends, the k elements in the reservoir constitute the random sampling result.
Why is the probability of each element being selected k/N?
- For the first k elements:
They are definitely selected initially. When the i-th element (i > k) is added, the probability that a specific initial element gets replaced = (k/i) × (1/k) = 1/i.
Therefore, the probability that this initial element ultimately remains = ∏(i=k+1 to N) (1 - 1/i) = k/N. - For the i-th element (i > k):
Probability of being selected = k/i (selected at step i) × ∏(j=i+1 to N) [1 - (k/j) × (1/k)] (not replaced subsequently)
= k/i × ∏(j=i+1 to N) (1 - 1/j) = k/i × (i/N) = k/N.
Example Demonstration (k=2, data stream [a,b,c,d])
- Initialization:
res= [a, b] (i=2). - Process c (i=3): Decide whether to add it with probability 2/3. If added, randomly replace one position in
res. - Process d (i=4): Decide whether to add it with probability 2/4. If added, randomly replace one position in
res. - Ultimately, each of the two elements in
reshas a probability of being selected equal to 2/4 = 1/2.
Key Application Scenarios
- Random sampling from big data streams (e.g., log sampling, candidate set generation for recommendation systems).
- Random node selection from linked lists with unknown total length (e.g., shuffling a song playlist).
Through this clever probabilistic design, the reservoir sampling algorithm achieves fair random sampling from massive data streams with minimal space usage.