Detailed Implementation and Application of Reservoir Sampling Algorithm

1. Problem Description

Reservoir Sampling is an online algorithm that solves a classic problem:
Given a data stream whose length is unknown, potentially very large, or even too large to fit in memory, how can we select \(k\) elements from it uniformly at random?
In other words, each element must be selected with probability \(k / n\), where \(n\) is the final total length of the data stream (a value the algorithm does not know while processing).


2. Core Idea

The core of reservoir sampling is dynamic probability adjustment.
Assume we maintain a "reservoir" (i.e., result array) of size \(k\). The algorithm reads the data stream sequentially. For the \(i\)-th element (let \(i\) start counting from 1):

  1. If \(i \leq k\), place the element directly into the reservoir.
  2. If \(i > k\), with probability \(k / i\) replace an element of the reservoir (chosen uniformly at random) with the current element.
    This guarantees that once all \(n\) elements have been processed, each element remains in the reservoir with probability exactly \(k / n\).

3. Algorithm Steps (Taking k=1 as an example, then generalizing to arbitrary k)

3.1 Simplest Case: k=1 (Selecting One Element Randomly)

  1. Initialize the result variable reservoir to null, and the counter count = 0.
  2. Traverse the data stream. For the i-th element (i starts from 1):
    • With probability \(1 / i\), update reservoir to the current element.
    • With probability \(1 - 1/i\), keep reservoir unchanged.
  3. After the traversal, reservoir is the randomly selected element.
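
The \(k=1\) steps above can be sketched in a few lines of Python (a minimal illustration; the function name `sample_one` is ours):

```python
import random

def sample_one(stream):
    """Return one element chosen uniformly at random from a stream."""
    reservoir = None
    for i, item in enumerate(stream, start=1):
        # With probability 1/i, the current element becomes the sample.
        if random.randrange(i) == 0:
            reservoir = item
    return reservoir
```

Note that `random.randrange(i)` returns an integer in \([0, i-1]\), so the comparison `== 0` succeeds with probability exactly \(1/i\).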

Correctness Proof (Mathematical Induction):

  • When \(n=1\), the probability is \(1/1 = 1\), which obviously holds.
  • Inductive hypothesis: after the first \(n-1\) elements have been processed, each of them is currently selected with probability \(1/(n-1)\).
  • For the \(n\)-th element:
    • Its probability of being selected is \(1/n\).
    • For any of the first \(n-1\) elements, its probability of being retained is:

\[ \frac{1}{n-1} \times \left(1 - \frac{1}{n}\right) = \frac{1}{n-1} \times \frac{n-1}{n} = \frac{1}{n} \]

  • Therefore, the final probability of each element is \(1/n\).
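
As a concrete check of the induction, take \(n = 3\). The survival probabilities of the three elements are

\[ 1 \times \left(1 - \frac{1}{2}\right)\left(1 - \frac{1}{3}\right) = \frac{1}{3}, \qquad \frac{1}{2} \times \left(1 - \frac{1}{3}\right) = \frac{1}{3}, \qquad \frac{1}{3}, \]

so each of the three elements is kept with probability exactly \(1/3\).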

3.2 General Case: Arbitrary k (Selecting k Elements Randomly)

  1. Initialize an array reservoir of size \(k\).
  2. Read the first \(k\) elements and directly place them into reservoir.
  3. Starting from the \(k+1\)-th element (let the current element be the \(i\)-th, where \(i > k\)):
    • Randomly generate an integer \(r \in [0, i-1]\) (uniformly random).
    • If \(r < k\), replace reservoir[r] with the current element.
  4. After the traversal, reservoir contains the randomly selected \(k\) elements.

Proof Idea:

  • For each of the first \(k\) elements, the probability that it remains in the final reservoir:
    • When the \(k+1\)-th element is processed, each of the first \(k\) elements is replaced with probability:

\[ \frac{k}{k+1} \times \frac{1}{k} = \frac{1}{k+1} \]

Thus, the retention probability is \(1 - 1/(k+1) = k/(k+1)\).
  • Extending to the \(n\)-th element, the retention probability for each of the first \(k\) elements is:

\[ \frac{k}{k+1} \times \frac{k+1}{k+2} \times \dots \times \frac{n-1}{n} = \frac{k}{n} \]

  • For the \(i\)-th element (where \(i > k\)):
    • Its probability of being selected into the reservoir is \(k/i\).
    • The probability of not being replaced afterward is:

\[ \frac{i}{i+1} \times \frac{i+1}{i+2} \times \dots \times \frac{n-1}{n} = \frac{i}{n} \]

  • Therefore, the final probability is \((k/i) \times (i/n) = k/n\).

4. Implementation Example (Python)

import random

def reservoir_sampling(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # enumerate is 0-based, so this is the (i+1)-th element:
            # replace a uniformly chosen slot with probability k / (i + 1).
            r = random.randint(0, i)  # uniform integer in [0, i], inclusive
            if r < k:
                reservoir[r] = item
    return reservoir

Note: random.randint(0, i) includes both endpoints, so it draws uniformly from the \(i+1\) values in \([0, i]\), giving a replacement probability of \(k/(i+1)\). Because enumerate counts from 0, the current element is the \((i+1)\)-th element of the stream, so this is exactly the \(k/i\) of the 1-indexed description above; the code is exactly correct, not merely an approximation. An equivalent way to draw the random index is:

r = random.randrange(0, i+1)  # [0, i]
if r < k:
    reservoir[r] = item

Or using floating-point numbers:

if random.random() < k / (i + 1):
    j = random.randrange(0, k)
    reservoir[j] = item
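
Repeating the function above so the snippet is self-contained, a quick Monte Carlo check (our own test harness; the constants n, k, and trials are illustrative) confirms that every element lands in the reservoir with frequency close to \(k/n\):

```python
import random
from collections import Counter

def reservoir_sampling(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            r = random.randint(0, i)  # replace with probability k / (i + 1)
            if r < k:
                reservoir[r] = item
    return reservoir

# Sample k=3 out of n=10 many times and count how often each element appears.
n, k, trials = 10, 3, 50000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sampling(range(n), k))

for element in range(n):
    freq = counts[element] / trials
    print(f"element {element}: {freq:.3f} (expected {k/n:.3f})")
```

Each printed frequency should hover near 0.300, the theoretical \(k/n = 3/10\).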

5. Application Scenarios

  1. Large Data Stream Random Sampling: Log sampling, network traffic analysis.
  2. Online Advertising Placement: Sampling from real-time clickstreams for effect analysis.
  3. Database Query Optimization: Quickly obtaining random samples from massive data for query plan evaluation.
  4. Machine Learning: Uniform sampling from data streams in distributed training.

6. Algorithm Variants and Extensions

  • Weighted Reservoir Sampling: Each element has a different weight, requiring sampling according to weight probabilities.
  • Distributed Reservoir Sampling: Data is distributed across multiple machines, each maintaining a local reservoir, which are merged later.
  • Sliding Window Reservoir Sampling: Performing equal-probability sampling only on the most recent data in the stream.
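
As an illustration of the first variant, the A-Res algorithm of Efraimidis and Spirakis implements weighted reservoir sampling: each item is assigned the key \(u^{1/w}\), where \(u\) is uniform on \((0, 1)\) and \(w\) is the item's weight, and the \(k\) items with the largest keys are kept. The sketch below (function name ours) maintains the running top-\(k\) with a min-heap:

```python
import heapq
import random

def weighted_reservoir_sampling(stream, k):
    """A-Res sketch: sample k items with inclusion probability scaled by weight.

    `stream` yields (item, weight) pairs with weight > 0.
    """
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            # Current item's key beats the weakest resident: swap it in.
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

For \(k = 1\), an item with weight \(w_i\) is selected with probability \(w_i / \sum_j w_j\); with all weights equal, the algorithm reduces to ordinary reservoir sampling.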

7. Complexity Analysis

  • Time Complexity: \(O(n)\), requiring only a single traversal.
  • Space Complexity: \(O(k)\), only maintaining a reservoir of size \(k\).
  • Random Number Generation Count: One random number is generated per element (except the first \(k\)), totaling \(O(n)\) times.

Summary: Reservoir sampling is a concise yet powerful random sampling algorithm. It performs exact uniform sampling in a single pass with only \(O(k)\) memory, even when the total data volume is unknown in advance. The key lies in dynamically adjusting the replacement probability so that each element's final retention probability is \(k/n\).