Reservoir Sampling Algorithm

Reservoir Sampling Algorithm

Problem Description
The reservoir sampling algorithm is used to address the following scenario: when the length N of a data stream (or linked list) is very large and cannot be loaded into memory all at once, how can we uniformly randomly select k elements (k ≥ 1). This algorithm guarantees that each element has an equal probability of being selected, specifically k/N, requires only a single pass through the data, and has a space complexity of only O(k).


Core Idea and Steps (for k=1)
When only one element needs to be randomly selected, the algorithm steps are as follows:

  1. Initialization: Initialize the result variable res with the first element.
  2. Traversal Processing: For the i-th element (counting i from 1):
    • Decide whether to replace res with the current element with a probability of 1/i.
    • Specific operation: Generate a random number r in the range [0,1). If r < 1/i, then update res to the current element.
  3. Termination: After the traversal ends, res is the randomly selected result.

Why is the probability 1/N?

  • First element: Probability of being selected = 1 (selected during initialization) × [Probability of not being replaced by the 2nd element] × ... × [Probability of not being replaced by the Nth element]
    = 1 × (1 - 1/2) × (1 - 1/3) × ... × (1 - 1/N)
    = 1 × (1/2) × (2/3) × ... × ((N-1)/N) = 1/N.
  • i-th element: Probability of being selected = 1/i (selected at step i) × [Probability of not being replaced by subsequent elements]
    = 1/i × (1 - 1/(i+1)) × ... × (1 - 1/N) = 1/i × (i/(i+1)) × ... × ((N-1)/N) = 1/N.

Generalization for k ≥ 1
When k elements need to be selected, the algorithm steps are as follows:

  1. Initialization: Directly place the first k elements into the reservoir (result array res).
  2. Traversal Processing: For the i-th element (starting from i = k+1):
    • Decide whether to add the current element to the reservoir with a probability of k/i.
    • If the decision is to add, then randomly replace one element in the reservoir (choosing an index from 1 to k with equal probability).
  3. Termination: After the traversal ends, the k elements in the reservoir constitute the random sampling result.

Why is the probability of each element being selected k/N?

  • For the first k elements:
    They are definitely selected initially. When the i-th element (i > k) is added, the probability that a specific initial element gets replaced = (k/i) × (1/k) = 1/i.
    Therefore, the probability that this initial element ultimately remains = ∏(i=k+1 to N) (1 - 1/i) = k/N.
  • For the i-th element (i > k):
    Probability of being selected = k/i (selected at step i) × ∏(j=i+1 to N) [1 - (k/j) × (1/k)] (not replaced subsequently)
    = k/i × ∏(j=i+1 to N) (1 - 1/j) = k/i × (i/N) = k/N.

Example Demonstration (k=2, data stream [a,b,c,d])

  1. Initialization: res = [a, b] (i=2).
  2. Process c (i=3): Decide whether to add it with probability 2/3. If added, randomly replace one position in res.
  3. Process d (i=4): Decide whether to add it with probability 2/4. If added, randomly replace one position in res.
  4. Ultimately, each of the two elements in res has a probability of being selected equal to 2/4 = 1/2.

Key Application Scenarios

  • Random sampling from big data streams (e.g., log sampling, candidate set generation for recommendation systems).
  • Random node selection from linked lists with unknown total length (e.g., shuffling a song playlist).

Through this clever probabilistic design, the reservoir sampling algorithm achieves fair random sampling from massive data streams with minimal space usage.