Cardinality Estimation Algorithms: Improvements and Comparisons of HyperLogLog

Cardinality estimation algorithms are used to count the number of distinct elements in a dataset. HyperLogLog (HLL) is an efficient and widely used algorithm, but it is not the only choice. This article introduces other classic cardinality estimation algorithms and compares them with HLL.

1. Problem Definition of Cardinality Estimation

  • Problem: Given a large-scale dataset containing duplicate elements, how can the number of distinct elements (the cardinality) be estimated quickly?
  • Challenge: Exact calculation (e.g., using a hash table) requires O(n) memory (where n is the cardinality), which is infeasible for massive data.
  • Goal: Achieve controllable error rates using sublinear memory (e.g., a few KB).

2. Linear Counting Algorithm

  • Core Idea: Use a bitmap of length m (initialized to all 0s).
    • Hash each element, mapping it to a position in [0, m-1], and set that position to 1.
    • Count the number of zeros U in the bitmap.
    • Cardinality estimation formula: n̂ = -m * ln(U/m).
  • Key Principle: If the hash function is uniform, each bit stays 0 with probability (1 - 1/m)^n ≈ e^{-n/m}, so the expected number of zeros is E[U] = m·e^{-n/m}; inverting this relationship gives the estimator above (see the sketch after this list).
  • Pros and Cons:
    • Pros: Simple, high accuracy at low cardinalities.
    • Cons: When n approaches m, hash collisions increase dramatically, causing errors to rise rapidly; memory consumption is O(m).
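
To make the steps concrete, the following is a minimal Linear Counting sketch in Python; the bitmap size (2^16 bits) and the use of SHA-1 as the hash function are illustrative assumptions, not part of the algorithm itself.

```python
import hashlib
import math


class LinearCounter:
    """Minimal Linear Counting sketch: an m-bit bitmap plus the -m*ln(U/m) estimator."""

    def __init__(self, m=1 << 16):
        self.m = m                       # bitmap size in bits (assumed 2^16 here)
        self.bits = bytearray(m // 8)    # bitmap, initialized to all zeros

    def _position(self, item):
        # Hash the item to a (roughly) uniform position in [0, m - 1].
        digest = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        pos = self._position(item)
        self.bits[pos // 8] |= 1 << (pos % 8)

    def estimate(self):
        ones = sum(bin(byte).count("1") for byte in self.bits)
        zeros = self.m - ones            # U: number of bits still unset
        if zeros == 0:
            return float("inf")          # bitmap saturated; the estimator breaks down
        return -self.m * math.log(zeros / self.m)


if __name__ == "__main__":
    lc = LinearCounter()
    for i in range(60_000):
        lc.add(f"user-{i % 30_000}")     # 30,000 distinct elements, each added twice
    print(round(lc.estimate()))          # close to 30,000 at this load factor
```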

3. LogLog Algorithm

  • Core Idea: Split the hash values into m buckets and estimate the cardinality from the position of the first 1 bit (the rank) seen in each bucket (see the sketch after this list).
    • The hash function generates a sufficiently long binary string (e.g., 64 bits).
    • The first log₂(m) bits of the hash select one of m buckets; the rank is the position of the first 1 bit in the remaining bits (e.g., a bit string beginning 0001... has rank = 4).
    • Each bucket i keeps the maximum rank M_i it has observed. Cardinality estimation: n̂ = α_m * m * 2^{(1/m) Σ M_i}; the exponent is the arithmetic mean of the M_i, which makes the estimate a geometric mean of the per-bucket estimates 2^{M_i} (α_m is a correction factor).
  • Key Principle: A rank of at least k occurs with probability 2^{-k}, so once a bucket has seen about n/m distinct elements its maximum rank is close to log₂(n/m); averaging over the m buckets turns these noisy per-bucket observations into a usable estimate of n.
  • Pros and Cons:
    • Pros: Memory usage only requires storing m counters (a few bits each).
    • Cons: Higher variance, error approximately 1.3/√m.
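
The following is a minimal LogLog sketch in Python under the same caveat: the bucket-count parameter (b = 10 index bits, m = 1024 buckets), the SHA-1 hash, and the correction constant are illustrative choices.

```python
import hashlib


class LogLog:
    """Minimal LogLog sketch: m buckets, each storing the maximum rank it has seen."""

    def __init__(self, b=10):
        self.b = b                  # number of index bits (assumed 10 here)
        self.m = 1 << b             # number of buckets (1024)
        self.M = [0] * self.m       # maximum rank per bucket
        self.alpha = 0.39701        # large-m limit of the LogLog correction factor

    def _hash64(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big")    # 64-bit hash value

    def add(self, item):
        x = self._hash64(item)
        j = x >> (64 - self.b)                      # first b bits select the bucket
        w = x & ((1 << (64 - self.b)) - 1)          # remaining 64 - b bits
        # Rank: position of the first 1 bit in w, counted from the most significant side.
        rank = (64 - self.b) - w.bit_length() + 1
        self.M[j] = max(self.M[j], rank)

    def estimate(self):
        # Arithmetic mean of the ranks in the exponent = geometric mean of the 2^{M_i}.
        return self.alpha * self.m * 2 ** (sum(self.M) / self.m)
```

HyperLogLog (next section) reuses these registers unchanged; only the final combining step differs.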

4. Improvements of HyperLogLog

  • Core Improvement: Use harmonic mean instead of geometric mean to reduce the impact of extreme values.
    • The steps are the same as LogLog, but the final estimation formula is: n̂ = α_m * m² / (Σ 2^{-M_i}), where M_i is the maximum rank in bucket i and α_m is a bias-correction constant (a different value from the LogLog constant; α_m ≈ 0.7213 for large m).
    • Harmonic mean is less sensitive to outliers, making estimates more stable.
  • Advantage: The standard error drops to about 1.04/√m, giving higher accuracy with the same memory (see the sketch below).
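
As a sketch of that single change, the function below computes the harmonic-mean estimate from the same per-bucket maximum ranks produced by the LogLog sketch above; the correction constant shown is the standard large-m approximation, and the small/large-range corrections from the HLL paper are omitted here.

```python
def hyperloglog_estimate(M):
    """Raw HyperLogLog estimate from a list M of per-bucket maximum ranks.

    The registers are built exactly as in the LogLog sketch above; only the
    final combination step changes to a harmonic mean.
    """
    m = len(M)
    alpha = 0.7213 / (1 + 1.079 / m)            # bias-correction constant (for m >= 128)
    harmonic_sum = sum(2.0 ** -Mi for Mi in M)  # Σ 2^{-M_i}
    return alpha * m * m / harmonic_sum
```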

5. Algorithm Comparison and Selection

  • Memory Efficiency (target standard error of about 1%):
    • Linear Counting: the bitmap must scale with the cardinality itself, so memory is O(n) bits rather than a fixed budget.
    • LogLog: error ≈ 1.30/√m requires roughly m ≈ 17,000 buckets, on the order of 10 KB with 5-bit registers.
    • HyperLogLog: error ≈ 1.04/√m requires roughly m ≈ 11,000 buckets, about 7 KB; for the same memory it gives roughly 20% lower error than LogLog. (The widely cited "1.5 KB" figure from the original HLL paper corresponds to about 2% error.)
  • Applicable Scenarios:
    • Low cardinality (n < m/10): Linear Counting is more accurate.
    • High cardinality (n > m/10): HyperLogLog is optimal.
    • Compromise: Practical systems (e.g., Redis) often combine Linear Counting and HLL, dynamically switching based on the estimated cardinality (see the sketch after this list).
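
To illustrate the switching idea, here is a sketch of the small-range correction described in the original HyperLogLog paper: when the raw estimate is small and some registers are still zero, a Linear Counting estimate computed over the registers is returned instead. The 2.5·m threshold follows the paper; the large-range correction is omitted.

```python
import math


def corrected_estimate(M):
    """HLL harmonic-mean estimate with the Linear Counting fallback for small ranges.

    M is the list of per-bucket maximum ranks. The 2.5*m threshold follows the
    original HyperLogLog paper; the large-range correction is omitted.
    """
    m = len(M)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -Mi for Mi in M)

    zero_registers = M.count(0)
    if raw <= 2.5 * m and zero_registers > 0:
        # Small range: treat the registers as a Linear Counting bitmap,
        # where a zero register plays the role of an unset bit.
        return m * math.log(m / zero_registers)
    return raw
```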

6. Practical Application Examples

  • Redis's PFCOUNT: Implemented with HLL; each key uses at most 12KB of memory (16,384 six-bit registers), counts up to about 2^64 distinct elements, and has a standard error of roughly 0.81% (see the usage sketch after this list).
  • Big Data Deduplication: Use HLL in data pipelines to pre-estimate unique user counts, avoiding full scans.
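
As a usage sketch, assuming a local Redis server and the redis-py client (the key name and values are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# PFADD inserts elements into the HLL registers stored under one key.
r.pfadd("daily:visitors", "user-1", "user-2", "user-3", "user-1")

# PFCOUNT returns the estimated number of distinct elements added so far.
print(r.pfcount("daily:visitors"))   # expected output: 3
```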

The comparison shows that HyperLogLog offers the best balance of memory efficiency and accuracy in most scenarios. Still, understanding its evolution and alternatives helps in tailoring a design to specific needs.