Cardinality Estimation and Cost Model in Database Query Execution Plans
1. Basic Concepts of Cardinality Estimation and Cost Model
Cardinality Estimation is the process by which the query optimizer predicts the number of output rows for each operator (e.g., filter, join, group by). The Cost Model, building on those cardinality estimates and hardware characteristics (such as I/O and CPU costs), computes the cost of candidate execution plans so the optimizer can select the cheapest one. Together, they largely determine query performance.
2. Importance and Challenges of Cardinality Estimation
- Importance: Incorrect cardinality estimation (e.g., estimating 100 rows when the actual count is 1 million) may lead the optimizer to choose an inefficient plan (such as mistakenly using a nested loop join instead of a hash join).
- Challenges:
  - Uneven data distribution (e.g., rows with `age=25` account for 90% of the table).
  - Multi-column correlation (e.g., a strong correlation between `city='Beijing'` and `occupation='programmer'`).
  - Complex query conditions (e.g., involving `LIKE`, `BETWEEN`, or subqueries).
3. Common Methods for Cardinality Estimation
(1) Basic Statistics
Databases collect statistics via commands like `ANALYZE`, including:
- Histogram: Divides data into buckets by value range and records the frequency distribution per bucket. For example, after building a histogram on the `age` column, the proportion of rows matching `age BETWEEN 20 AND 30` can be estimated.
- Number of Distinct Values (NDV): Used to estimate the selectivity of equality predicates (e.g., the selectivity of `age=25` is approximately `1/NDV`). Both estimates are illustrated in the sketch after this list.
- Null Ratio, Data Density, etc.
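To make these statistics concrete, here is a minimal Python sketch of both estimates; the bucket boundaries, row counts, and NDV value are invented purely for illustration.

```python
# Minimal sketch: selectivity estimation from an equi-width histogram
# and from the number of distinct values (NDV). All numbers are made up.

# Histogram on `age`: (bucket_low, bucket_high, rows_in_bucket).
age_histogram = [
    (0, 20, 150_000),
    (20, 30, 400_000),
    (30, 40, 250_000),
    (40, 60, 150_000),
    (60, 100, 50_000),
]
total_rows = sum(count for _, _, count in age_histogram)  # 1,000,000

def range_selectivity(histogram, low, high):
    """Selectivity of `low <= col < high`, assuming rows are spread
    uniformly *within* each bucket."""
    matched = 0.0
    for b_low, b_high, count in histogram:
        overlap = max(0, min(high, b_high) - max(low, b_low))
        matched += count * overlap / (b_high - b_low)
    return matched / total_rows

def equality_selectivity(ndv):
    """Selectivity of `col = constant`, assuming each distinct value
    is equally frequent."""
    return 1.0 / ndv

print(range_selectivity(age_histogram, 20, 30))  # 0.4 (40% of rows)
print(equality_selectivity(50))                  # 0.02 (1/NDV)
```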
(2) Assumptions and Simplifications
- Independence Assumption: Assumes multi-column conditions are independent. For example, when estimating `age=25 AND salary>5000`, the product of the two selectivities is taken directly. Real data may be correlated (e.g., younger people tend to have lower salaries), leading to estimation bias; the sketch after this list demonstrates the effect.
- Uniform Distribution Assumption: Assumes data is uniformly distributed, but real-world data is often skewed (e.g., a few cities account for most of the population).
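A small sketch of the independence-assumption bias on synthetic, deliberately correlated data (the age/salary rule below is fabricated solely to create the correlation):

```python
import random

random.seed(42)

# Synthetic, deliberately correlated data: age < 30 implies a low salary.
rows = []
for _ in range(100_000):
    age = random.randint(18, 65)
    base = 3_000 if age < 30 else 8_000
    rows.append((age, base + random.randint(0, 2_000)))

def selectivity(pred):
    return sum(1 for r in rows if pred(r)) / len(rows)

sel_age = selectivity(lambda r: r[0] == 25)       # marginal: age = 25
sel_salary = selectivity(lambda r: r[1] > 5_000)  # marginal: salary > 5000
actual = selectivity(lambda r: r[0] == 25 and r[1] > 5_000)

# The independence assumption multiplies the marginals directly.
print(f"independent estimate: {sel_age * sel_salary:.5f}")
print(f"actual selectivity:   {actual:.5f}")
# Here age=25 implies salary <= 5000, so the true combined selectivity
# is zero while the independent estimate is not: a classic overestimate.
```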
(3) Techniques to Address Correlation
- Multi-column Statistics (Extended Statistics): Manually create statistics on column combinations (e.g., Oracle's `DBMS_STATS.CREATE_EXTENDED_STATS`).
- Dynamic Sampling: Sample a small amount of data at optimization time for complex queries to calibrate the estimate (sketched below).
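The core idea of dynamic sampling fits in a few lines. This sketch (with fabricated table contents) evaluates the predicate on a random sample and scales the hit ratio up; because the sample contains real rows, it naturally captures cross-column correlation that fixed per-column statistics miss.

```python
import random

def dynamic_sample_estimate(table, predicate, sample_size=1_000, seed=0):
    """Evaluate the predicate on a small random sample and scale the
    hit ratio up to the full table size."""
    rng = random.Random(seed)
    sample = rng.sample(table, min(sample_size, len(table)))
    hits = sum(1 for row in sample if predicate(row))
    return hits / len(sample) * len(table)

# Fabricated table with built-in city/occupation correlation.
table = ([("Beijing", "programmer")] * 30_000
         + [("Shanghai", "teacher")] * 70_000)
est = dynamic_sample_estimate(
    table, lambda r: r == ("Beijing", "programmer"))
print(f"sampled estimate: {est:.0f} rows (true count: 30000)")
```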
4. Calculation Components of the Cost Model
The cost model decomposes an execution plan into operators (e.g., scan, join, sort) and assigns costs to each:
- I/O Cost: Number of data page reads/writes (influenced by index type, caching).
- CPU Cost: Computational effort required to process one row of data.
- Memory Cost: Memory usage for operations like hash join or sort.
Example Cost Formula (Simplified):
Total Cost = Sequential Scan Cost + Filter Cost + Join Cost
Sequential Scan Cost = Table Pages × Single Page I/O Cost
Filter Cost = Scanned Rows × Single Row CPU Cost
Nested Loop Join Cost = Outer Table Rows × Inner Table Single Scan Cost
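These formulas translate directly into code. A minimal sketch with arbitrary cost constants (real optimizers expose similar knobs, e.g., PostgreSQL's `seq_page_cost` and `cpu_tuple_cost`); the table sizes below are invented for illustration.

```python
# The simplified formulas above, as code. The constants are arbitrary
# "cost units"; real systems calibrate them to hardware.
PAGE_IO_COST = 1.0    # cost of reading one data page
ROW_CPU_COST = 0.01   # cost of processing one row

def seq_scan_cost(table_pages):
    return table_pages * PAGE_IO_COST

def filter_cost(scanned_rows):
    return scanned_rows * ROW_CPU_COST

def nested_loop_join_cost(outer_rows, inner_scan_cost):
    # The inner side is re-scanned once per outer row.
    return outer_rows * inner_scan_cost

# A 10,000-page, 1,000,000-row outer table joined to a 100-page inner
# table, with 8,000 outer rows surviving the filter.
total_cost = (seq_scan_cost(10_000)
              + filter_cost(1_000_000)
              + nested_loop_join_cost(8_000, seq_scan_cost(100)))
print(f"total plan cost: {total_cost:,.0f} cost units")
```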
5. Workflow of a Real Optimizer
Taking the estimation of `SELECT * FROM users WHERE age>30 AND city='Beijing'` as an example:
- Collect Statistics: The histogram shows rows with `age>30` account for 40%; the NDV of the `city` column is 50, so the selectivity of `city='Beijing'` is 1/50.
- Cardinality Estimation:
  - Assuming independence, combined selectivity = 40% × 1/50 = 0.8%.
  - With 1 million rows in the table, estimated output rows = 1 million × 0.8% = 8,000.
- Cost Calculation:
  - Plan A: Full table scan (cost = 1 million rows × 0.1 cost unit = 100,000 units).
  - Plan B: Scan the `city` index, then filter `age>30` via table access (cost = index traversal + table-access I/O).
- Compare the costs and select the cheaper plan (the arithmetic is reproduced in code below).
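The walkthrough's arithmetic in code. Only the 0.1-unit-per-row figure for Plan A comes from the example above; the per-row constants for Plan B are invented for illustration.

```python
TOTAL_ROWS = 1_000_000

# Statistics from the walkthrough.
sel_age = 0.40      # histogram: age > 30 matches 40% of rows
sel_city = 1 / 50   # NDV(city) = 50  ->  selectivity 1/50

# Cardinality estimate under the independence assumption.
estimated_rows = TOTAL_ROWS * sel_age * sel_city   # 8,000 rows

# Plan A: full table scan at 0.1 cost units per row.
plan_a = TOTAL_ROWS * 0.1                          # 100,000 units

# Plan B: index scan on city, then table access for the matching rows.
# 0.05 (index traversal) and 0.2 (table-access I/O) per row are made up.
index_rows = TOTAL_ROWS * sel_city                 # 20,000 rows via index
plan_b = index_rows * 0.05 + index_rows * 0.2      # 5,000 units

print(f"estimated output rows: {estimated_rows:.0f}")
print(f"Plan A: {plan_a:,.0f} units, Plan B: {plan_b:,.0f} units")
print("chosen:", "Plan B (index scan)" if plan_b < plan_a else "Plan A")
```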
6. Common Issues and Optimization Techniques
- Scenarios with Large Estimation Bias:
  - Multi-column correlated queries: require extended statistics on the column combination.
  - Complex expressions (e.g., `UPPER(name)='ALICE'`): require functional indexes or expression statistics.
- Database Optimization Techniques:
  - PostgreSQL: Supports multi-column statistics and custom statistics targets (`ALTER TABLE ... SET STATISTICS`).
  - MySQL: Uses `INDEX MERGE` to handle multi-condition queries, but beware of cost misjudgment when indexes are merged.
  - Oracle: Provides SQL Plan Management (SPM) to pin known-good plans, avoiding performance regressions caused by estimation errors.
7. Summary
Cardinality estimation and the cost model are the core of the query optimizer, and their accuracy depends directly on the quality of statistics. In practice, update statistics regularly, validate the execution plans of complex queries, and, when necessary, use optimizer hints (e.g., `/*+ INDEX(...) */`) or plan-guidance techniques (such as SQL Plan Baselines) to correct estimation bias.