Cardinality Estimation and Cost Model in Database Query Execution Plans
1. Basic Concepts of Cardinality Estimation and Cost Model
Cardinality Estimation is the process by which the query optimizer predicts the number of output rows for each operator (e.g., filter, join, group by). The Cost Model, building on those cardinality estimates and hardware characteristics (such as I/O and CPU costs), computes the cost of candidate execution plans so the optimizer can select the cheapest one. Together, they largely determine query performance.
2. Importance and Challenges of Cardinality Estimation
- Importance: Incorrect cardinality estimation (e.g., estimating 100 rows when the actual count is 1 million) may lead the optimizer to choose an inefficient plan (such as mistakenly using a nested loop join instead of a hash join).
- Challenges:
  - Uneven data distribution (e.g., rows with `age=25` account for 90% of the table).
  - Multi-column correlation (e.g., a strong correlation between `city='Beijing'` and `occupation='programmer'`).
  - Complex query conditions (e.g., involving `LIKE`, `BETWEEN`, or subqueries).
3. Common Methods for Cardinality Estimation
(1) Basic Statistics
Databases collect statistics via commands like `ANALYZE`, including:
- Histogram: Divides data into buckets by value range and records the frequency distribution per bucket. For example, after building a histogram on the `age` column, the proportion of rows matching `age BETWEEN 20 AND 30` can be estimated.
- Number of Distinct Values (NDV): Used to estimate the selectivity of equality predicates (e.g., the selectivity of `age=25` is approximately `1/NDV`). Both estimates are illustrated in the sketch after this list.
- Null Ratio, Data Density, etc.
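To make these statistics concrete, here is a minimal Python sketch of both estimates; the bucket boundaries, row counts, and NDV value are invented purely for illustration.

```python
# Minimal sketch: selectivity estimation from an equi-width histogram
# and from the number of distinct values (NDV). All numbers are made up.

# Histogram on `age`: (bucket_low, bucket_high, rows_in_bucket).
age_histogram = [
    (0, 20, 150_000),
    (20, 30, 400_000),
    (30, 40, 250_000),
    (40, 60, 150_000),
    (60, 100, 50_000),
]
total_rows = sum(count for _, _, count in age_histogram)  # 1,000,000

def range_selectivity(histogram, low, high):
    """Selectivity of `low <= col < high`, assuming rows are spread
    uniformly *within* each bucket."""
    matched = 0.0
    for b_low, b_high, count in histogram:
        overlap = max(0, min(high, b_high) - max(low, b_low))
        matched += count * overlap / (b_high - b_low)
    return matched / total_rows

def equality_selectivity(ndv):
    """Selectivity of `col = constant`, assuming each distinct value
    is equally frequent."""
    return 1.0 / ndv

print(range_selectivity(age_histogram, 20, 30))  # 0.4 (40% of rows)
print(equality_selectivity(50))                  # 0.02 (1/NDV)
```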
(2) Assumptions and Simplifications
- Independence Assumption: Assumes multi-column conditions are independent. For example, when estimating `age=25 AND salary>5000`, the product of the two selectivities is taken directly. Real data may be correlated (e.g., younger people tend to have lower salaries), leading to estimation bias; the sketch after this list demonstrates the effect.
- Uniform Distribution Assumption: Assumes data is uniformly distributed, but real-world data is often skewed (e.g., a few cities account for most of the population).
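A small sketch of the independence-assumption bias on synthetic, deliberately correlated data (the age/salary rule below is fabricated solely to create the correlation):

```python
import random

random.seed(42)

# Synthetic, deliberately correlated data: age < 30 implies a low salary.
rows = []
for _ in range(100_000):
    age = random.randint(18, 65)
    base = 3_000 if age < 30 else 8_000
    rows.append((age, base + random.randint(0, 2_000)))

def selectivity(pred):
    return sum(1 for r in rows if pred(r)) / len(rows)

sel_age = selectivity(lambda r: r[0] == 25)       # marginal: age = 25
sel_salary = selectivity(lambda r: r[1] > 5_000)  # marginal: salary > 5000
actual = selectivity(lambda r: r[0] == 25 and r[1] > 5_000)

# The independence assumption multiplies the marginals directly.
print(f"independent estimate: {sel_age * sel_salary:.5f}")
print(f"actual selectivity:   {actual:.5f}")
# Here age=25 implies salary <= 5000, so the true combined selectivity
# is zero while the independent estimate is not: a classic overestimate.
```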
(3) Techniques to Address Correlation
- Multi-column Statistics (Extended Statistics): Manually create statistics on column combinations (e.g., Oracle's `DBMS_STATS.CREATE_EXTENDED_STATS`).
- Dynamic Sampling: Sample a small amount of data at optimization time for complex queries to calibrate the estimate (sketched below).
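The core idea of dynamic sampling fits in a few lines. This sketch (with fabricated table contents) evaluates the predicate on a random sample and scales the hit ratio up; because the sample contains real rows, it naturally captures cross-column correlation that fixed per-column statistics miss.

```python
import random

def dynamic_sample_estimate(table, predicate, sample_size=1_000, seed=0):
    """Evaluate the predicate on a small random sample and scale the
    hit ratio up to the full table size."""
    rng = random.Random(seed)
    sample = rng.sample(table, min(sample_size, len(table)))
    hits = sum(1 for row in sample if predicate(row))
    return hits / len(sample) * len(table)

# Fabricated table with built-in city/occupation correlation.
table = ([("Beijing", "programmer")] * 30_000
         + [("Shanghai", "teacher")] * 70_000)
est = dynamic_sample_estimate(
    table, lambda r: r == ("Beijing", "programmer"))
print(f"sampled estimate: {est:.0f} rows (true count: 30000)")
```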
4. Calculation Components of the Cost Model
The cost model decomposes an execution plan into operators (e.g., scan, join, sort) and assigns costs to each:
- I/O Cost: Number of data page reads/writes (influenced by index type, caching).
- CPU Cost: Computational effort required to process one row of data.
- Memory Cost: Memory usage for operations like hash join or sort.
Example Cost Formula (Simplified):
Total Cost = Sequential Scan Cost + Filter Cost + Join Cost
Sequential Scan Cost = Table Pages × Single Page I/O Cost
Filter Cost = Scanned Rows × Single Row CPU Cost
Nested Loop Join Cost = Outer Table Rows × Inner Table Single Scan Cost
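These formulas translate directly into code. A minimal sketch with arbitrary cost constants (real optimizers expose similar knobs, e.g., PostgreSQL's `seq_page_cost` and `cpu_tuple_cost`); the table sizes below are invented for illustration.

```python
# The simplified formulas above, as code. The constants are arbitrary
# "cost units"; real systems calibrate them to hardware.
PAGE_IO_COST = 1.0    # cost of reading one data page
ROW_CPU_COST = 0.01   # cost of processing one row

def seq_scan_cost(table_pages):
    return table_pages * PAGE_IO_COST

def filter_cost(scanned_rows):
    return scanned_rows * ROW_CPU_COST

def nested_loop_join_cost(outer_rows, inner_scan_cost):
    # The inner side is re-scanned once per outer row.
    return outer_rows * inner_scan_cost

# A 10,000-page, 1,000,000-row outer table joined to a 100-page inner
# table, with 8,000 outer rows surviving the filter.
total_cost = (seq_scan_cost(10_000)
              + filter_cost(1_000_000)
              + nested_loop_join_cost(8_000, seq_scan_cost(100)))
print(f"total plan cost: {total_cost:,.0f} cost units")
```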
5. Workflow of a Real Optimizer
Taking the estimation of `SELECT * FROM users WHERE age>30 AND city='Beijing'` as an example:
- Collect Statistics: The histogram shows rows with `age>30` account for 40%; the NDV of the `city` column is 50, so the selectivity of `city='Beijing'` is 1/50.
- Cardinality Estimation:
  - Assuming independence, combined selectivity = 40% × 1/50 = 0.8%.
  - With 1 million rows in the table, estimated output rows = 1 million × 0.8% = 8,000.
- Cost Calculation:
  - Plan A: Full table scan (cost = 1 million rows × 0.1 cost unit = 100,000 units).
  - Plan B: Scan the `city` index, then filter `age>30` via table access (cost = index traversal + table-access I/O).
- Compare the costs and select the cheaper plan (the arithmetic is reproduced in code below).
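The walkthrough's arithmetic in code. Only the 0.1-unit-per-row figure for Plan A comes from the example above; the per-row constants for Plan B are invented for illustration.

```python
TOTAL_ROWS = 1_000_000

# Statistics from the walkthrough.
sel_age = 0.40      # histogram: age > 30 matches 40% of rows
sel_city = 1 / 50   # NDV(city) = 50  ->  selectivity 1/50

# Cardinality estimate under the independence assumption.
estimated_rows = TOTAL_ROWS * sel_age * sel_city   # 8,000 rows

# Plan A: full table scan at 0.1 cost units per row.
plan_a = TOTAL_ROWS * 0.1                          # 100,000 units

# Plan B: index scan on city, then table access for the matching rows.
# 0.05 (index traversal) and 0.2 (table-access I/O) per row are made up.
index_rows = TOTAL_ROWS * sel_city                 # 20,000 rows via index
plan_b = index_rows * 0.05 + index_rows * 0.2      # 5,000 units

print(f"estimated output rows: {estimated_rows:.0f}")
print(f"Plan A: {plan_a:,.0f} units, Plan B: {plan_b:,.0f} units")
print("chosen:", "Plan B (index scan)" if plan_b < plan_a else "Plan A")
```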
6. Common Issues and Optimization Techniques
- Scenarios with Large Estimation Bias:
  - Multi-column correlated queries: require extended statistics on the column combination.
  - Complex expressions (e.g., `UPPER(name)='ALICE'`): require functional indexes or expression statistics.
- Database Optimization Techniques:
  - PostgreSQL: Supports multi-column statistics and custom statistics targets (`ALTER TABLE ... SET STATISTICS`).
  - MySQL: Uses `INDEX MERGE` to handle multi-condition queries, but beware of cost misjudgment when indexes are merged.
  - Oracle: Provides SQL Plan Management (SPM) to pin known-good plans, avoiding performance regressions caused by estimation errors.
7. Summary
Cardinality estimation and the cost model are the core of the query optimizer, and their accuracy depends directly on the quality of statistics. In practice, update statistics regularly, validate the execution plans of complex queries, and, when necessary, use optimizer hints (e.g., `/*+ INDEX(...) */`) or plan-guidance techniques (such as SQL Plan Baselines) to correct estimation bias.