Statistics and Cost Estimation in Database Query Optimization

Description
During the database query optimization process, the optimizer needs to select an "optimal" execution plan from multiple possible plans. This selection is not a random guess but is based on a rigorous "cost estimation" model. The core input to this model is "statistics": metadata describing the distribution characteristics of the data in tables. This topic explains in detail how statistics are collected and stored, how the optimizer uses this information to estimate the cost of different operations (such as scans and joins), and ultimately how it selects an execution plan.

I. The Role and Content of Statistics
Statistics are the database's quantitative description of the characteristics of table data, akin to a "map" for the "terrain." Without them, the optimizer cannot estimate the amount of data a query needs to process, which may lead to inefficient plans such as choosing a full table scan over an index scan. Key statistics include:

  1. Table-level Statistics: Total number of rows (cardinality) and total number of data pages (table size).
  2. Column-level Statistics:
    • Number of Distinct Values (NDV).
    • Proportion of NULL values.
    • Data distribution histogram (e.g., equi-height histogram): Divides data into multiple buckets by value range, with each bucket recording its value range, frequency, etc., used to estimate the selectivity of range queries (see the sketch after this list).
  3. Index Statistics: Index levels (height), number of leaf nodes, etc.
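
To make the column-level statistics concrete, here is a minimal Python sketch. The helper names column_stats and selectivity_gt are hypothetical (not any real DBMS API); it computes the NDV, the NULL fraction, and an equi-height histogram, then uses the histogram to estimate the selectivity of a range predicate:

    def column_stats(values, num_buckets=4):
        # Column-level statistics: NDV, NULL fraction, and an equi-height histogram.
        non_null = sorted(v for v in values if v is not None)
        n = len(non_null)
        return {
            "ndv": len(set(non_null)),
            "null_frac": (len(values) - n) / len(values),
            # Upper bound of each equi-height bucket; every bucket holds ~n/num_buckets rows.
            "bounds": [non_null[(i + 1) * n // num_buckets - 1] for i in range(num_buckets)],
            "rows_per_bucket": n / num_buckets,
        }

    def selectivity_gt(stats, x, total_rows):
        # Estimate selectivity of "col > x": buckets whose upper bound exceeds x
        # are assumed to qualify entirely (real optimizers also interpolate
        # within the boundary bucket).
        qualifying = sum(1 for b in stats["bounds"] if b > x)
        return qualifying * stats["rows_per_bucket"] / total_rows

    ages = [18, 22, 25, 28, 31, 33, 35, 40, 45, 50, 60, None]
    s = column_stats(ages)
    print(s["ndv"], round(s["null_frac"], 3), s["bounds"])   # 11 0.083 [22, 31, 40, 60]
    print(round(selectivity_gt(s, 30, len(ages)), 2))        # 0.69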

II. Collection and Updating of Statistics
Statistics need to be updated regularly to maintain accuracy. Common methods include:

  • Automatic Collection: The database automatically triggers collection during idle periods or when data changes exceed a threshold (e.g., 10% of rows modified).
  • Manual Commands: Such as ANALYZE TABLE in MySQL, VACUUM ANALYZE in PostgreSQL.
  • Dynamic Sampling: When statistics are missing or outdated, the optimizer may temporarily estimate by randomly sampling a small amount of data.
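
The idea behind dynamic sampling can be illustrated with a small sketch; the function sampled_selectivity and the synthetic users table are assumptions for illustration, not a real optimizer implementation. It samples a handful of rows at random, evaluates the predicate on the sample, and extrapolates:

    import random

    def sampled_selectivity(table_rows, predicate, sample_size=100, seed=42):
        # With no usable statistics, estimate selectivity from a small random
        # sample and extrapolate to the whole table. Small samples give noisy
        # estimates, which is why sampling is only a fallback for missing or
        # stale statistics.
        random.seed(seed)
        sample = random.sample(table_rows, min(sample_size, len(table_rows)))
        matching = sum(1 for row in sample if predicate(row))
        return matching / len(sample)

    users = [{"age": a} for a in range(18, 68)] * 20    # 1000 synthetic rows, ages 18-67
    sel = sampled_selectivity(users, lambda r: r["age"] > 30)
    print(round(sel, 2), "=> estimated matching rows:", int(sel * len(users)))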

III. Core Steps of Cost Estimation
The optimizer calculates the total cost (usually measured in I/O operations or CPU time) by combining cost models for different operations. Taking the query SELECT * FROM users WHERE age > 30 as an example:

  1. Estimate Selectivity:
    • Selectivity = Number of rows satisfying the condition / Total rows, reflecting the query's filtering capability.
    • If the histogram for age shows that 40% of the data is over 30, then selectivity = 0.4.
  2. Calculate Cardinality:
    • Cardinality = Total rows × Selectivity. If the table has 1000 rows, then cardinality = 1000 × 0.4 = 400 rows.
  3. Calculate Operation Cost:
    • Full Table Scan Cost: Proportional to the number of data pages. With 1000 rows and about 100 rows per page, the table occupies roughly 10 pages, so a full scan costs about 10 I/O operations.
    • Index Scan Cost = Index traversal cost + Table lookup cost. If the index height = 3, about 3 I/O operations are needed to reach the leaf level; for an unclustered index, each of the 400 matching rows may require a separate page read in the worst case, so the total cost ≈ 3 + 400 = 403 I/O operations.
    • After comparison, the optimizer chooses the full table scan (10 I/O operations < 403 I/O operations); the sketch after this list reproduces the arithmetic.
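
A minimal sketch of this comparison, using the same assumed constants (1000 rows, 100 rows per page, selectivity 0.4, index height 3). Real optimizers additionally weight CPU cost, sequential versus random I/O, caching, and index clustering:

    import math

    def full_scan_cost(total_rows, rows_per_page):
        # Read every data page once, sequentially.
        return math.ceil(total_rows / rows_per_page)

    def index_scan_cost(total_rows, selectivity, index_height):
        # Descend the B-tree to the leaf level, then fetch each matching row;
        # worst case for an unclustered index is one page read per matching row.
        matching_rows = round(total_rows * selectivity)
        return index_height + matching_rows

    total_rows, rows_per_page, selectivity, index_height = 1000, 100, 0.4, 3
    full = full_scan_cost(total_rows, rows_per_page)               # 10 I/Os
    idx = index_scan_cost(total_rows, selectivity, index_height)   # 3 + 400 = 403 I/Os
    print("full scan:", full, "index scan:", idx,
          "=> choose", "full scan" if full <= idx else "index scan")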

IV. Cost Estimation for Multi-Table Join Queries
Taking a two-table join SELECT * FROM A JOIN B ON A.id = B.id as an example:

  1. Estimate Join Result Cardinality:
    • Suppose A has 1000 rows and B has 2000 rows, and that the NDV of id is 500 in A and 1000 in B.
    • Assuming uniform data distribution, join cardinality ≈ (Rows in A × Rows in B) / max(NDV_A, NDV_B) = (1000 × 2000) / 1000 = 2000 rows.
  2. Compare Join Algorithm Costs:
    • Nested Loop Join: Suitable for small tables driving large tables. Cost = Outer table scan cost + (Number of outer table rows × Inner table scan cost).
    • Hash Join: Builds a hash table on the smaller input and probes it with the rows of the other; cost is roughly linear in the size of both tables when the hash table fits in memory.
    • Sort-Merge Join: Sorts both inputs on the join key (cost on the order of N log N per input) and then merges them in a single pass; cheaper when an input is already sorted, e.g., delivered by an index.
    • The optimizer selects the algorithm with the lowest estimated cost; the sketch after this list compares the three formulas on the example above.
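
A sketch of these estimates, with assumed page counts (A occupies 10 pages, B occupies 20 pages) and deliberately simplified textbook formulas; real cost models also account for available memory, sort spills to disk, and buffering:

    import math

    def join_cardinality(rows_a, rows_b, ndv_a, ndv_b):
        # Equi-join estimate under the uniformity assumption:
        # |A join B| ≈ rows_a * rows_b / max(ndv_a, ndv_b).
        return rows_a * rows_b / max(ndv_a, ndv_b)

    def nested_loop_cost(pages_outer, rows_outer, pages_inner):
        # Scan the outer table once, then scan the inner table once per outer row.
        return pages_outer + rows_outer * pages_inner

    def hash_join_cost(pages_build, pages_probe):
        # Build a hash table on the smaller input, probe with the other: one pass over each.
        return pages_build + pages_probe

    def sort_merge_cost(pages_a, pages_b):
        # Sort both inputs (~ N log N page accesses each), then merge in one pass.
        def sort(p):
            return p * max(1, math.ceil(math.log2(p)))
        return sort(pages_a) + sort(pages_b) + pages_a + pages_b

    print(join_cardinality(1000, 2000, 500, 1000))                   # 2000.0 rows
    pages_a, pages_b = 10, 20                                        # assumed page counts
    print("nested loop:", nested_loop_cost(pages_a, 1000, pages_b))  # 10 + 1000*20 = 20010
    print("hash join:  ", hash_join_cost(pages_a, pages_b))          # 30
    print("sort-merge: ", sort_merge_cost(pages_a, pages_b))         # 40 + 100 + 30 = 170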

V. Optimization Pitfalls Due to Inaccurate Statistics
If statistics are outdated (e.g., a large number of rows with age > 30 are inserted but the histogram is not refreshed), the optimizer may underestimate the cardinality, incorrectly choose an index scan, and cause performance degradation. In such cases, manually updating the statistics or hinting the optimizer (e.g., FORCE INDEX in MySQL) becomes necessary; the sketch below shows how a stale estimate can flip the plan choice.
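
To see how a stale histogram misleads the cost model, the following sketch reuses the simple cost model from Section III with hypothetical numbers: the stale histogram claims only 0.5% of rows satisfy age > 30, while after the bulk insert the true fraction is 40%.

    import math

    # Same simple cost model as in Section III.
    def full_scan_cost(total_rows, rows_per_page=100):
        return math.ceil(total_rows / rows_per_page)

    def index_scan_cost(total_rows, selectivity, index_height=3):
        return index_height + round(total_rows * selectivity)

    total_rows = 1000
    stale_sel = 0.005   # histogram collected before the bulk insert: 0.5% of rows
    true_sel = 0.4      # actual fraction after the insert (row-count change ignored for simplicity)

    plans = {"full scan": full_scan_cost(total_rows),
             "index scan": index_scan_cost(total_rows, stale_sel)}
    chosen = min(plans, key=plans.get)
    print("optimizer picks:", chosen, "at estimated cost", plans[chosen])   # index scan, 8 I/Os
    print("actual index scan cost:", index_scan_cost(total_rows, true_sel)) # 403 I/Os vs. 10 for a full scan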

Summary
Statistics are the "eyes" of query optimization, and cost estimation is the "decision-making brain." By quantifying data distribution through techniques like histograms, the optimizer can simulate the resource consumption of different execution plans and ultimately select an efficient path. In practice, the freshness of statistics should be monitored continuously, with manual intervention and adjustment when necessary.