Principles of Statistics and Cost Estimation in Database Query Optimization
Problem Description
During the database query optimization process, the optimizer needs to select the optimal plan from multiple possible execution plans. This selection relies on Statistics and Cost Estimation. This article details how statistics are collected and stored, how the optimizer uses statistics to estimate query costs, and the principles of common estimation models (such as selectivity and cardinality estimation).
1. The Core Role of Statistics
Statistics are summaries of data distribution within database tables, such as:
- Table-level information: Number of rows, number of data pages;
- Column-level information: Number of distinct values (NDV), proportion of null values, data distribution (histogram);
- Index information: Index levels, number of leaf nodes, etc.
The optimizer uses statistics to estimate the "cost" of a query (e.g., CPU, I/O, memory overhead), thereby comparing the efficiency of different execution plans.
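PostgreSQL, for instance, exposes these summaries through its system catalogs. The following sketch (table and column names are illustrative) shows where the table-level and column-level numbers live:

```sql
-- Table-level statistics: estimated row count and data-page count
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'users';

-- Column-level statistics: NDV, null fraction, and histogram for one column
SELECT attname, n_distinct, null_frac, histogram_bounds
FROM pg_stats
WHERE tablename = 'users' AND attname = 'age';
```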
2. Collection and Storage of Statistics
(1) Collection Methods
- Automatic Collection: Triggered periodically by the database itself (e.g., InnoDB's automatic statistics recalculation in MySQL, PostgreSQL's `autovacuum`/autoanalyze);
- Manual Commands: Such as `ANALYZE TABLE` (MySQL) or `DBMS_STATS.GATHER_TABLE_STATS` (Oracle).
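For reference, the manual commands look roughly like this in each system (schema and table names are placeholders):

```sql
-- MySQL: recompute statistics for a table
ANALYZE TABLE users;

-- PostgreSQL: collect statistics for a table
ANALYZE users;

-- Oracle (SQL*Plus): gather table statistics via the DBMS_STATS package
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP', tabname => 'USERS');
```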
(2) Key Statistical Metrics
- Cardinality: The number of distinct values (NDV) in a column. For example, the NDV for a gender column is typically 2;
- Histogram: Describes data distribution, especially crucial for non-uniform data (e.g., age, income):
- Equal-width histogram: Divides data into buckets by value range, recording the frequency of each bucket;
- Equal-height histogram: Divides data into buckets by frequency, so each bucket contains approximately the same number of rows.
- Multi-column Correlation: Captures dependencies between columns, which is needed to estimate the selectivity of multi-column conditions (e.g., `WHERE city='Beijing' AND salary>10000`).
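The two histogram types can be illustrated directly in SQL. A sketch in PostgreSQL syntax, reusing the `employees`/`salary` names from the example in Section 4:

```sql
-- Equal-width: split the value range [3000, 80000] into 5 equal intervals
-- and count how many rows land in each (bucket sizes will vary)
SELECT width_bucket(salary, 3000, 80000, 5) AS bucket,
       count(*) AS rows_in_bucket
FROM employees
GROUP BY bucket
ORDER BY bucket;

-- Equal-height: split the rows into 5 equally sized groups and report the
-- value range each group covers (boundaries adapt to the data)
SELECT bucket, min(salary) AS lo, max(salary) AS hi, count(*) AS rows_in_bucket
FROM (SELECT salary, ntile(5) OVER (ORDER BY salary) AS bucket
      FROM employees) t
GROUP BY bucket
ORDER BY bucket;
```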
3. Basic Principles of Cost Estimation
(1) Selectivity
Selectivity refers to the proportion of rows remaining after a condition is applied, relative to the total number of rows. For example:
- `WHERE id=10`: If `id` is unique, selectivity ≈ 1/NDV;
- `WHERE age>30`: Determined from the histogram as the proportion of values above 30;
- Multi-condition selectivity: Assuming the conditions are independent, joint selectivity = product of the individual selectivities (actual values may deviate due to column correlations).
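Estimated selectivities can be sanity-checked by measuring the real proportions. A sketch in PostgreSQL syntax (table and column names are illustrative):

```sql
-- Measured selectivity of a single predicate: matching rows / total rows
SELECT count(*) FILTER (WHERE age > 30)::numeric / count(*) AS selectivity
FROM users;

-- Measured joint selectivity, for comparison with the independence assumption
SELECT count(*) FILTER (WHERE city = 'Beijing' AND salary > 10000)::numeric
       / count(*) AS joint_selectivity
FROM users;
```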
(2) Cardinality Estimation
Estimated number of rows in query result = Total table rows × Selectivity. For example:
- Table `users` has 10,000 rows and column `status` has NDV = 4 with values uniformly distributed, so the estimated cardinality for `WHERE status=1` ≈ 10,000 × 1/4 = 2,500 rows.
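The same arithmetic can be reproduced from the catalog statistics. A sketch in PostgreSQL syntax, assuming the `users`/`status` names above:

```sql
-- Estimated rows for WHERE status = 1: total rows × 1/NDV.
-- Caveat: pg_stats may store a large NDV as a negative fraction of the row
-- count; this sketch assumes the simple positive form.
SELECT c.reltuples / s.n_distinct AS estimated_rows
FROM pg_class c
JOIN pg_stats s ON s.tablename = c.relname
WHERE c.relname = 'users' AND s.attname = 'status';
```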
(3) Cost Model
The optimizer converts operations (e.g., index scan, full table scan) into cost units:
- I/O Cost: The cost of reading data pages (related to the number of pages);
- CPU Cost: The cost of processing rows (e.g., comparisons, sorting).
Total Cost = I/O Cost + CPU Cost.
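PostgreSQL's sequential-scan cost has exactly this shape. A simplified reconstruction (1.0 and 0.01 are the default values of `seq_page_cost` and `cpu_tuple_cost`; per-predicate CPU costs are omitted):

```sql
-- Full table scan cost ≈ pages × seq_page_cost + rows × cpu_tuple_cost
SELECT relname,
       relpages * 1.0   AS io_cost,
       reltuples * 0.01 AS cpu_cost,
       relpages * 1.0 + reltuples * 0.01 AS total_cost
FROM pg_class
WHERE relname = 'users';
```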
4. Practical Example Analysis
Assume the table `employees` has 10,000 rows, with the following statistics:
- Column `department` has NDV = 10, and its histogram shows a uniform distribution;
- Column `salary` ranges from a minimum of 3,000 to a maximum of 80,000, with its histogram divided into 5 buckets.
Query:

```sql
SELECT * FROM employees
WHERE department = 'HR' AND salary > 50000;
```
Step 1: Calculate Single-Condition Selectivity
- `department='HR'`: Selectivity ≈ 1/10 = 0.1;
- `salary>50000`: The histogram shows that 20% of the data is above 50,000, so selectivity = 0.2.
Step 2: Joint Selectivity
Assuming conditions are independent, joint selectivity ≈ 0.1 × 0.2 = 0.02.
Step 3: Cardinality Estimation
Estimated row count = 10,000 × 0.02 = 200 rows.
Step 4: Compare Execution Plan Costs
- Full Table Scan: Cost ≈ I/O for reading 10,000 rows + CPU for filtering 10,000 rows;
- Index Scan (assuming a composite index on `(department, salary)`): Cost ≈ I/O for the index range scan + I/O for the table lookups of the ~200 matching rows.
The optimizer will select the plan with the lower cost.
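Whether the optimizer's choice and its estimate of roughly 200 rows hold up can be checked with `EXPLAIN` (PostgreSQL syntax shown; MySQL and Oracle have equivalents):

```sql
-- Show the chosen plan with its estimated cost and row count
EXPLAIN SELECT * FROM employees
WHERE department = 'HR' AND salary > 50000;

-- EXPLAIN ANALYZE also executes the query and reports actual row counts,
-- making estimation errors visible
EXPLAIN ANALYZE SELECT * FROM employees
WHERE department = 'HR' AND salary > 50000;
```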
5. Common Issues and Optimization of Statistics
- Outdated Statistics: Statistics are not refreshed promptly after frequent data updates, leading to estimation deviations (e.g., actual row count is much higher than estimated);
- Data Correlation Bias: For example, `country='China' AND city='Beijing'` are strongly correlated, so the independence assumption underestimates the joint selectivity;
- Solutions:
- Regularly update statistics (especially after large-scale data changes);
- Use dynamic sampling (e.g., Oracle's dynamic statistics) to supplement statistics in real-time;
- Create extended statistics (e.g., Oracle's column groups) to capture multi-column correlations.
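As an example of the last point, PostgreSQL offers extended statistics for the same purpose as Oracle's column groups. A sketch with illustrative names:

```sql
-- Capture the functional dependency between country and city so the
-- optimizer stops multiplying their selectivities as if independent
CREATE STATISTICS addr_country_city (dependencies) ON country, city
FROM addresses;
ANALYZE addresses;  -- extended statistics are populated on the next ANALYZE
```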
Summary
Statistics and cost estimation are the "eyes" of the query optimizer. Their accuracy directly determines the quality of execution plans. Understanding principles such as histograms, selectivity, and cardinality estimation helps improve query performance by optimizing statistics collection strategies or using hints to influence optimizer decisions.