Statistical Information and Cost Estimation in Database Query Optimization
Topic Description:
The core task of a database query optimizer is to select the optimal execution plan for an SQL query, and statistical information along with cost estimation are the cornerstones of this process. This topic will delve into how statistical information is collected and maintained, how the optimizer uses this information to estimate query cost, and the causes and coping strategies for common estimation errors.
Solution Process:
1. The Core Role of Statistical Information
- Definition: Statistical information is metadata describing the distribution characteristics of data in tables, such as the total number of rows in a table, column cardinality (number of distinct values), data distribution histograms, etc.
- Importance: The optimizer relies on statistical information to estimate the size of the query result set (cardinality) and calculate the cost (e.g., CPU, I/O overhead) of different execution plans, thereby selecting the plan with the lowest cost.
- Example: For a query with `WHERE age > 30`, if the optimizer knows from a histogram that 50% of the rows have an age above 30, it can estimate that roughly half of the rows will match and then compare the cost of an index scan against a full table scan, as sketched below.
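As an illustration of that decision, `EXPLAIN` shows the row estimate the planner derives from such a histogram; the `employees` table and all numbers below are hypothetical:

```sql
-- PostgreSQL: ask the planner for its estimate without running the query
EXPLAIN SELECT * FROM employees WHERE age > 30;

-- The output looks roughly like (numbers depend entirely on the table):
--   Seq Scan on employees  (cost=0.00..2334.00 rows=50000 width=72)
--     Filter: (age > 30)
-- "rows=50000" is the cardinality estimate (~50% of a 100000-row table);
-- the planner compares this plan's cost against an index-scan alternative.
```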
2. Collection and Maintenance of Statistical Information
- Automatic Collection: Modern databases (such as MySQL/PostgreSQL) support periodic automatic collection of statistical information by sampling data pages to update metadata.
- Manual Trigger: Executing `ANALYZE TABLE` (MySQL) or `VACUUM ANALYZE` (PostgreSQL) can force an update of statistical information (see the sketch after this list).
- Update Strategy: Re-collection is triggered when data changes exceed a threshold (e.g., 10% of rows modified) to prevent plan degradation caused by stale statistics.
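A minimal sketch of both bullets above; the `orders` table and the 5% scale factor are illustrative, while the statements themselves are standard MySQL and PostgreSQL syntax:

```sql
-- MySQL: recompute the table's statistics on demand
ANALYZE TABLE orders;

-- PostgreSQL: reclaim dead space and refresh planner statistics
VACUUM ANALYZE orders;          -- plain "ANALYZE orders;" refreshes statistics only

-- PostgreSQL: tighten the auto-analyze threshold for a hot table so statistics
-- are refreshed after ~5% of rows change instead of the 10% default
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.05);
```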
3. Basic Principles of Cost Estimation
- Cost Model: The optimizer converts operations (such as scans, joins, sorts) into a weighted sum of CPU, memory, and I/O costs; a worked sketch with PostgreSQL's default cost constants follows this list. For example:
- Sequential Scan Cost: Number of table data pages × Single-page I/O cost
- Index Scan Cost: Index height × Single-page I/O cost + Number of matching rows × Data page access cost
- Cardinality Estimation: Use statistical information to estimate the number of rows that survive the filter conditions (a sketch of PostgreSQL's per-column statistics also follows this list). For example:
- Equality Query (`age = 25`): If the age column has low cardinality (high value repetition), estimated row count = total rows / column cardinality.
- Range Query (`salary > 5000`): Use the histogram to estimate the fraction of values falling within the interval.
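A rough illustration of the cost formulas above, using PostgreSQL's planner constants; the table size and the arithmetic in the comments are made-up numbers:

```sql
-- Unit costs the planner plugs into its formulas (PostgreSQL defaults in comments)
SHOW seq_page_cost;     -- 1.0  : one sequentially read page
SHOW random_page_cost;  -- 4.0  : one randomly read page (typical for index access)
SHOW cpu_tuple_cost;    -- 0.01 : processing one row

-- Hypothetical table: 1000 pages, 100000 rows.
-- Sequential scan  ≈ 1000 * 1.0 (I/O) + 100000 * 0.01 (CPU) = 2000
-- Index scan for ~100 matching rows ≈ 3 index pages * 4.0
--                                    + 100 rows * (4.0 + 0.01) ≈ 413
-- Which plan wins therefore hinges on the estimated number of matching rows.
```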
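And a sketch of where those per-column statistics live in PostgreSQL (the table and figures are hypothetical); the comments mirror the two estimation rules above:

```sql
-- Per-column statistics gathered by ANALYZE are exposed in the pg_stats view
SELECT n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'employees' AND attname = 'age';

-- Equality (age = 25), assuming 25 is not a tracked most-common value:
--   estimated rows = total_rows / n_distinct, e.g. 100000 / 50 = 2000
-- Range (salary > 5000):
--   estimated rows = total_rows * (fraction of the histogram lying above 5000)
```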
4. Common Estimation Errors and Optimization Strategies
- Causes of Errors:
- Data Skew: Histograms cannot fully capture extreme distributions (e.g., 90% of data concentrated in a certain interval).
- Multi-column Correlation: The optimizer assumes `age` and `city` are statistically independent, so for a query like `age > 30 AND city = 'Beijing'` it may underestimate the result set if the two columns are actually highly correlated.
- Countermeasures:
- Extended Statistics: Collect statistical information for multi-column combinations (e.g., PostgreSQL's `CREATE STATISTICS`); a sketch follows this list.
- Dynamic Sampling: Sample a small amount of data at query time for complex conditions to calibrate estimates (e.g., Oracle's dynamic sampling feature).
- Query Hints: Manually specify indexes or join methods (e.g., `FORCE INDEX`), but use with caution.
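A minimal sketch of the extended-statistics and hint countermeasures; the object and index names are made up, `CREATE STATISTICS` is PostgreSQL syntax (version 10+), and `FORCE INDEX` is MySQL:

```sql
-- PostgreSQL: record the dependency between the correlated columns so the
-- planner stops multiplying their selectivities as if they were independent
CREATE STATISTICS people_age_city_stats (dependencies) ON age, city FROM people;
ANALYZE people;   -- populates the new statistics object

-- MySQL: as a last resort, force a specific index when estimates stay wrong
SELECT * FROM people FORCE INDEX (idx_people_city)
WHERE age > 30 AND city = 'Beijing';
```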
5. Practical Case Analysis
- Scenario: Query `SELECT * FROM orders WHERE user_id = 100 AND status = 'shipped'`.
- Optimizer Decision Process:
  - Check individual statistics for `user_id` and `status`: `user_id` has high cardinality (assuming 1 million users), so few matching rows are estimated; `status` has low cardinality (only a handful of states), so the proportion of `'shipped'` may be high.
  - Without multi-column statistics, the optimizer may incorrectly assume the two conditions are independent and underestimate the actual number of matching rows (e.g., if that user happens to have mostly shipped orders).
  - If the estimated row count is low, the optimizer chooses an index scan; if it is high, it may choose a full table scan instead.
- Tuning Method: Create a composite index on `(user_id, status)` and collect extended statistics to improve estimation accuracy.
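A sketch of that tuning step (index and statistics object names are illustrative; the extended-statistics statement is PostgreSQL syntax):

```sql
-- Composite index so both predicates can be resolved in one index lookup
CREATE INDEX idx_orders_user_status ON orders (user_id, status);

-- Extended statistics so the planner learns status is not independent of user_id
CREATE STATISTICS orders_user_status_stats (dependencies) ON user_id, status FROM orders;
ANALYZE orders;

-- Check that the row estimate and the chosen plan have improved
EXPLAIN SELECT * FROM orders WHERE user_id = 100 AND status = 'shipped';
```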
Summary:
The accuracy of statistical information and the soundness of cost estimation directly determine query performance. By maintaining statistics regularly, using extended statistics to capture column correlations, and adjusting based on actual execution feedback (e.g., slow query logs), optimizer misjudgments can largely be avoided, improving overall database efficiency.