Statistical Information and Cost Estimation in Database Query Optimization
Topic Description:
The core task of a database query optimizer is to select the optimal execution plan for an SQL query, and statistical information along with cost estimation are the cornerstones of this process. This topic will delve into how statistical information is collected and maintained, how the optimizer uses this information to estimate query cost, and the causes and coping strategies for common estimation errors.
Solution Process:
1. The Core Role of Statistical Information
- Definition: Statistical information is metadata describing the distribution characteristics of data in tables, such as the total number of rows in a table, column cardinality (number of distinct values), data distribution histograms, etc.
- Importance: The optimizer relies on statistical information to estimate the size of the query result set (cardinality) and calculate the cost (e.g., CPU, I/O overhead) of different execution plans, thereby selecting the plan with the lowest cost.
- Example: For a query with `WHERE age > 30`, if the optimizer knows from a histogram that 50% of the rows have an age above 30, it can estimate that roughly half of the rows will match and then compare the cost of an index scan against a full table scan, as sketched below.
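As an illustration of that decision, `EXPLAIN` shows the row estimate the planner derives from such a histogram; the `employees` table and all numbers below are hypothetical:

```sql
-- PostgreSQL: ask the planner for its estimate without running the query
EXPLAIN SELECT * FROM employees WHERE age > 30;

-- The output looks roughly like (numbers depend entirely on the table):
--   Seq Scan on employees  (cost=0.00..2334.00 rows=50000 width=72)
--     Filter: (age > 30)
-- "rows=50000" is the cardinality estimate (~50% of a 100000-row table);
-- the planner compares this plan's cost against an index-scan alternative.
```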
2. Collection and Maintenance of Statistical Information
- Automatic Collection: Modern databases (such as MySQL/PostgreSQL) support periodic automatic collection of statistical information by sampling data pages to update metadata.
- Manual Trigger: Executing `ANALYZE TABLE` (MySQL) or `VACUUM ANALYZE` (PostgreSQL) can force an update of statistical information (see the sketch after this list).
- Update Strategy: Re-collection is triggered when data changes exceed a threshold (e.g., 10% of rows modified) to prevent plan degradation caused by stale statistics.
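A minimal sketch of both bullets above; the `orders` table and the 5% scale factor are illustrative, while the statements themselves are standard MySQL and PostgreSQL syntax:

```sql
-- MySQL: recompute the table's statistics on demand
ANALYZE TABLE orders;

-- PostgreSQL: reclaim dead space and refresh planner statistics
VACUUM ANALYZE orders;          -- plain "ANALYZE orders;" refreshes statistics only

-- PostgreSQL: tighten the auto-analyze threshold for a hot table so statistics
-- are refreshed after ~5% of rows change instead of the 10% default
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.05);
```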
3. Basic Principles of Cost Estimation
- Cost Model: The optimizer converts operations (such as scans, joins, sorts) into a weighted sum of CPU, memory, and I/O costs; a worked sketch with PostgreSQL's default cost constants follows this list. For example:
- Sequential Scan Cost: Number of table data pages × Single-page I/O cost
- Index Scan Cost: Index height × Single-page I/O cost + Number of matching rows × Data page access cost
- Cardinality Estimation: Use statistical information to estimate the number of rows that survive the filter conditions (a sketch of PostgreSQL's per-column statistics also follows this list). For example:
- Equality Query (`age = 25`): If the age column has low cardinality (high value repetition), estimated row count = total rows / column cardinality.
- Range Query (`salary > 5000`): Use the histogram to estimate the fraction of values falling within the interval.
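A rough illustration of the cost formulas above, using PostgreSQL's planner constants; the table size and the arithmetic in the comments are made-up numbers:

```sql
-- Unit costs the planner plugs into its formulas (PostgreSQL defaults in comments)
SHOW seq_page_cost;     -- 1.0  : one sequentially read page
SHOW random_page_cost;  -- 4.0  : one randomly read page (typical for index access)
SHOW cpu_tuple_cost;    -- 0.01 : processing one row

-- Hypothetical table: 1000 pages, 100000 rows.
-- Sequential scan  ≈ 1000 * 1.0 (I/O) + 100000 * 0.01 (CPU) = 2000
-- Index scan for ~100 matching rows ≈ 3 index pages * 4.0
--                                    + 100 rows * (4.0 + 0.01) ≈ 413
-- Which plan wins therefore hinges on the estimated number of matching rows.
```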
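And a sketch of where those per-column statistics live in PostgreSQL (the table and figures are hypothetical); the comments mirror the two estimation rules above:

```sql
-- Per-column statistics gathered by ANALYZE are exposed in the pg_stats view
SELECT n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'employees' AND attname = 'age';

-- Equality (age = 25), assuming 25 is not a tracked most-common value:
--   estimated rows = total_rows / n_distinct, e.g. 100000 / 50 = 2000
-- Range (salary > 5000):
--   estimated rows = total_rows * (fraction of the histogram lying above 5000)
```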
4. Common Estimation Errors and Optimization Strategies
- Causes of Errors:
- Data Skew: Histograms cannot fully capture extreme distributions (e.g., 90% of data concentrated in a certain interval).
- Multi-column Correlation: The optimizer assumes `age` and `city` are statistically independent, so for a query like `age > 30 AND city = 'Beijing'` it may underestimate the result set if the two columns are actually highly correlated.
- Countermeasures:
- Extended Statistics: Collect statistical information for multi-column combinations (e.g., PostgreSQL's `CREATE STATISTICS`); a sketch follows this list.
- Dynamic Sampling: Sample a small amount of data at query time for complex conditions to calibrate estimates (e.g., Oracle's dynamic sampling feature).
- Query Hints: Manually specify indexes or join methods (e.g., `FORCE INDEX`), but use with caution.
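A minimal sketch of the extended-statistics and hint countermeasures; the object and index names are made up, `CREATE STATISTICS` is PostgreSQL syntax (version 10+), and `FORCE INDEX` is MySQL:

```sql
-- PostgreSQL: record the dependency between the correlated columns so the
-- planner stops multiplying their selectivities as if they were independent
CREATE STATISTICS people_age_city_stats (dependencies) ON age, city FROM people;
ANALYZE people;   -- populates the new statistics object

-- MySQL: as a last resort, force a specific index when estimates stay wrong
SELECT * FROM people FORCE INDEX (idx_people_city)
WHERE age > 30 AND city = 'Beijing';
```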
5. Practical Case Analysis
- Scenario: Query `SELECT * FROM orders WHERE user_id = 100 AND status = 'shipped'`.
- Optimizer Decision Process:
  - Check individual statistics for `user_id` and `status`: `user_id` has high cardinality (assuming 1 million users), so few matching rows are estimated; `status` has low cardinality (only a handful of states), so the proportion of `'shipped'` may be high.
  - Without multi-column statistics, the optimizer may incorrectly assume the two conditions are independent and underestimate the actual number of matching rows (e.g., if that user happens to have mostly shipped orders).
  - If the estimated row count is low, the optimizer chooses an index scan; if it is high, it may choose a full table scan instead.
- Tuning Method: Create a composite index on `(user_id, status)` and collect extended statistics to improve estimation accuracy.
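A sketch of that tuning step (index and statistics object names are illustrative; the extended-statistics statement is PostgreSQL syntax):

```sql
-- Composite index so both predicates can be resolved in one index lookup
CREATE INDEX idx_orders_user_status ON orders (user_id, status);

-- Extended statistics so the planner learns status is not independent of user_id
CREATE STATISTICS orders_user_status_stats (dependencies) ON user_id, status FROM orders;
ANALYZE orders;

-- Check that the row estimate and the chosen plan have improved
EXPLAIN SELECT * FROM orders WHERE user_id = 100 AND status = 'shipped';
```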
Summary:
The accuracy of statistical information and the soundness of cost estimation directly determine query performance. By maintaining statistics regularly, using extended statistics to capture column correlations, and adjusting based on actual execution feedback (e.g., slow query logs), optimizer misjudgments can largely be avoided, improving overall database efficiency.