Statistics and Cost Estimation in Database Query Optimization

Statistics and Cost Estimation in Database Query Optimization

Problem Description
During the database query optimization process, the optimizer needs to estimate the cost of different execution plans based on statistics to select the optimal plan. This problem details the types of statistics, collection mechanisms, and how the optimizer uses statistics to estimate query costs (e.g., selectivity, cardinality, cost models), ultimately completing the selection of an execution plan.

I. Role and Types of Statistics

Role: Statistics are metadata describing the distribution characteristics of data in a table. They help the optimizer estimate the amount of data involved in query operations (such as scans, filters, joins), providing a basis for cost estimation.
Core Types:
- Table-level Statistics: Number of rows (num_rows), number of blocks (num_blocks).
- Column-level Statistics:
  - Number of distinct values (num_distinct), number of nulls (num_nulls).
  - Data distribution histograms (equi-width, equi-height): Record frequency distribution of values to address data skew issues.
- Index Statistics: Index levels, number of leaf blocks, clustering factor (the degree of match between index order and table data order).

II. Collection and Maintenance of Statistics

Collection Methods:
- Manual Commands: Such as Oracle's DBMS_STATS.GATHER_TABLE_STATS, MySQL's ANALYZE TABLE.
- Automated Tasks: The database periodically triggers statistics updates (e.g., when the amount of data modification exceeds a threshold).
Update Strategies:
- Full Collection: High accuracy but resource-intensive, suitable for static tables.
- Incremental Collection: Updates only the changed parts, suitable for large tables.
- Sampling Collection: Randomly samples a portion of the data to calculate statistics, balancing efficiency and accuracy.

III. Key Metrics for Cost Estimation

Selectivity:
- Definition: The ratio of the number of rows remaining after applying query filters to the total number of rows.
- Example Calculation Formulas:
  - Equality Query (col = value): If values are uniformly distributed, selectivity ≈ 1 / num_distinct.
  - Range Query (col > value): Uses histograms or assumes uniform distribution to calculate the proportion.
  - Multiple Condition Combinations:
    - AND Condition: Selectivity = product of the selectivity of each condition.
    - OR Condition: Selectivity = 1 - (1-sel1)(1-sel2)...
Cardinality:
- Definition: The estimated number of rows in the result set output by an operator.
- Formula: Cardinality = Table Row Count × Selectivity.
- Example: A table has 1000 rows, the status column has 10 distinct values, the cardinality for query status='active' ≈ 1000 × (1/10) = 100.

IV. Cost Model and Execution Plan Selection

Cost Components:
- I/O Cost: Number of data block reads (affected by index type, clustering factor).
- CPU Cost: Operations like condition evaluation, sorting.
- Network Cost (for distributed databases).
Optimizer Workflow:
- Step 1: Parse SQL, generate a logical plan (e.g., relational algebra tree).
- Step 2: Use statistics to calculate the cost of each physical operation (full table scan, index scan, nested loop join, etc.).
- Step 3: Select the execution plan with the lowest total cost via dynamic programming or heuristic algorithms.
Example Analysis:
- Query: SELECT * FROM orders WHERE customer_id=100 AND amount>50.
- Optimizer Comparison:
  - Full Table Scan Cost: Directly reads all data blocks.
  - Index Scan Cost: If there is an index on customer_id, first locates rows via the index, then filters amount by accessing the table.
  - Selection Basis: If the selectivity for customer_id=100 is low (i.e., few rows satisfy the condition), index scan cost is lower.

V. Optimization Issues Due to Inaccurate Statistics

Common Scenarios:
- Data Skew: Selectivity estimation for equality queries is inaccurate when histograms are missing.
- Correlated Columns: Such as year=2023 AND month=12. If the optimizer assumes conditions are independent, it underestimates selectivity.
Solutions:
- Extended Statistics: Collect statistics for multi-column combinations or expression statistics.
- Dynamic Sampling: Collect sample data in real-time during execution to correct estimates.
- Hints: Force specify an index or join method (use with caution).

Summary
Statistics are the "eyes" of the query optimizer; their accuracy directly determines the reliability of cost estimation. By regularly collecting statistics, understanding data distribution characteristics, and using histograms to address skew problems, execution plan deviations can be effectively avoided, improving query performance.