SQL Query Optimizer Working Principle Analysis (Advanced)
Today, we delve into the internal workings of the SQL query optimizer, particularly how the Cost-Based Optimizer (CBO) selects the best execution plan.
I. The Core Task of the Optimizer
When a database receives an SQL query, the optimizer must transform the declarative SQL statement into an executable sequence of physical operations. Its core objective is: to find, within a reasonable time, the plan that returns correct results with the lowest resource consumption among millions of possible execution plans.
II. The Three Stages of the Optimization Process
Stage 1: Logical Transformation (Query Rewriting)
- Input: Parse tree of the original SQL statement
- Process: Apply equivalence transformations based on relational algebra rules
- Key Operations:
- View Merging: Expand and merge view definitions into the main query
-- Original: CREATE VIEW v1 AS SELECT * FROM t1 WHERE c1>10; -- SELECT * FROM v1 JOIN t2 ON v1.id=t2.id; -- Rewritten: SELECT * FROM t1 JOIN t2 ON t1.id=t2.id WHERE t1.c1>10;- Predicate Pushdown: Push filtering conditions as close to the data source as possible
- Subquery Unnesting: Transform correlated subqueries into more efficient JOIN operations
Stage 2: Generate Candidate Execution Plans
- Search Space Construction: Consider all possible join orders, access methods, and join algorithms
- Example: Three-Table Join Optimization
Possible plans include:SELECT * FROM t1, t2, t3 WHERE t1.id=t2.id AND t2.id=t3.id;- Join orders: ((t1⨝t2)⨝t3), ((t1⨝t3)⨝t2), ... (6 total)
- Join algorithms: Nested Loop Join, Hash Join, Sort-Merge Join
- Access paths: Full Table Scan, Index Scan, Index Fast Full Scan
Stage 3: Cost Estimation and Plan Selection
This is the most complex decision-making process at the optimizer's core:
1. Cardinality Estimation
- Goal: Predict the number of rows output by each operator
- Methods:
- Use statistics (histograms, number of distinct values (NDV), data distribution)
- For predicate conditions: Apply selectivity estimation
c1 = 100: Selectivity ≈ 1/NDV(c1) (NDV is the number of distinct values)c1 > 100: Use histograms to estimate the proportion
2. Cost Model Calculation
- CPU Cost: Overhead of processing each record
- I/O Cost: Cost of physical/logical reads
- Memory Cost: Memory usage for sort, hash operations
- Formula Example: Index Scan Cost = Index Height + Selectivity × Number of Leaf Blocks
3. Dynamic Programming Search
The optimizer uses a bottom-up dynamic programming approach:
Level 0: Estimate access cost for each table
- t1: Full Table Scan Cost=100, Index Scan Cost=30
- t2: Full Table Scan Cost=200, Index Scan Cost=50
Level 1: Estimate the optimal cost for two-table joins
- (t1⨝t2):
* Nested Loop: 30 + 1000×50 = 50030
* Hash Join: 30 + 50 + Hash Build Cost = 150
- (t2⨝t3): ... (similar calculation)
Level 2: Estimate the optimal cost for three-table joins
- Compare cost of ((t1⨝t2)⨝t3) vs ((t1⨝t3)⨝t2)
III. Practical Case Demonstration
Query Example:
SELECT e.name, d.dname
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000 AND e.hire_date > '2020-01-01';
Optimizer Decision Process:
-
Access Path Selection:
- employees table: Has salary index and hire_date index
- Calculate selectivity: salary>50000 selects 20%, hire_date>2020 selects 30%
- Combined selectivity: 0.2×0.3=6%, estimated to return 6000 records
- Compare Index Scan vs Full Table Scan cost
-
Join Method Selection:
- Nested Loop: Suitable when the driving table's result set is small
- Hash Join: Suitable for equi-joins where both tables are large
- Sort-Merge: Suitable when data is already sorted or sorted output is needed
-
Final Plan Selection:
- Option 1: employees Index Scan → Hash Join departments (Cost: 350)
- Option 2: employees Full Table Scan → Nested Loop Join (Cost: 420)
- Select the lower-cost Option 1
IV. Optimizer Limitations and Coping Strategies
Common Causes of Estimation Errors:
- Data Correlation: e.g., city='Beijing' and country='China' are not independent
- Outdated Statistics: Statistics not updated after data distribution changes
- Complex Expressions: User-defined functions are difficult to estimate for selectivity
Optimization Suggestions:
- Regularly update statistics:
ANALYZE TABLEor database automatic jobs - Use query hints to guide the optimizer:
/*+ INDEX(e salary_idx) */ - Avoid using complex expressions and functions in WHERE clauses
By understanding the working principles of the optimizer, DBAs and developers can write more optimized SQL statements and accurately diagnose root causes during performance issues.