SQL Query Optimizer Working Principle Analysis (Advanced)

SQL Query Optimizer Working Principle Analysis (Advanced)

Today, we delve into the internal workings of the SQL query optimizer, particularly how the Cost-Based Optimizer (CBO) selects the best execution plan.

I. The Core Task of the Optimizer
When a database receives an SQL query, the optimizer must transform the declarative SQL statement into an executable sequence of physical operations. Its core objective is: to find, within a reasonable time, the plan that returns correct results with the lowest resource consumption among millions of possible execution plans.

II. The Three Stages of the Optimization Process

Stage 1: Logical Transformation (Query Rewriting)

Input: Parse tree of the original SQL statement
Process: Apply equivalence transformations based on relational algebra rules
Key Operations:
1. View Merging: Expand and merge view definitions into the main query
```
-- Original: CREATE VIEW v1 AS SELECT * FROM t1 WHERE c1>10;
--        SELECT * FROM v1 JOIN t2 ON v1.id=t2.id;
-- Rewritten: SELECT * FROM t1 JOIN t2 ON t1.id=t2.id WHERE t1.c1>10;
```
1. Predicate Pushdown: Push filtering conditions as close to the data source as possible
2. Subquery Unnesting: Transform correlated subqueries into more efficient JOIN operations

Stage 2: Generate Candidate Execution Plans

Search Space Construction: Consider all possible join orders, access methods, and join algorithms
Example: Three-Table Join Optimization
```
SELECT * FROM t1, t2, t3 
WHERE t1.id=t2.id AND t2.id=t3.id;
```
Possible plans include:
- Join orders: ((t1⨝t2)⨝t3), ((t1⨝t3)⨝t2), ... (6 total)
- Join algorithms: Nested Loop Join, Hash Join, Sort-Merge Join
- Access paths: Full Table Scan, Index Scan, Index Fast Full Scan

Stage 3: Cost Estimation and Plan Selection
This is the most complex decision-making process at the optimizer's core:

1. Cardinality Estimation

Goal: Predict the number of rows output by each operator
Methods:
- Use statistics (histograms, number of distinct values (NDV), data distribution)
- For predicate conditions: Apply selectivity estimation
  - c1 = 100: Selectivity ≈ 1/NDV(c1) (NDV is the number of distinct values)
  - c1 > 100: Use histograms to estimate the proportion

2. Cost Model Calculation

CPU Cost: Overhead of processing each record
I/O Cost: Cost of physical/logical reads
Memory Cost: Memory usage for sort, hash operations
Formula Example: Index Scan Cost = Index Height + Selectivity × Number of Leaf Blocks

3. Dynamic Programming Search
The optimizer uses a bottom-up dynamic programming approach:

Level 0: Estimate access cost for each table
  - t1: Full Table Scan Cost=100, Index Scan Cost=30
  - t2: Full Table Scan Cost=200, Index Scan Cost=50

Level 1: Estimate the optimal cost for two-table joins
  - (t1⨝t2):
    * Nested Loop: 30 + 1000×50 = 50030
    * Hash Join: 30 + 50 + Hash Build Cost = 150
  - (t2⨝t3): ... (similar calculation)

Level 2: Estimate the optimal cost for three-table joins
  - Compare cost of ((t1⨝t2)⨝t3) vs ((t1⨝t3)⨝t2)

III. Practical Case Demonstration

Query Example:

SELECT e.name, d.dname 
FROM employees e 
JOIN departments d ON e.dept_id = d.id 
WHERE e.salary > 50000 AND e.hire_date > '2020-01-01';

Optimizer Decision Process:

Access Path Selection:
- employees table: Has salary index and hire_date index
- Calculate selectivity: salary>50000 selects 20%, hire_date>2020 selects 30%
- Combined selectivity: 0.2×0.3=6%, estimated to return 6000 records
- Compare Index Scan vs Full Table Scan cost
Join Method Selection:
- Nested Loop: Suitable when the driving table's result set is small
- Hash Join: Suitable for equi-joins where both tables are large
- Sort-Merge: Suitable when data is already sorted or sorted output is needed
Final Plan Selection:
- Option 1: employees Index Scan → Hash Join departments (Cost: 350)
- Option 2: employees Full Table Scan → Nested Loop Join (Cost: 420)
- Select the lower-cost Option 1

IV. Optimizer Limitations and Coping Strategies

Common Causes of Estimation Errors:

Data Correlation: e.g., city='Beijing' and country='China' are not independent
Outdated Statistics: Statistics not updated after data distribution changes
Complex Expressions: User-defined functions are difficult to estimate for selectivity

Optimization Suggestions:

Regularly update statistics: ANALYZE TABLE or database automatic jobs
Use query hints to guide the optimizer: /*+ INDEX(e salary_idx) */
Avoid using complex expressions and functions in WHERE clauses

By understanding the working principles of the optimizer, DBAs and developers can write more optimized SQL statements and accurately diagnose root causes during performance issues.