How Database Query Optimizers Work and Execution Plan Analysis

How Database Query Optimizers Work and Execution Plan Analysis

Problem Description

The database query optimizer is a core component of a Database Management System (DBMS). Its task is to translate a user-submitted SQL query into an efficient execution plan. The optimizer needs to select the lowest-cost option from multiple possible execution paths, while considering factors such as table join order, index usage, and data filtering methods. This problem requires a deep understanding of how the optimizer works, including the difference between logical and physical optimization, and how to analyze the execution plan to determine if the optimizer's decisions are sound.

Knowledge Background

Execution Plan: A series of operational steps (such as full table scans, index scans, join algorithms, etc.) that the database performs to execute a query.
Query Cost Model: The optimizer estimates the consumption of resources like CPU, I/O, and memory to compare the cost of different execution plans.

Detailed Solution Steps

Step 1: Understand the Logical Optimization Phase of the Optimizer

Goal: To rewrite the query into a semantically equivalent form, reducing redundant computations.
Common Operations:
- Predicate Pushdown: Push filtering conditions as close to the data source as possible (e.g., filter data before a join).
- Subquery Flattening/Unnesting: Convert certain subqueries into join operations (e.g., changing an IN subquery to a JOIN).
- Constant Folding: Pre-compute constant parts of expressions (e.g., simplifying WHERE id > 10+5 to WHERE id > 15).

Example:

-- Original Query
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE age > 30);
-- After logical optimization, it might be rewritten as
SELECT orders.* FROM orders JOIN customers ON orders.customer_id = customers.id
WHERE customers.age > 30;

Step 2: Grasp the Core Mechanisms of Physical Optimization

Task: Select specific physical operations for the logical plan (e.g., choosing index types, join algorithms).
Key Decision Points:
- Access Path Selection:
  - Full Table Scan: Suitable for small datasets or when no usable index exists.
  - Index Scan: If the filter condition can leverage an index (e.g., equality query, range query).
- Join Algorithm Selection:
  - Nested Loop Join: Suitable when one side has a small dataset (e.g., small result set from the driving table).
  - Hash Join: Suitable for equality joins when the data can fit entirely into memory.
  - Sort-Merge Join: Suitable for scenarios where data is already sorted or needs sorting.
- Join Order Selection: For multi-table joins, the optimizer tries different order combinations using dynamic programming or greedy algorithms.

Step 3: Learn to Analyze the Execution Plan

Obtaining the Execution Plan:
- MySQL: EXPLAIN FORMAT=TREE SELECT ...
- PostgreSQL: EXPLAIN (ANALYZE, BUFFERS) SELECT ...
- Oracle: EXPLAIN PLAN FOR SELECT ...
Interpreting Key Information:
- Operation Type: Confirm if it's an Index Scan or a Full Table Scan (Seq Scan).
- Row Estimate: Compare the rows (estimated row count) with the actual row count. A large discrepancy may indicate outdated statistics.
- Cost Metrics: Pay attention to the cost value and whether high-cost operations appear (e.g., sorts, temporary tables).
- Extra Information: Note warnings like Using filesort (requires extra sorting) or Using temporary (uses a temporary table).

Step 4: Practice Optimization Through a Case Study

Problem Scenario:

SELECT * FROM users u JOIN orders o ON u.id = o.user_id
WHERE u.city = 'Beijing' AND o.amount > 100;

Optimization Analysis:
1. Check if there are indexes on users.city and orders.amount.
2. Observe the join order: filter users (city condition) first then join with orders, or vice versa?
3. If the execution plan shows a full table scan, consider adding a composite index (e.g., (city, id) for the users table).

Summary

The optimizer's decisions rely on statistics (like table size, data distribution). Statistics need to be updated regularly (e.g., ANALYZE TABLE).
Cost estimation discrepancies in execution plans are common issues. Optimizer choices can be influenced by hints (e.g., FORCE INDEX) or query rewrites.
Complex queries can be simplified by breaking them into temporary tables or using materialized views to reduce optimization difficulty.