Principles of SQL Query Optimizer Operation

Principles of SQL Query Optimizer Operation

Problem Description
The SQL query optimizer is a core component of a database management system, responsible for transforming user-submitted SQL statements into an efficient, executable query plan. It analyzes multiple possible execution paths for a query and selects the optimal scheme based on cost estimation, thereby greatly improving query performance. Understanding its working principles is fundamental for efficient SQL programming and database tuning.

Solution Process

Step 1: Understand the Basic Goals and Stages of the Optimizer

Core Goal: The optimizer's sole objective is not to make the query execute "correctly" (this is the parser's task), but to find the "lowest-cost" option among all possible execution plans. Here, "cost" is a comprehensive metric, primarily including disk I/O (number of data page reads), CPU usage (computational load for processing data), and memory usage.
Two Main Phases: The optimization process is typically divided into two stages: logical optimization and physical optimization.
- Logical Optimization: Based on relational algebra and heuristic rules, it performs equivalent transformations on the query's syntax tree to generate a logically equivalent but potentially more efficient query structure. It does not concern itself with the specific storage method of the data.
- Physical Optimization: For the logically optimized query plan, it explores specific physical execution algorithms (e.g., which join algorithm to use, which index) and operation order, estimates the cost of each plan, and ultimately selects the plan with the lowest cost.

Step 2: Delve into Logical Optimization - Query Rewriting

Logical optimization is like a "sentence doctor," "conditioning" the original SQL statement to make its structure healthier. It is based on a series of rules; here are some key rewriting techniques:

Subquery Optimization:
- Scenario: Transforming correlated subqueries (where the subquery depends on values from the outer query) or inefficient subqueries into more efficient JOIN operations.
- Example: SELECT * FROM users WHERE id IN (SELECT user_id FROM orders). The optimizer attempts to rewrite it as SELECT users.* FROM users JOIN orders ON users.id = orders.user_id. JOIN operations are generally much more efficient than row-by-row subquery execution.
Predicate Pushdown:
- Scenario: Filtering data as early as possible to reduce the amount of data that subsequent operations need to handle.
- Example: For the query SELECT * FROM A JOIN B ON A.id = B.aid WHERE A.value > 100. The optimizer attempts to push the filter condition A.value > 100 down to be executed before the JOIN operation. This way, the amount of data from table A participating in the join is significantly reduced first, thereby lowering the join cost.
Expression Simplification and Constant Folding:
- Scenario: Completing all determinable calculations during the query compilation phase.
- Example: WHERE salary > 10000/12 is simplified to WHERE salary > 8333.33. Always-true conditions like WHERE 1=1 are directly removed.

Step 3: Delve into Physical Optimization - Cost Estimation and Plan Selection

Physical optimization is the "decision engine" of the optimizer, responsible for making choices among numerous feasible schemes.

Generate Candidate Execution Plans:
- The optimizer considers different physical algorithms to implement the same logical operation. The most typical are join algorithms:
  - Nested Loop Join: Suitable when one table is very small or there is an index on the join condition.
  - Hash Join: Typically used for processing large amounts of data, equi-joins without indexes; it first builds a hash table in memory for one table, then probes with the other.
  - Sort-Merge Join: May be chosen when the join condition is non-equi or the data is already sorted.
- The optimizer also considers different operation orders, e.g., (A JOIN B) JOIN C and (A JOIN C) JOIN B are two different sequences.
Cost Estimation:
- To compare the quality of different plans, the optimizer needs to estimate the cost of each plan. This relies on statistics maintained by the database.
- Key Statistics Include:
  - Table Statistics: Number of rows in the table, data page usage.
  - Column Statistics: Number of distinct values, data distribution histogram, count of NULL values, etc.
  - Index Statistics: Index levels, number of leaf nodes, etc.
- Example of Estimation Process: For WHERE age > 30, the optimizer looks at the histogram of the age column to estimate the approximate proportion of records where age is greater than 30 relative to the total, thereby deducing how many data pages need to be read.
Select the Final Plan:
- The optimizer uses a cost model to sum up the costs of each step (table scan, filter, join, sort, etc.) of each candidate plan to obtain the total cost.
- It selects the candidate plan with the lowest estimated cost and compiles it into the final, executable query plan.

Step 4: Practical Case Demonstration

Assume a query: SELECT e.name, d.department_name FROM employees e JOIN departments d ON e.dept_id = d.id WHERE e.salary > 50000;

The optimizer's workflow is as follows:

Logical Optimization:
- Inspects the syntax tree and likely applies predicate pushdown to ensure the filter condition e.salary > 50000 is executed before the join.
Physical Optimization - Generate Candidate Plans:
- Candidate Plan A:
  1. Perform a full table scan on employees, filtering out employees with salary > 50000.
  2. For each resulting employee's dept_id, use a nested loop join to look up the corresponding department name via the primary key index on departments.
- Candidate Plan B:
  1. Perform a full table scan on departments.
  2. Perform a full table scan on employees, filtering out employees with salary > 50000.
  3. Perform a hash join on the two result sets.
Physical Optimization - Cost Estimation and Selection:
- The optimizer checks statistics: Assume the employees table has 10,000 people, but only 200 have high salaries (>50000); the departments table has only 10 departments.
- Estimate Plan A Cost: Scanning the employees table (high cost), but filtering leaves only 200 rows. Then perform 200 index lookups (each cost very low). Total cost might be relatively low.
- Estimate Plan B Cost: Requires full table scans of both tables (high cost), then building a hash table in memory for the 10-row department table, followed by scanning the 200-row employee result. Total cost might be higher than Plan A.
- Decision: Due to the small number of high-salary employees and the existence of a primary key index on departments, the optimizer is highly likely to choose Candidate Plan A.

Summary
The SQL query optimizer is a highly complex system that transforms declarative SQL statements into efficient execution engine instructions through logical rewriting and cost-based physical optimization. As developers, understanding its principles helps us write SQL statements that are more "optimizer-friendly," such as creating appropriate indexes and avoiding function calculations on columns in the WHERE clause (which prevents index use), thereby improving application performance from the ground up.