Subquery Elimination and Rewriting Techniques in Database Query Optimization
Topic Description
In the process of database query optimization, subqueries (Subqueries) are a common way of writing SQL, but their execution efficiency can be low. The optimizer needs to transform complex subqueries into equivalent and more efficient JOIN operations or other forms. This process is called Subquery Elimination or Subquery Rewriting. This topic requires an in-depth understanding of common types of subqueries, reasons for their inefficiency, and how the optimizer improves query performance through logically equivalent transformations.
Key Knowledge Points
-
Types of Subqueries and Performance Issues
- Correlated Subquery: The subquery depends on column values from the outer query. The subquery is executed once for each row of outer data, leading to performance bottlenecks.
- Non-correlated Subquery: The subquery can be executed independently, but excessive nesting can increase the overhead of temporary result sets.
- Typical inefficient scenarios: Subqueries using
IN,EXISTS,NOT EXISTS, etc., are prone to full table scans or repeated calculations with large data volumes.
-
Core Idea of Subquery Rewriting
Through logically equivalent transformations, subqueries are converted into set operations such as JOIN, Semi-Join, Anti-Join, etc., to reduce the number of query layers and repeated calculations.
Step-by-Step Explanation
Step 1: Analyze Execution Bottlenecks of Subqueries
Example Problem:
SELECT * FROM employees e
WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id);
- Problem Analysis:
This is a typical correlated subquery. For each row in theemployeestable, the subquery needs to calculate the average salary of the department based on the current row'sdept_id. If the table has N rows, the subquery is executed N times, resulting in O(N²) time complexity.
Step 2: Rewrite as a JOIN Operation
- Equivalent Transformation Idea:
Pre-calculate the average salary for each department, then compare it with the employee's salary through a join. - Rewritten SQL:
SELECT e.*
FROM employees e
JOIN (SELECT dept_id, AVG(salary) AS avg_salary
FROM employees GROUP BY dept_id) dept_avg
ON e.dept_id = dept_avg.dept_id
WHERE e.salary > dept_avg.avg_salary;
- Optimization Effect:
The subquery is executed only once, and the comparison is completed in one go via hash join or sort-merge join, reducing the time complexity to O(N).
Step 3: Rewriting EXISTS Subqueries
Example Problem:
SELECT * FROM orders o
WHERE EXISTS (SELECT 1 FROM order_items i WHERE o.order_id = i.order_id AND i.quantity > 10);
- Problem Analysis:
TheEXISTSsubquery needs to check for each order row whether there exists an order item meeting the condition, which is essentially a Semi-Join requirement. - Rewrite as JOIN:
SELECT DISTINCT o.*
FROM orders o
JOIN order_items i ON o.order_id = i.order_id
WHERE i.quantity > 10;
- Notes:
UseDISTINCTto ensure orders are not output repeatedly (as an order may correspond to multiple order items). The optimizer might further optimize this using semi-join algorithms (e.g., MySQL'sSEMI JOIN).
Step 4: Rewriting NOT EXISTS Subqueries
Example Problem:
SELECT * FROM products p
WHERE NOT EXISTS (SELECT 1 FROM inventory i WHERE p.product_id = i.product_id);
- Problem Analysis:
Query for products without inventory, requiring an Anti-Join operation. - Rewrite as LEFT JOIN:
SELECT p.*
FROM products p
LEFT JOIN inventory i ON p.product_id = i.product_id
WHERE i.product_id IS NULL;
- Optimization Principle:
LEFT JOINretains all products, filtering out records with inventory via theNULLcheck. The database might optimize this using hash anti-join algorithms.
Step 5: Decomposition and Merging of Complex Subqueries
Example Problem:
SELECT * FROM customers c
WHERE c.customer_id IN (SELECT customer_id FROM orders WHERE amount > 100)
AND c.customer_id NOT IN (SELECT customer_id FROM returns);
- Step-by-step Rewriting:
- Rewrite the
INsubquery as a JOIN:SELECT DISTINCT c.* FROM customers c JOIN orders o ON c.customer_id = o.customer_id AND o.amount > 100 - Combine with the
NOT EXISTSrewriting technique:SELECT DISTINCT c.* FROM customers c JOIN orders o ON c.customer_id = o.customer_id AND o.amount > 100 LEFT JOIN returns r ON c.customer_id = r.customer_id WHERE r.customer_id IS NULL;
- Rewrite the
Step 6: Automatic Rewriting by the Optimizer
- Modern databases (e.g., Oracle, SQL Server) optimizers automatically attempt subquery rewriting, but note:
- Index Design: Ensure join fields (e.g.,
dept_id,order_id) are indexed. - Statistics Updates: The optimizer relies on statistics to choose between hash join or nested loop join.
- Manual Intervention: Complex queries might require manual rewriting to circumvent optimizer limitations.
- Index Design: Ensure join fields (e.g.,
Summary
The core of subquery elimination is reducing repeated calculations and leveraging the efficiency of set operations. By identifying subquery types and converting them into operations like JOIN, Semi-Join, Anti-Join, etc., query performance can be significantly improved. In practical applications, it is necessary to combine execution plan analysis to ensure the rewritten query genuinely optimizes resource consumption.