Subquery Elimination and Rewriting Techniques in Database Query Optimization

Subquery Elimination and Rewriting Techniques in Database Query Optimization

Topic Description
In the process of database query optimization, subqueries (Subqueries) are a common way of writing SQL, but their execution efficiency can be low. The optimizer needs to transform complex subqueries into equivalent and more efficient JOIN operations or other forms. This process is called Subquery Elimination or Subquery Rewriting. This topic requires an in-depth understanding of common types of subqueries, reasons for their inefficiency, and how the optimizer improves query performance through logically equivalent transformations.

Key Knowledge Points

Types of Subqueries and Performance Issues
- Correlated Subquery: The subquery depends on column values from the outer query. The subquery is executed once for each row of outer data, leading to performance bottlenecks.
- Non-correlated Subquery: The subquery can be executed independently, but excessive nesting can increase the overhead of temporary result sets.
- Typical inefficient scenarios: Subqueries using IN, EXISTS, NOT EXISTS, etc., are prone to full table scans or repeated calculations with large data volumes.
Core Idea of Subquery Rewriting
Through logically equivalent transformations, subqueries are converted into set operations such as JOIN, Semi-Join, Anti-Join, etc., to reduce the number of query layers and repeated calculations.

Step-by-Step Explanation

Step 1: Analyze Execution Bottlenecks of Subqueries
Example Problem:

SELECT * FROM employees e 
WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE dept_id = e.dept_id);

Problem Analysis:
This is a typical correlated subquery. For each row in the employees table, the subquery needs to calculate the average salary of the department based on the current row's dept_id. If the table has N rows, the subquery is executed N times, resulting in O(N²) time complexity.

Step 2: Rewrite as a JOIN Operation

Equivalent Transformation Idea:
Pre-calculate the average salary for each department, then compare it with the employee's salary through a join.
Rewritten SQL:

SELECT e.* 
FROM employees e 
JOIN (SELECT dept_id, AVG(salary) AS avg_salary 
      FROM employees GROUP BY dept_id) dept_avg  
ON e.dept_id = dept_avg.dept_id 
WHERE e.salary > dept_avg.avg_salary;

Optimization Effect:
The subquery is executed only once, and the comparison is completed in one go via hash join or sort-merge join, reducing the time complexity to O(N).

Step 3: Rewriting EXISTS Subqueries
Example Problem:

SELECT * FROM orders o 
WHERE EXISTS (SELECT 1 FROM order_items i WHERE o.order_id = i.order_id AND i.quantity > 10);

Problem Analysis:
The EXISTS subquery needs to check for each order row whether there exists an order item meeting the condition, which is essentially a Semi-Join requirement.
Rewrite as JOIN:

SELECT DISTINCT o.* 
FROM orders o 
JOIN order_items i ON o.order_id = i.order_id 
WHERE i.quantity > 10;

Notes:
Use DISTINCT to ensure orders are not output repeatedly (as an order may correspond to multiple order items). The optimizer might further optimize this using semi-join algorithms (e.g., MySQL's SEMI JOIN).

Step 4: Rewriting NOT EXISTS Subqueries
Example Problem:

SELECT * FROM products p 
WHERE NOT EXISTS (SELECT 1 FROM inventory i WHERE p.product_id = i.product_id);

Problem Analysis:
Query for products without inventory, requiring an Anti-Join operation.
Rewrite as LEFT JOIN:

SELECT p.* 
FROM products p 
LEFT JOIN inventory i ON p.product_id = i.product_id 
WHERE i.product_id IS NULL;

Optimization Principle:
LEFT JOIN retains all products, filtering out records with inventory via the NULL check. The database might optimize this using hash anti-join algorithms.

Step 5: Decomposition and Merging of Complex Subqueries
Example Problem:

SELECT * FROM customers c 
WHERE c.customer_id IN (SELECT customer_id FROM orders WHERE amount > 100) 
  AND c.customer_id NOT IN (SELECT customer_id FROM returns);

Step-by-step Rewriting:

Rewrite the IN subquery as a JOIN:

SELECT DISTINCT c.* 
FROM customers c 
JOIN orders o ON c.customer_id = o.customer_id AND o.amount > 100

Combine with the NOT EXISTS rewriting technique:

SELECT DISTINCT c.* 
FROM customers c 
JOIN orders o ON c.customer_id = o.customer_id AND o.amount > 100 
LEFT JOIN returns r ON c.customer_id = r.customer_id 
WHERE r.customer_id IS NULL;

Step 6: Automatic Rewriting by the Optimizer

Modern databases (e.g., Oracle, SQL Server) optimizers automatically attempt subquery rewriting, but note:
- Index Design: Ensure join fields (e.g., dept_id, order_id) are indexed.
- Statistics Updates: The optimizer relies on statistics to choose between hash join or nested loop join.
- Manual Intervention: Complex queries might require manual rewriting to circumvent optimizer limitations.

Summary
The core of subquery elimination is reducing repeated calculations and leveraging the efficiency of set operations. By identifying subquery types and converting them into operations like JOIN, Semi-Join, Anti-Join, etc., query performance can be significantly improved. In practical applications, it is necessary to combine execution plan analysis to ensure the rewritten query genuinely optimizes resource consumption.