Subquery Optimization and Rewriting Techniques in Database Query Optimization

Subquery Optimization and Rewriting Techniques in Database Query Optimization

Topic Description
A subquery is a query nested inside another SQL query statement, commonly used for filtering, calculations, or providing data sources. Although subqueries can simplify the expression of complex logic, their execution efficiency is often inferior to equivalent join queries. This topic explores the execution principles of subqueries, common inefficient scenarios, and how to improve performance by rewriting them into join queries or using other techniques.

I. Basic Types and Execution Principles of Subqueries

Scalar Subquery: Returns a single value, typically appears in the SELECT list or WHERE condition
- Example: SELECT name, (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) AS order_count FROM users
- Execution: Executes once for each row of the outer query
Correlated Subquery: The subquery references columns from the outer query, forming a correlation
- Example: SELECT * FROM products p1 WHERE price > (SELECT AVG(price) FROM products p2 WHERE p2.category = p1.category)
- Execution Characteristic: Needs to be executed multiple times (related to the number of rows in the outer query)
Non-Correlated Subquery: A subquery that can be executed independently
- Example: SELECT * FROM employees WHERE department_id IN (SELECT id FROM departments WHERE status = 'active')
- Execution Characteristic: Usually needs to be executed only once, and the result can be cached

II. Common Performance Issues with Subqueries

Repeated Execution Problem: Correlated subqueries can lead to the N+1 query problem
- When the outer table has M rows, the correlated subquery may be executed M times
- Solution: Consider rewriting as a JOIN to reduce query count
Excessive Intermediate Result Set: Some subqueries generate large temporary tables
- Example: WHERE id IN (SELECT ...) If the subquery returns a large amount of data, the IN list becomes huge
- Optimization Direction: Use EXISTS instead of IN, or rewrite as JOIN
Preventing Optimizer from Using Indexes: Certain subquery forms restrict index usage
- Example: WHERE column <> (SELECT ...) may prevent index usage
- Optimization Method: Restructure query logic

III. Practical Subquery Optimization Rewrites

Rewriting IN Subquery to EXISTS
- Original Query: SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE amount > 100)
- Rewrite: SELECT * FROM users u WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.amount > 100)
- Advantage: EXISTS returns immediately upon finding a match, avoiding processing all results
Rewriting Correlated Subquery to JOIN
- Original Query: SELECT e.name, (SELECT d.name FROM departments d WHERE d.id = e.department_id) FROM employees e
- Rewrite: SELECT e.name, d.name FROM employees e LEFT JOIN departments d ON e.department_id = d.id
- Advantage: Reduces query count, leverages JOIN's batch processing
Flattening Multi-level Nested Subqueries
- Original Query (Nested IN): SELECT * FROM table1 WHERE col1 IN (SELECT col2 FROM table2 WHERE col3 IN (SELECT col4 FROM table3))
- Rewrite: SELECT DISTINCT t1.* FROM table1 t1 JOIN table2 t2 ON t1.col1 = t2.col2 JOIN table3 t3 ON t2.col3 = t3.col4
- Advantage: Avoids creating multiple temporary tables, gives the optimizer more choices for join order

IV. Special Case Handling Techniques

Optimizing Aggregate Subqueries
- Original Query: SELECT id, (SELECT MAX(score) FROM scores WHERE user_id = users.id) FROM users
- Rewrite: SELECT u.id, MAX(s.score) FROM users u LEFT JOIN scores s ON u.id = s.user_id GROUP BY u.id
- Note: Needs to handle NULL values
Using NOT EXISTS Instead of NOT IN
- When the subquery might return NULL, NOT IN can yield unexpected results
- Safer Writing: SELECT * FROM table1 WHERE NOT EXISTS (SELECT 1 FROM table2 WHERE table2.col = table1.col)

V. Automatic Subquery Optimization by the Optimizer

Modern database optimizers automatically perform some subquery optimizations:

Subquery Unnesting: Converting correlated subqueries into semi-joins
Materialization: Materializing subquery results into temporary tables
However, not all subqueries can be automatically optimized, requiring manual intervention.

Best Practice Recommendations

Analyze the execution plan to confirm if the subquery is a performance bottleneck.
Prioritize rewriting correlated subqueries using JOIN.
Use EXISTS instead of IN for large datasets.
Appropriately use temporary tables or CTEs to decompose complex subqueries.
Create appropriate indexes for the join conditions of subqueries.