Optimization Techniques for Correlated Subqueries in Database Query Optimization
1. Knowledge Point Description
A correlated subquery is a query where the execution of the subquery depends on each row of data from the outer query. Unlike ordinary subqueries, a correlated subquery needs to be executed once for every row of the outer query. This "nested loop" execution pattern often leads to poor performance. Optimizing correlated subqueries is a crucial topic in database query optimization.
2. Problem Demonstration
Let's first look at a typical example of a correlated subquery:
SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e1.department_id -- Correlation condition
);
This query aims to find employees whose salary is higher than the average salary of their respective departments.
3. Execution Process Analysis
-
Original Execution Method (Poor performance):
- Iterate through each row e1 of the outer
employeestable. - For each row e1, execute the subquery: calculate the average salary of e1's department.
- Compare e1's salary with this average.
- This N+1 query pattern performs extremely poorly with large datasets.
- Iterate through each row e1 of the outer
-
Performance Bottlenecks:
- The subquery is executed N times (N being the number of employees).
- Each subquery execution requires a full table scan or index lookup.
- A large amount of context switching and repeated calculations.
4. Detailed Optimization Techniques
Technique 1: Rewrite Using a Derived Table
SELECT e.employee_name, e.salary
FROM employees e
INNER JOIN (
SELECT department_id, AVG(salary) as avg_salary
FROM employees
GROUP BY department_id
) dept_avg ON e.department_id = dept_avg.department_id
WHERE e.salary > dept_avg.avg_salary;
Optimization Principle:
- Transform the correlated subquery into a non-correlated subquery.
- Calculate the average salary for all departments first (executed only once).
- Use a JOIN operation for correlation, avoiding N subquery executions.
Execution Steps:
- First, execute the derived table query: calculate the average salary per department.
- Materialize the result as a temporary table
dept_avg. - Execute the main query: JOIN the
employeestable with thedept_avgtable. - Apply the filter condition.
Technique 2: Using Window Functions (Modern Database Support)
SELECT employee_name, salary
FROM (
SELECT employee_name, salary, department_id,
AVG(salary) OVER (PARTITION BY department_id) as dept_avg_salary
FROM employees
) emp_with_avg
WHERE salary > dept_avg_salary;
Optimization Principle:
- Use a window function to calculate the department average salary for all rows in one pass.
- Avoid multiple queries and JOIN operations.
- Often more efficient than the derived table approach.
Technique 3: Using CTE (Common Table Expression)
WITH dept_salaries AS (
SELECT department_id, AVG(salary) as avg_salary
FROM employees
GROUP BY department_id
)
SELECT e.employee_name, e.salary
FROM employees e
INNER JOIN dept_salaries ds ON e.department_id = ds.department_id
WHERE e.salary > ds.avg_salary;
5. Optimization Effectiveness Comparison
Factors for Performance Improvement:
- Reduced Query Count: Decreased from O(N) times to O(1) time.
- Avoided Repeated Calculations: Each department's average salary is calculated only once.
- Better Execution Plan: The optimizer can choose more efficient JOIN algorithms.
- Index Utilization: The derived table approach can better leverage indexes.
Execution Plan Differences:
- Original method: Nested Loop × Filter.
- Optimized method: Hash Join / Hash Aggregate or Window Function.
6. Applicable Scenarios and Limitations
Recommended scenarios for optimization:
- When the outer query result set is large.
- When the subquery itself is complex.
- For report queries that need to process large amounts of data.
Points to Note:
- Ensure that the GROUP BY fields have appropriate indexes.
- Derived tables may create temporary tables; be mindful of memory usage.
- Window functions require database version support.
- The effectiveness of different optimization techniques may vary under different data distributions.
7. Practical Recommendations
- Try the window function approach first (if the database supports it).
- Use CTEs for complex subqueries to improve readability.
- Use EXPLAIN to analyze execution plans and choose the optimal solution.
- Create appropriate indexes on correlation conditions.
These optimization techniques can improve the performance of correlated subqueries by several orders of magnitude, with particularly significant effects in big data scenarios.