Optimization Techniques for Correlated Subqueries in Database Query Optimization

Optimization Techniques for Correlated Subqueries in Database Query Optimization

1. Knowledge Point Description
A correlated subquery is a query where the execution of the subquery depends on each row of data from the outer query. Unlike ordinary subqueries, a correlated subquery needs to be executed once for every row of the outer query. This "nested loop" execution pattern often leads to poor performance. Optimizing correlated subqueries is a crucial topic in database query optimization.

2. Problem Demonstration
Let's first look at a typical example of a correlated subquery:

SELECT e1.employee_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e1.department_id  -- Correlation condition
);

This query aims to find employees whose salary is higher than the average salary of their respective departments.

3. Execution Process Analysis

  1. Original Execution Method (Poor performance):

    • Iterate through each row e1 of the outer employees table.
    • For each row e1, execute the subquery: calculate the average salary of e1's department.
    • Compare e1's salary with this average.
    • This N+1 query pattern performs extremely poorly with large datasets.
  2. Performance Bottlenecks:

    • The subquery is executed N times (N being the number of employees).
    • Each subquery execution requires a full table scan or index lookup.
    • A large amount of context switching and repeated calculations.

4. Detailed Optimization Techniques

Technique 1: Rewrite Using a Derived Table

SELECT e.employee_name, e.salary
FROM employees e
INNER JOIN (
    SELECT department_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department_id
) dept_avg ON e.department_id = dept_avg.department_id
WHERE e.salary > dept_avg.avg_salary;

Optimization Principle:

  • Transform the correlated subquery into a non-correlated subquery.
  • Calculate the average salary for all departments first (executed only once).
  • Use a JOIN operation for correlation, avoiding N subquery executions.

Execution Steps:

  1. First, execute the derived table query: calculate the average salary per department.
  2. Materialize the result as a temporary table dept_avg.
  3. Execute the main query: JOIN the employees table with the dept_avg table.
  4. Apply the filter condition.

Technique 2: Using Window Functions (Modern Database Support)

SELECT employee_name, salary
FROM (
    SELECT employee_name, salary, department_id,
           AVG(salary) OVER (PARTITION BY department_id) as dept_avg_salary
    FROM employees
) emp_with_avg
WHERE salary > dept_avg_salary;

Optimization Principle:

  • Use a window function to calculate the department average salary for all rows in one pass.
  • Avoid multiple queries and JOIN operations.
  • Often more efficient than the derived table approach.

Technique 3: Using CTE (Common Table Expression)

WITH dept_salaries AS (
    SELECT department_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department_id
)
SELECT e.employee_name, e.salary
FROM employees e
INNER JOIN dept_salaries ds ON e.department_id = ds.department_id
WHERE e.salary > ds.avg_salary;

5. Optimization Effectiveness Comparison

Factors for Performance Improvement:

  1. Reduced Query Count: Decreased from O(N) times to O(1) time.
  2. Avoided Repeated Calculations: Each department's average salary is calculated only once.
  3. Better Execution Plan: The optimizer can choose more efficient JOIN algorithms.
  4. Index Utilization: The derived table approach can better leverage indexes.

Execution Plan Differences:

  • Original method: Nested Loop × Filter.
  • Optimized method: Hash Join / Hash Aggregate or Window Function.

6. Applicable Scenarios and Limitations

Recommended scenarios for optimization:

  • When the outer query result set is large.
  • When the subquery itself is complex.
  • For report queries that need to process large amounts of data.

Points to Note:

  1. Ensure that the GROUP BY fields have appropriate indexes.
  2. Derived tables may create temporary tables; be mindful of memory usage.
  3. Window functions require database version support.
  4. The effectiveness of different optimization techniques may vary under different data distributions.

7. Practical Recommendations

  1. Try the window function approach first (if the database supports it).
  2. Use CTEs for complex subqueries to improve readability.
  3. Use EXPLAIN to analyze execution plans and choose the optimal solution.
  4. Create appropriate indexes on correlation conditions.

These optimization techniques can improve the performance of correlated subqueries by several orders of magnitude, with particularly significant effects in big data scenarios.