Database Query Execution Plan and Execution Order Analysis

Database Query Execution Plan and Execution Order Analysis

Problem Description
A database query execution plan is a step-by-step blueprint generated by the optimizer for executing an SQL query. Understanding the order of execution (such as which table is accessed first in multi-table JOINs, and how indexes are used) is crucial for performance tuning. This topic will explain how to analyze the actual execution order of a query through the execution plan and interpret the execution logic of common operators (such as Nested Loop, Hash Join).

1. Generation and Representation of Execution Plans

Generation Process: The optimizer generates multiple candidate plans based on the query statement, table statistics, indexes, etc., and selects the plan with the lowest cost.
Representation Form: Usually displayed as a tree structure, where child nodes execute first, and their results are passed to parent nodes. For example:
```
Hash Join
├── Seq Scan on table_A
└── Seq Scan on table_B
```
This indicates a full table scan on table_A and table_B first, followed by a Hash Join.

2. Basic Principles of Execution Order

From Leaves to Root: In the execution plan tree, the lowest-level operators (such as index scans, full table scans) execute first, and intermediate results are gradually passed upward.
Data Flow Direction: The output of child nodes serves as input for parent nodes. For example, in a Nested Loop Join, each row from the outer table (driving table) is matched with all rows from the inner table.

3. Execution Logic of Common Operators

Nested Loop Join:
1. First, execute the outer table (e.g., index scan) to obtain the driving dataset.
2. For each row in the outer table, traverse the inner table (possibly using an index) to match data.
- Characteristics: Suitable for scenarios where the driving table has a small amount of data or the inner table has an efficient index.
Hash Join:
1. First, execute the inner table (right table) scan and build a hash table in memory.
2. Scan the outer table (left table) and probe the hash table for each row to find matches.
- Characteristics: Suitable for scenarios with large table data and no indexes, but requires memory support.
Merge Join:
1. Sort both tables by the join key (skip if already sorted).
2. Use a two-pointer traversal of the sorted datasets to merge matching items.
- Characteristics: Suitable for scenarios where data is already sorted or the query requires sorting.

4. Practical Case Analysis
Assume the query:

SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id
WHERE customers.country = 'US';

A possible execution plan:

Hash Join
├── Seq Scan on customers
│   └── Filter: (country = 'US')
└── Seq Scan on orders

Execution Order:

Perform a full table scan on the customers table and filter out data where country='US'.
Perform a full table scan on the orders table.
Use the filtered results from customers as the inner data for the hash table and orders as the outer data, then execute the Hash Join.

5. How to Verify Execution Order

Use EXPLAIN ANALYZE (e.g., in PostgreSQL) or actual execution plans (e.g., in SQL Server) to view actual execution time and row counts.
Observe the actual rows and loops of operators: In a Nested Loop, the loops value of the inner table usually equals the row count of the outer table.

6. Performance Tuning Insights

If the driving table has too much data, consider adding indexes or conditional filters.
Avoid full table scans on the inner table: Create indexes for join keys or filter conditions.
Pay attention to data skew: In Hash Join, an oversized hash table may cause memory overflow.

By step-by-step analysis of the structure of the execution plan tree and the characteristics of operators, query bottlenecks can be accurately identified and targeted optimizations can be applied.