Principle and Practice of Predicate Pushdown in Database Query Optimization

Principle and Practice of Predicate Pushdown in Database Query Optimization

Description
Predicate pushdown is an important technique in database query optimization. Its core idea is to perform data filtering operations as early as possible. In a complex query (especially statements involving multi-table joins or nested queries), the optimizer attempts to push the filtering conditions (i.e., predicates in the WHERE clause) as far down the query plan as possible, executing them closer to the data source. The purpose of this is to reduce the amount of data that needs to be processed in the early stages of query execution, thereby lowering I/O overhead and the computational cost of subsequent operations (such as joins, aggregations), ultimately improving query performance.

Problem-Solving Process

Understanding the Problem Scenario: Inefficient Query Without Predicate Pushdown
Assume we have a simple query involving two tables: orders and customers.
- The orders table has 1 million records.
- The customers table has 100,000 records.
- We want to query the detailed information of orders from "Shanghai" with an amount greater than 1000, along with the corresponding customer names.
An unoptimized, intuitive query execution plan might look like this (logically, not an actual execution plan):
```
1. Perform a full table scan on `orders`, reading all 1 million records.
2. Perform a full table scan on `customers`, reading all 100,000 records.
3. Perform a `JOIN` operation on these two result sets (totaling 1.1 million records), generating a temporary result set containing all possible combinations (theoretically up to 1 million * 100,000 records, but actually much less due to join constraints, yet still enormous).
4. Apply the `WHERE` condition `customers.city = 'Shanghai' AND orders.amount > 1000` on this huge temporary result set.
5. Finally, return the small number of records that meet the conditions (e.g., a few thousand).
```
Problem Analysis: This execution method is highly inefficient. It performs no filtering before the join operation, resulting in the need to process a large amount of unnecessary data, causing significant I/O and CPU overhead.

Introducing the Solution: The Basic Idea of Predicate Pushdown
The goal of predicate pushdown is to solve the above problem. The optimizer analyzes the SQL statement and attempts to push filtering conditions down to the source of data reading.
For the same query, the logical execution plan after applying predicate pushdown optimization is as follows:

1. When scanning the `orders` table, directly apply the condition `orders.amount > 1000`. Thus, only about 50,000 order records meeting the amount condition might be read from disk into memory, instead of 1 million.
2. When scanning the `customers` table, directly apply the condition `customers.city = 'Shanghai'`. Thus, only about 10,000 Shanghai customer records might be read from disk into memory, instead of 100,000.
3. Perform a `JOIN` operation on these two significantly reduced intermediate result sets (50,000 orders and 10,000 customers).
4. Return the joined result.

Effect Comparison: After predicate pushdown, the data volume involved in the join operation drops from the million level to the ten-thousand level, resulting in an order-of-magnitude improvement in query performance.

In-Depth Principle: Applicable Scenarios and Rules for Predicate Pushdown
Predicate pushdown is not applicable in all situations. The optimizer needs to make judgments based on relational algebra rules and cost estimation.
- Scenario One: Pushdown to Single Table Scan (Most Common)
  This is the most straightforward case, as in the example above. Push the conditions in the WHERE clause that target a single table directly down to that table's scan operator.
- Scenario Two: Pushdown Through Joins (JOIN)
  This is a more complex but also more critical situation. Consider an inner join (INNER JOIN):
```
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 1000 AND c.city = 'Shanghai';
```
  For an inner join, o.amount > 1000 can be safely pushed down before the orders table scan, and c.city = 'Shanghai' can also be safely pushed down before the customers table scan. This is because an inner join only returns rows where both tables satisfy the join condition; filtering out rows that do not meet the WHERE condition early does not affect the final join result.
- Scenario Three: Limitations in Outer Joins (OUTER JOIN)
  Predicate pushdown requires special caution in outer joins because outer joins need to retain all rows from one side (even if there is no match on the other side).
  - Left Outer Join (LEFT JOIN) Example:
```
SELECT * FROM orders o LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.city = 'Shanghai'; -- Predicate on the right table (customers)
```
    Incorrect Pushdown: If c.city = 'Shanghai' is directly pushed down before the customers table scan, then all customers whose city is not 'Shanghai' will be filtered out. This would cause orders that have no matching record in the customers table (which should be retained after the left join with c.* fields as NULL) to be incorrectly excluded from the final result because the right table was filtered early, preventing the generation of NULL rows. This WHERE condition effectively converts the left join into an equivalent inner join.
    Correct Handling: The optimizer typically does not push predicates on the right table down before the right table scan in a left join. Instead, it applies this condition after the join is completed. However, if the predicate is on the left table, it can be pushed down:
```
SELECT * FROM orders o LEFT JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 1000; -- Predicate on the left table (orders)
```
    This o.amount > 1000 can be safely pushed down because it only filters the left table and does not affect the logic of generating NULL rows for the right table.
Practice and Observation: How to Use EXPLAIN for Verification
In actual databases (such as MySQL, PostgreSQL), you can use the EXPLAIN command to view the query execution plan and determine if predicate pushdown has occurred.
- In MySQL:
  Using EXPLAIN FORMAT=JSON provides more detailed information. Observe the execution plan. If you see that in the steps related to a table's access type (table), the attached_condition field shows specific filtering conditions, rather than having a WHERE clause filtering step only after the join operation, it usually indicates that predicate pushdown is in effect.
- In PostgreSQL:
  Use EXPLAIN (ANALYZE, BUFFERS). Observe the output of the execution plan. If for a table's Seq Scan or Index Scan operation, the condition is displayed as Filter: (amount > 1000), and this scan operation is located under a join node, it indicates that the predicate amount > 1000 has been pushed down to the table scan phase for execution.

Summary
Predicate pushdown is a fundamental and powerful technique used by database query optimizers. Its core principle is "filter early." By pushing filtering conditions down to the bottom of the query plan tree, close to the data source, it significantly reduces the size of intermediate result sets, thereby lowering the I/O and computational overhead of the entire query. Understanding how it works, especially its behavioral differences under various join types (inner join vs. outer join), is crucial for writing efficient SQL and performing query performance tuning. Using the EXPLAIN tool provided by databases allows you to visually verify whether the optimizer has applied predicate pushdown.