Result Set Pushdown and Stored Procedure Execution Optimization in Database Query Execution Plans
Description
"Result Set Pushdown" is an optimization technique used in distributed databases or complex query scenarios. It involves pushing partial result sets (such as intermediate results, aggregate values, etc.) down to the storage layer closer to the data source (e.g., stored procedures, storage engines) for processing. This reduces the overhead of network transmission and data movement, thereby improving query performance. Combined with this, "Stored Procedure Execution Optimization" focuses on executing pre-compiled stored procedure logic on the database server side, leveraging the advantage of local data access to further reduce interactions between the client and server, as well as data processing latency. The combination of both can significantly enhance the execution efficiency of complex queries (such as multi-table joins and aggregate calculations).


Step-by-Step Explanation of the Problem-Solving Process

Step 1: Understand the Basic Scenario and Problem
Assume a distributed database system with two nodes: Node A stores the users table (fields: user_id, city), and Node B stores the orders table (fields: order_id, user_id, amount). The query for "total order amount for users in Beijing" is:

SELECT SUM(o.amount) 
FROM users u JOIN orders o ON u.user_id = o.user_id 
WHERE u.city = 'Beijing';

Traditional execution methods may suffer from two inefficiencies:

  1. Pulling the entire users table to Node B to perform the join and aggregation, resulting in large amounts of irrelevant data transmission.
  2. Pulling the entire orders table to Node A, similarly causing significant network overhead.

The core problem in both cases is the high cost of data movement.

Step 2: The Principle of Result Set Pushdown
The core idea of result set pushdown is "moving computation closer to the data source." For the above query, the optimizer can generate an execution plan as follows:

  • On Node A (storing the users table), first apply the filter WHERE city='Beijing' to obtain a small result set (only the user_id list of Beijing users).
  • Push this result set (instead of the entire table) down to Node B, where a local join and aggregation with the orders table is performed directly.

Benefits of this approach:

  • Only the filtered user_id list is transmitted, greatly reducing network traffic.
  • On Node B, local indexes can be leveraged for fast joins, avoiding full table scans.

Result set pushdown is typically determined automatically by the optimizer. Key conditions include: the pushed-down result set is small, the pushdown can utilize storage-layer indexes, and the storage layer supports the required operations (e.g., joins, aggregations).
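The pushdown plan above can be sketched in a few lines of Python. This is a minimal simulation, not a real engine: the two nodes are modeled as in-memory tables, and the sample rows and table sizes are illustrative assumptions.

```python
# Node A: users(user_id, city) -- illustrative sample data
node_a_users = [
    {"user_id": 1, "city": "Beijing"},
    {"user_id": 2, "city": "Shanghai"},
    {"user_id": 3, "city": "Beijing"},
]

# Node B: orders(order_id, user_id, amount) -- illustrative sample data
node_b_orders = [
    {"order_id": 10, "user_id": 1, "amount": 50.0},
    {"order_id": 11, "user_id": 2, "amount": 80.0},
    {"order_id": 12, "user_id": 3, "amount": 20.0},
    {"order_id": 13, "user_id": 1, "amount": 30.0},
]

def node_a_filter(city):
    """Runs on Node A: apply WHERE city=... locally, return only user_ids."""
    return {row["user_id"] for row in node_a_users if row["city"] == city}

def node_b_join_and_sum(user_ids):
    """Runs on Node B: local join + aggregation against the pushed-down id set."""
    return sum(row["amount"] for row in node_b_orders if row["user_id"] in user_ids)

beijing_ids = node_a_filter("Beijing")    # only the small id set crosses the network
total = node_b_join_and_sum(beijing_ids)  # a single scalar comes back
print(total)  # 100.0
```

Note that only `beijing_ids` (the filtered result set) and `total` (one scalar) would ever travel between nodes; neither full table is shipped.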

Step 3: Integration of Stored Procedure Execution Optimization
For more complex queries (e.g., involving multi-layer filtering, aggregation, or business logic), part of the logic can be encapsulated into stored procedures deployed on the nodes where the data resides. For example:

  1. Create a stored procedure sp_get_beijing_users() on Node A to return the user_id list of Beijing users.
  2. Create a stored procedure sp_sum_amount_by_users(user_id_list) on Node B to receive the user_id list, perform a local join with the orders table, and return the aggregate result.
  3. The client or coordinator node calls these two stored procedures, requiring only the transmission of the user_id list and the final aggregate value.
Advantages:

  • Stored procedures are pre-compiled, ensuring high execution efficiency.
  • The logic runs on the node where the data resides, reducing data movement.
  • They can be combined with transaction control to ensure consistency.
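The stored-procedure split can be sketched as follows. Here each "procedure" is modeled as a plain Python function over its node's local data; the names mirror the hypothetical procedures above, and in a real system they would be written in the database's procedure language (e.g., PL/pgSQL or MySQL stored routines).

```python
USERS_ON_A = {1: "Beijing", 2: "Shanghai", 3: "Beijing"}  # user_id -> city, sample data
ORDERS_ON_B = [  # (order_id, user_id, amount), sample data
    (10, 1, 50.0), (11, 2, 80.0), (12, 3, 20.0), (13, 1, 30.0),
]

def sp_get_beijing_users():
    """Node A procedure: pre-compiled filter, returns only the user_id list."""
    return [uid for uid, city in USERS_ON_A.items() if city == "Beijing"]

def sp_sum_amount_by_users(user_id_list):
    """Node B procedure: local join with orders, returns one aggregate value."""
    wanted = set(user_id_list)
    return sum(amount for _oid, uid, amount in ORDERS_ON_B if uid in wanted)

def coordinator():
    """Client/coordinator: only an id list and a scalar cross the network."""
    ids = sp_get_beijing_users()
    return sp_sum_amount_by_users(ids)

print(coordinator())  # 100.0
```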

Step 4: How the Optimizer Makes Decisions and Implements
When generating an execution plan, the optimizer follows these steps:

  1. Cost Estimation: Compare the cost of "traditional execution" versus "pushdown execution." The cost model considers factors such as:
    • Network transmission volume (number of rows × row size).
    • Storage-layer processing capability (availability of indexes, computational resources).
    • Intermediate result set size (estimated via statistics, e.g., selectivity of city='Beijing').
  2. Feasibility Check: Verify whether the storage layer supports pushdown operations. For instance, some distributed database storage nodes support "pushdown joins" or "pushdown aggregations," while simpler storage engines may only support pushdown filtering.
  3. Generate Pushdown Plan:
    • Push filter conditions down to the users table scan phase.
    • Push join operations down to the stored procedure or storage engine layer for execution (e.g., leveraging local foreign key indexes).
    • Push aggregation operations down to data nodes for partial aggregation, followed by final aggregation at the coordinator node.
  4. Stored Procedure Call Optimization: If stored procedures are used, the optimizer may embed stored procedure calls into the execution plan and attempt "parallel pushdown," such as simultaneously invoking stored procedures across multiple nodes for parallel partial aggregation.
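The cost comparison in step 1 can be made concrete with a toy model. The constants (row counts, row sizes, selectivity) are made-up assumptions for illustration; a real optimizer would also weigh storage-layer CPU and index availability.

```python
def network_cost(rows, row_bytes):
    """Cost proxy: bytes shipped between nodes (rows x row size)."""
    return rows * row_bytes

USERS_ROWS = 1_000_000       # assumed users table size
USER_ROW_BYTES = 64          # assumed full row width
USER_ID_BYTES = 8            # assumed width of a bare user_id
SELECTIVITY_BEIJING = 0.0001 # from an (assumed) histogram on users.city

# Plan 1 (traditional): ship the whole users table to Node B.
plan_full_ship = network_cost(USERS_ROWS, USER_ROW_BYTES)

# Plan 2 (pushdown): filter on Node A, ship only matching user_ids.
filtered_rows = int(USERS_ROWS * SELECTIVITY_BEIJING)
plan_pushdown = network_cost(filtered_rows, USER_ID_BYTES)

chosen = "pushdown" if plan_pushdown < plan_full_ship else "full_ship"
print(chosen, plan_full_ship, plan_pushdown)  # pushdown 64000000 800
```

With these numbers the pushdown plan ships 800 bytes instead of 64 MB, so the optimizer picks it; with a very unselective predicate the comparison could flip, which is exactly the judgment call discussed in Step 6.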

Step 5: Breakdown of Example Execution Flow
Using the "total order amount for Beijing users" query as an example, the optimized execution steps are:

  1. The coordinator node parses the SQL and identifies that the users table filter condition can be pushed down.
  2. Send a request to Node A: "Return the user_id list for city='Beijing'."
  3. Node A executes the filter locally and returns the result set R (e.g., 100 user_id entries).
  4. The coordinator node pushes R down to Node B with the instruction: "Calculate the total order amount in the orders table for user_id values in R."
  5. Node B locally looks up rows in the orders table matching R via indexes and calculates the partial sum (if the orders table is partitioned, multiple local aggregations may be involved).
  6. Node B returns the aggregate result (a single value) to the coordinator node.
  7. The coordinator node consolidates the results (in this case, it simply returns the value).

Throughout the process, only 100 user_id entries and one aggregate value are transmitted over the network, far less than transmitting entire tables.

Step 6: Considerations and Limitations

  1. Size of Pushed-Down Result Set: If the filtered result set remains large (e.g., millions of users with city='Beijing'), pushdown may actually be slower due to increased network transmission and storage-layer processing overhead. The optimizer must accurately judge this via selectivity estimation.
  2. Storage Layer Functional Limitations: Not all storage engines support complex pushdown operations (e.g., joins, aggregations). For example, simple key-value stores may only support pushdown key-based queries.
  3. Data Consistency: In distributed transactions, pushdown operations must be combined with mechanisms like two-phase commit to ensure consistency.
  4. Timeliness of Statistics: The optimizer relies on statistics (e.g., histograms for the city column) to estimate result set sizes. Outdated statistics may lead to incorrect decisions.
  5. Stored Procedure Maintenance Cost: Embedding business logic into the database via stored procedures may introduce challenges such as debugging difficulties and version management.
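Points 1 and 4 interact: a stale histogram can make the optimizer push down a result set that is no longer small. The sketch below uses a hypothetical per-value frequency histogram on users.city and an illustrative pushdown threshold.

```python
def estimate_rows(histogram, total_rows, city):
    """Estimate result-set size from a per-value frequency histogram."""
    return int(total_rows * histogram.get(city, 0.0))

PUSHDOWN_ROW_LIMIT = 10_000  # assumed: push down only small result sets

total_rows = 1_000_000
stale_hist = {"Beijing": 0.001}  # collected before a large data load
fresh_hist = {"Beijing": 0.30}   # reflects the current distribution

stale_est = estimate_rows(stale_hist, total_rows, "Beijing")  # 1,000 rows
fresh_est = estimate_rows(fresh_hist, total_rows, "Beijing")  # 300,000 rows

# With stale statistics the optimizer chooses pushdown; with fresh
# statistics it would see that 300k ids must be shipped and decline.
print(stale_est <= PUSHDOWN_ROW_LIMIT)  # True
print(fresh_est <= PUSHDOWN_ROW_LIMIT)  # False
```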

Step 7: Extended Scenarios
Result set pushdown and stored procedure optimization are particularly effective in the following scenarios:

  • Star Schema Queries: Push filter conditions for fact tables down to dimension tables for early filtering, reducing the amount of data for joins.
  • Multi-Layer Aggregation Queries: Perform local aggregation on sharded data first, then global consolidation.
  • ETL Pipelines: Use stored procedures directly at the storage layer for data cleaning and transformation, reducing the number of data landing operations.
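The multi-layer aggregation pattern above (local partial aggregation per shard, then global consolidation) can be sketched as follows; the shard contents are illustrative.

```python
# Two shards of an orders-like table: (city, amount) rows, sample data.
shards = [
    [("Beijing", 50.0), ("Shanghai", 80.0)],  # shard 1
    [("Beijing", 20.0), ("Beijing", 30.0)],   # shard 2
]

def local_partial_sum(rows, city):
    """Runs on each data node: aggregate only its local rows."""
    return sum(amount for c, amount in rows if c == city)

# Each shard returns a single partial value; the coordinator merges them.
partials = [local_partial_sum(rows, "Beijing") for rows in shards]
total = sum(partials)
print(partials, total)  # [50.0, 50.0] 100.0
```

SUM decomposes cleanly into partials; the same pattern works for COUNT and MIN/MAX, while AVG requires shipping (sum, count) pairs rather than per-shard averages.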

By combining result set pushdown with stored procedure optimization, database systems can push computational tasks as close to the data source as possible, reducing unnecessary network and computational overhead, making them especially suitable for data-intensive applications.