Principle Analysis of Dynamic Filtering in Database Query Optimization

I. Topic Description
Dynamic Filtering is a database query optimization technique primarily used in distributed databases or parallel query systems. Its core idea is to dynamically generate filtering conditions at runtime based on partially computed results, reducing the amount of data scanned by other compute nodes or operations. It is commonly used in star schema scenarios (fact table joins with dimension tables) or large table joins, effectively addressing issues of data skew and excessive network transmission overhead in distributed environments.

II. Applicable Scenarios

  1. Star Queries: When a fact table (large table) joins with dimension tables (small tables), the dimension tables are filtered first and the resulting join keys are applied dynamically during the fact table scan.
  2. Distributed Joins: When the two tables being joined are distributed across different nodes, filtering conditions can be computed on one node first and then broadcast to other nodes for early data filtering.
  3. Parallel Scans: When multiple threads scan partitioned tables, they can share dynamic filtering conditions to reduce redundant scans.
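The parallel-scan scenario can be sketched in Python: a single dynamic filter is built once and then shared read-only by every scan task, so each task prunes its own partition. The partition data, key values, and worker count below are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of a fact table; keys and payloads are invented.
partitions = [
    [(1, "a"), (2, "b"), (3, "c")],
    [(2, "d"), (4, "e")],
    [(1, "f"), (5, "g")],
]

# A dynamic filter built once from the dimension side and shared
# (read-only) by all parallel scan tasks.
dynamic_filter = frozenset({1, 2})

def scan_partition(rows):
    # Each task applies the shared filter while scanning its partition,
    # so rows with keys outside the filter never leave the scan.
    return [row for row in rows if row[0] in dynamic_filter]

with ThreadPoolExecutor(max_workers=3) as pool:
    surviving = [r for part in pool.map(scan_partition, partitions) for r in part]

print(surviving)  # [(1, 'a'), (2, 'b'), (2, 'd'), (1, 'f')]
```

Because the filter is immutable, no locking is needed; each task only reads it while scanning.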

III. Working Principle
The execution process of dynamic filtering can be divided into three stages, explained step-by-step using a typical star query example:

Step 1: Identify the Source of Filtering Conditions
Assume the query is:

SELECT f.order_id, d.date
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.year = 2023;
  • The dimension table dim_date is small and has a filtering condition d.year = 2023.
  • The fact table fact_sales is huge but has no direct filtering conditions.
    Traditional execution would scan the entire fact table first before joining with the dimension table, resulting in a large amount of ineffective I/O.

Step 2: Runtime Generation of Dynamic Filters

  1. The optimizer recognizes that the dimension table can be filtered early and inserts a "Dynamic Filter Generator" into the query plan.
  2. The execution engine first scans dim_date to obtain a list of all date_id values satisfying year=2023 (e.g., [20230101, 20230102, ...]).
  3. This list is converted into a lightweight filtering structure (e.g., Bloom filter, hash table, or value list) and broadcast to all parallel tasks scanning the fact table.

Step 3: Apply Dynamic Filtering

  1. Fact table scanning tasks receive the filtering structure and check whether f.date_id is in the filter set when reading each row.
  2. If it is not, the row is skipped directly and not passed to subsequent join operations.
  3. This significantly reduces the amount of data involved in the join, lowering network transmission and computational overhead.
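The three steps above can be sketched end-to-end in Python. The tables stand in for dim_date and fact_sales from the query; all rows are invented, and a plain set plays the role of the filtering structure.

```python
# Toy tables standing in for dim_date and fact_sales; all data is invented.
dim_date = [
    {"date_id": 20230101, "year": 2023},
    {"date_id": 20230102, "year": 2023},
    {"date_id": 20221231, "year": 2022},
]
fact_sales = [
    {"order_id": 1, "date_id": 20230101},
    {"order_id": 2, "date_id": 20221231},
    {"order_id": 3, "date_id": 20230102},
]

# Steps 1-2: scan the small dimension table first and collect the join
# keys that satisfy the predicate; this set is the dynamic filter.
dynamic_filter = {d["date_id"] for d in dim_date if d["year"] == 2023}

# Step 3: apply the filter during the fact-table scan, before the join.
filtered_fact = [f for f in fact_sales if f["date_id"] in dynamic_filter]

# Only the surviving rows are joined against the dimension table.
dim_by_id = {d["date_id"]: d for d in dim_date}
result = [(f["order_id"], dim_by_id[f["date_id"]]["date_id"])
          for f in filtered_fact]
print(result)  # [(1, 20230101), (3, 20230102)] — order 2 never reaches the join
```

The key property is that the fact row for 2022 is discarded at scan time, before any join or network transfer takes place.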

IV. Technical Implementation Methods
The implementation of dynamic filtering relies on the runtime optimization capabilities of the database engine, primarily in three forms:

  1. Bloom Filter

    • Maps the key values from the dimension table to a bit array using hash functions, consuming minimal memory.
    • During fact table scanning, the same hash functions check whether a key value might exist in the filter (false positives are possible, but false negatives are not, so no matching row is ever missed).
    • Suitable for scenarios with many key values where a small number of false positives is acceptable.
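A minimal Bloom filter can be written in a few lines of Python. The bit-array size, number of hash functions, and the trick of deriving positions from one SHA-256 digest are illustrative choices, not a production design.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions per key over an m-bit array.
    May report false positives, never false negatives."""

    def __init__(self, m_bits=1024, k=3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from one digest (illustrative, not tuned).
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from dimension-side keys, then probe with fact-side keys.
bf = BloomFilter()
for date_id in (20230101, 20230102, 20230103):
    bf.add(date_id)

print(bf.might_contain(20230102))  # True — inserted keys always pass
print(bf.might_contain(20221231))  # usually False; True would be a false positive
```

Because `might_contain` can only err on the side of keeping a row, a downstream exact join still produces correct results; the filter merely reduces how many rows reach it.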
  2. Value List

    • Directly stores a hash set (e.g., HashSet) of the key values from the dimension table.
    • Provides 100% filtering accuracy but memory consumption increases with the number of key values.
    • Suitable for scenarios where the dimension table is very small.
  3. Range Filter

    • If key values form a continuous range (e.g., dates), they can be converted into min-max interval filters.
    • Can be combined with Bloom filters to handle non-continuous values.
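A range filter reduces the dimension-side keys to their bounds, which makes it cheap to broadcast and to evaluate; the keys below are invented. Note that keys falling inside the range but absent from the dimension table still pass, which is exactly the gap a Bloom filter can close.

```python
# Min-max range filter: keep only the bounds of the dimension-side keys
# and prune fact rows (or whole partitions) that fall outside them.
dim_keys = [20230105, 20230101, 20230130]   # invented dimension keys
lo, hi = min(dim_keys), max(dim_keys)

def might_match(key):
    # Cheap check: anything outside [lo, hi] cannot possibly join.
    return lo <= key <= hi

fact_keys = [20221231, 20230110, 20230201]
survivors = [k for k in fact_keys if might_match(k)]
print(survivors)  # [20230110] — the other two keys fall outside the range
```

Here 20230110 survives even though it is not in dim_keys: the range filter, like a Bloom filter, only guarantees no false negatives.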

V. Optimization Benefits and Costs

  • Benefits:

    • Reduced I/O: Fact table scans skip irrelevant data.
    • Reduced Network Transmission: Less data crosses the network in distributed environments.
    • Improved Cache Efficiency: Less data enters memory for computation after filtering.
  • Costs:

    • Additional Computation: CPU overhead is required to build and check filters.
    • Memory Usage: Filters must reside in memory, which can be significant when the dimension-side key set is large.
    • Synchronization Overhead: All parallel tasks must wait for the filter to be generated before scanning.

VI. Example Demonstration
The following is an example of dynamic filtering in Spark SQL (triggered automatically):

-- Fact table: 1 billion rows, Dimension table: 10,000 rows
SELECT /*+ BROADCAST(d) */ f.*
FROM fact_log f JOIN dim_user d ON f.user_id = d.user_id
WHERE d.country = 'US';

Execution process:

  1. Broadcast the dimension table dim_user to all nodes.
  2. Pre-collect the user_id values of users with country = 'US' and build a Bloom filter.
  3. When each node scans fact_log, use the filter to check f.user_id, skipping rows for non-US users.
  4. The amount of fact table data participating in the join drops from 1 billion rows to approximately 100 million rows (assuming 10% of users are in the US).

VII. Applicability Limitations

  1. The filtered key set from the dimension table must be highly selective against the fact table; otherwise the filter eliminates little data and its build and probe overhead outweighs the benefit.
  2. In distributed environments, the delay and network cost of filter synchronization must be considered.
  3. For dynamically generated filtering conditions, the accuracy of statistical information must be ensured to avoid plan degradation.
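The first limitation can be framed as a back-of-the-envelope break-even check: build the filter only when the expected scan savings exceed the build and probe cost. All cost constants and selectivities below are invented placeholders, not measurements.

```python
# Break-even check for dynamic filtering. "selectivity" is the fraction of
# fact rows expected to survive the filter; cost units are arbitrary.
def worth_filtering(fact_rows, selectivity, cost_per_row=1.0,
                    build_cost=50_000, probe_cost_per_row=0.1):
    # Savings: downstream work avoided on the rows the filter removes.
    saved = fact_rows * (1 - selectivity) * cost_per_row
    # Overhead: building/broadcasting the filter plus probing every row.
    overhead = build_cost + fact_rows * probe_cost_per_row
    return saved > overhead

print(worth_filtering(1_000_000_000, 0.10))  # True  — selective filter pays off
print(worth_filtering(1_000_000_000, 0.99))  # False — filter removes almost nothing
```

Real optimizers make this decision from statistics (cardinalities, NDVs), which is why limitation 3 above stresses the accuracy of statistical information.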

Through dynamic filtering, database systems intelligently "prune" data streams at runtime, converting information that is invisible to traditional static optimization into filtering power. It is one of the key technologies for distributed query optimization.