Optimization Technique Combining Index Condition Filtering with Bloom Filters in Database Query Execution Plans
1. Description
This is a composite strategy that combines two optimization techniques. In distributed databases or large-scale data processing scenarios, multi-table joins (especially between large and small tables) under traditional "Index Condition Filtering" may still incur significant I/O overhead due to frequent accesses to the index of the driven table (the large table). A "Bloom Filter" is a probabilistic data structure that can efficiently determine whether an element is "definitely not in a set" or "may be in a set."

This technique combines the two. The core idea: before the join operation, first construct one or more Bloom Filters from the driving table (usually the smaller table), then "push down" this filter into the index scan or data access path of the driven table (the large table). While reading index entries, a large number of rows that definitely will not match can be filtered out immediately, significantly reducing invalid index lookups and data block reads, which lowers CPU and I/O load and improves query performance.
2. Technical Details and Step-by-Step Breakdown
Step One: Understanding the Basic Component—Index Condition Filtering
- Goal: Evaluate part of the WHERE clause conditions within the index structure itself, using the column information contained in the index, before reading data pages (or rows) from the storage engine. Index entries that do not meet the conditions are not used to retrieve the full data rows.
- Process: The query optimizer "pushes" conditions from the WHERE clause that can be satisfied by indexed columns down to the index access layer of the storage engine. For example, for the query `SELECT * FROM orders WHERE status='shipped' AND amount > 100`, if there is a composite index on `(status, amount)`, the storage engine can skip all index entries that do not satisfy `status='shipped' AND amount > 100` while traversing the index, instead of first fetching all entries with `status='shipped'` and then filtering on `amount` in memory.
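The idea can be sketched in a few lines of Python. The in-memory index, the tuple layout `(status, amount, row_id)`, and the function name are all hypothetical stand-ins for the storage engine's real structures; the point is only that both predicates are evaluated against the index entry itself, before any full row is fetched:

```python
# Minimal sketch of index condition filtering over a hypothetical in-memory
# index. Each index entry is (status, amount, row_id); both pushed-down
# predicates are checked on the entry, so non-matching entries never trigger
# a fetch of the full data row.

def index_condition_scan(index_entries, status, min_amount):
    """Yield row ids whose index entry satisfies both pushed-down conditions."""
    for entry_status, entry_amount, row_id in index_entries:
        # Predicates evaluated on the index entry, before any row fetch.
        if entry_status == status and entry_amount > min_amount:
            yield row_id

index = [
    ("pending", 250.0, 1),
    ("shipped",  80.0, 2),
    ("shipped", 150.0, 3),
    ("shipped", 420.0, 4),
]

matches = list(index_condition_scan(index, "shipped", 100))
```

Only rows 3 and 4 survive; rows 1 and 2 are rejected without ever touching the table data.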
Step Two: Understanding the Other Basic Component—Bloom Filter
- Nature: A data structure consisting of multiple hash functions and a long binary bit array.
- Core Operations:
- Add Element: Use K independent hash functions to map the element to be added (e.g., a join key value) to K specific positions in the bit array and set the bits at those positions to 1.
- Query Element: Use the same K hash functions to compute the K positions for the element being queried. If all bits at these positions are 1, the answer is "may exist"; if at least one position is 0, the answer is "definitely does not exist".
- Characteristics: It has a very low probability of "false positives" (i.e., reporting that an element may exist when it actually does not), but absolutely no "false negatives" (if it reports that an element definitely does not exist, it is truly absent). It offers extremely high space and query efficiency.
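These operations can be sketched as a small, self-contained Python class (the sizing and the double-hashing scheme here are illustrative choices, not a production implementation):

```python
import hashlib

class BloomFilter:
    """Simple Bloom Filter: a bit array plus K derived hash positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive K positions from one cryptographic hash (double hashing).
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        # Set all K bits for this element.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not present"; True means "may be present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
for key in (101, 205, 309):
    bf.add(key)
```

Every added key is guaranteed to test positive (no false negatives); keys that were never added almost always test negative, with a small false positive probability governed by the bit array size and hash count.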
Step Three: Problem Scenario and Motivation for Combination
- Scenario: A large table `fact_sales` (fact table, billions of rows) is equi-joined with a small table `dim_product` (dimension table, thousands of rows) on the join key `product_id`. The execution plan typically uses the small table as the driving table and performs lookups on the `product_id` index of the large table.
- Pain Points of the Traditional Approach: Even if the `dim_product` table has only 1000 products, each row (or data block) of the large table requires a lookup on its `product_id` index to confirm whether its `product_id` is among those 1000 values. Although individual index lookups are fast, the overhead of billions of lookups is still enormous, and a large number of non-matching rows trigger useless index traversals and data block loads.
- Motivation for Combination: Can we preemptively, and in batch, exclude rows from the large table whose `product_id` is definitely not within the small table's set of values while scanning the large table's index? Bloom Filters make this possible.
Step Four: Execution Flow of the Combined Optimization Technique
Assuming the query is: `SELECT * FROM fact_sales fs JOIN dim_product dp ON fs.product_id = dp.product_id WHERE dp.category='Electronics';`
- Construction Phase:
  - The database first scans the driving table `dim_product`, filters rows where `category='Electronics'`, and collects the `product_id` values from these rows (e.g., 500 values).
  - The system creates a Bloom Filter using these 500 `product_id` values as input. The size of the bit array and the number of hash functions are chosen to balance memory usage against the false positive rate.
- Push-down and Filtering Phase:
  - The optimizer "pushes down" the constructed Bloom Filter to the access operation on the `fact_sales` table.
  - When the storage engine (or executor) scans the index on `product_id` of the `fact_sales` table, it performs the following for each scanned index entry (containing a `product_id` value):
    a. Bitmap Filtering: Feed the `product_id` value from the index entry into the pushed-down Bloom Filter for testing.
    b. Fast Decision: If the filter returns "definitely does not exist", the index entry is discarded immediately, and the corresponding data row is never read.
    c. Continue Processing: If the filter returns "may exist", the index entry (and its corresponding row) proceeds to the subsequent processing flow (e.g., reading the data row, then performing an exact join match).
- Exact Match Phase:
  - The rows from the large table that survive the Bloom Filter screening then undergo an exact equi-join with the result set from the driving table, eliminating the small number of "false positive" rows introduced by the Bloom Filter.
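The three phases above can be sketched end-to-end in Python. The table contents, sizes, and the flat-function Bloom Filter here are all hypothetical; a real engine would do the filtering inside the index scan rather than over a Python list:

```python
import hashlib

NUM_BITS, NUM_HASHES = 4096, 3

def positions(item, num_bits=NUM_BITS, num_hashes=NUM_HASHES):
    # Derive K bit positions per key via double hashing (illustrative scheme).
    digest = hashlib.sha256(str(item).encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]

# --- Construction phase: scan the driving table, build the filter from the
# join keys that survive the dimension-side predicate.
dim_product = [  # (product_id, category) -- hypothetical dimension rows
    (10, "Electronics"), (20, "Toys"), (30, "Electronics"), (40, "Books"),
]
build_keys = {pid for pid, cat in dim_product if cat == "Electronics"}

bits = bytearray(NUM_BITS // 8)
for key in build_keys:
    for pos in positions(key):
        bits[pos // 8] |= 1 << (pos % 8)

def might_contain(key):
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in positions(key))

# --- Push-down and filtering phase: test each index entry against the
# filter before fetching the corresponding data row.
fact_sales_index = [  # (product_id, row_id) -- hypothetical index entries
    (10, 1), (20, 2), (50, 3), (30, 4), (60, 5), (10, 6),
]
survivors = [(pid, rid) for pid, rid in fact_sales_index if might_contain(pid)]

# --- Exact match phase: a precise join removes any false positives.
result = [(pid, rid) for pid, rid in survivors if pid in build_keys]
```

Note that the final result is exact regardless of how many false positives slip through the filter; the probabilistic step only prunes work, it never decides membership on its own.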
Step Five: Technical Advantages and Cost Analysis
- Advantages:
- Significantly Reduces Invalid I/O: Filters out most non-matching join keys at the index layer, avoiding data row reads based on these invalid key values.
- Reduces CPU Overhead: Avoids a large number of useless index key comparisons and the construction of short-lived row objects.
- Particularly Suitable for Distributed Joins: Filters data locally at the node where the data resides (e.g., HDFS data block, storage node) using the Bloom Filter before network transmission, greatly reducing the amount of intermediate data transmitted over the network.
- Costs:
- Memory Overhead: Requires maintaining the Bloom Filter bit array in memory.
- CPU Overhead: Requires computing K hash functions for each scanned index entry.
- False Positive Overhead: A small number of rows that are not part of the join result pass through the filter and require elimination in the subsequent exact join, incurring additional cost. However, by reasonably configuring parameters, this cost can be made much smaller than the benefits gained from filtering.
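The memory/false-positive trade-off is governed by the standard sizing formulas: for n keys and a target false positive rate p, the bit array size is m = -n·ln(p)/(ln 2)² and the hash count is k = (m/n)·ln 2. A small sketch (function name is illustrative):

```python
import math

def bloom_parameters(n_items, target_fp_rate):
    """Standard Bloom Filter sizing formulas.

    m = -n * ln(p) / (ln 2)^2   bits in the array
    k = (m / n) * ln 2          number of hash functions
    """
    m = math.ceil(-n_items * math.log(target_fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# E.g., the 500 join keys from Step Four at a 1% false positive rate:
m, k = bloom_parameters(500, 0.01)
```

For 500 keys at p = 0.01 this works out to roughly 4800 bits (about 600 bytes) and 7 hash functions, which illustrates why the memory cost is usually negligible next to the I/O it saves.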
Summary: The optimization combining Index Condition Filtering with Bloom Filters is a powerful embodiment of the "filter early, push computation down" philosophy. By embedding a compact, probabilistic membership test derived from the driving table into the lowest-level data access path (the index scan) of the driven table, it prunes data at the earliest point in the data flow. This is a highly efficient runtime optimization technique for large-scale data joins, widely used in big data processing systems such as Spark, Presto, and ClickHouse, as well as in scenario-specific optimizations in traditional databases such as Oracle and MySQL.