Query Rewriting and Equivalence Transformation Techniques in Databases

Query Rewriting and Equivalence Transformation Techniques in Databases

Problem Description
Query rewriting is a crucial step in database query optimization. It refers to the process of transforming a user-submitted SQL statement into a more efficient, equivalent form that is more suitable for the query optimizer to process, without altering the query's semantics. This process achieves optimization at the logical level by applying a series of rules based on relational algebra and database constraints.

Knowledge Explanation

I. Goals and Fundamentals of Query Rewriting

Core Goal: To improve query execution efficiency. The method involves converting the original query into an equivalent query that yields identical results but has a lower execution cost.
Theoretical Basis: Relational algebra. SQL queries can be transformed into relational algebra expressions (e.g., selection σ, projection π, join ⋈). Rewriting involves applying the equivalence laws of relational algebra to transform these expressions.
Key Basis: Database schema information and constraints. This includes primary keys, foreign keys, unique constraints, NOT NULL constraints, CHECK constraints, etc. These constraints provide additional semantic information about the data and are prerequisites for many rewriting rules.

II. Common Query Rewriting Rules and Techniques
We will now explain several core rewriting techniques step by step.

Step 1: Condition Simplification (Predicate Simplification)
This is the most basic rewrite, aiming to reduce computational complexity.

Rule: Simplify Boolean expressions in WHERE/HAVING/ON clauses using logical identities.
Process:
1. Constant Evaluation: Directly compute expressions involving constants.
  - Original query: SELECT * FROM t WHERE salary > 10000 + 5000
  - Rewritten: SELECT * FROM t WHERE salary > 15000
2. NOT Elimination: Apply De Morgan's laws, etc.
  - Original query: SELECT * FROM t WHERE NOT (age > 30 AND dept = 'IT')
  - Rewritten: SELECT * FROM t WHERE age <= 30 OR dept <> 'IT'
3. Constant Propagation: If an equality comparison exists in a condition, propagate the constant to other conditions.
  - Original query: SELECT * FROM t WHERE id = 10 AND name = (SELECT name FROM t WHERE id = 10)
  - Rewritten: SELECT * FROM t WHERE id = 10 AND name = (SELECT name FROM t WHERE id = 10) (It may appear unchanged here, but the optimizer might evaluate the subquery as a constant.)
Effect: Reduces the complexity of conditions that subsequent operations need to handle.

Step 2: View Expansion
If a view is used in the query, its definition needs to be merged into the main query.

Rule: Replace the view name with its SQL definition.
Process:
1. Assume a view: CREATE VIEW v_emp AS SELECT id, name FROM employee WHERE dept = 'Sales'
2. Original query: SELECT name FROM v_emp WHERE id > 100
3. Rewritten (expanded): SELECT name FROM (SELECT id, name FROM employee WHERE dept = 'Sales') AS v_emp WHERE id > 100
4. Further, the optimizer pushes the outer condition id > 100 down into the inner query (see next step "Predicate Pushdown"), resulting in the final rewrite: SELECT name FROM employee WHERE dept = 'Sales' AND id > 100
Effect: Allows the optimizer to perform optimization across the entire query scope, rather than treating the view as a "black box."

Step 3: Predicate Pushdown
This is an extremely important and effective rule, aiming to filter out unnecessary data as early as possible.

Rule: Move selection operations (conditions in the WHERE clause) as close to the data sources as possible; the earlier the filter is applied, the better.
Process:
1. Scenario: Join Queries
  - Original query: SELECT * FROM orders o JOIN customers c ON o.cid = c.id WHERE c.country = 'China'
  - Rewriting idea: Push the condition c.country = 'China' from after the join (filtering after the Join) to before the join, specifically down to the scan of the customers table.
  - Rewritten (logically equivalent): SELECT * FROM orders o JOIN (SELECT * FROM customers WHERE country = 'China') c ON o.cid = c.id
  - Effect: The amount of data (from the customers table) that the join operation needs to process is significantly reduced, greatly improving performance.
2. Scenario: Subqueries/Views: As in the view expansion example above.
Effect: Reduces the size of intermediate result sets that expensive operations like joins and grouping need to handle.

Step 4: Join Order Adjustment (Join Reordering)

Rule: Change the join order of multiple tables in a query based on table sizes and the selectivity of join conditions.
Process:
1. Join operations satisfy associativity and commutativity: (A ⋈ B) ⋈ C ≡ A ⋈ (B ⋈ C)
2. The optimizer's goal is to find a join order that minimizes the size of intermediate result sets.
3. Example:
  - Assume table A is large, B and C are small, and A ⋈ B produces a large result, while B ⋈ C produces a very small result.
  - The original query plan might be (A ⋈ B) ⋈ C.
  - After rewriting, the optimizer might choose A ⋈ (B ⋈ C). Computing the small B ⋈ C first, then joining with the large A, results in lower overall cost.
Effect: Reduces I/O and CPU consumption by minimizing the data volume of intermediate results.

Step 5: Rewriting Using Constraints

Rule: Utilize primary keys, foreign keys, unique constraints, etc., to simplify or alter queries.
Process:
1. Outer Join to Inner Join Conversion:
  - Original query: SELECT A.*, B.* FROM A LEFT JOIN B ON A.id = B.aid WHERE B.score > 90
  - Analysis: The WHERE clause filters on a column from the right table B (B.score > 90). If the B column is NULL (produced by the left join), the condition NULL > 90 evaluates to UNKNOWN, causing that row to be filtered out. This effectively makes the left join behave like an inner join.
  - Prerequisite: Need to confirm that B.aid is a foreign key referencing A.id, or that a NOT NULL constraint exists on B.score, to ensure semantic equivalence.
  - Rewritten: SELECT A.*, B.* FROM A INNER JOIN B ON A.id = B.aid WHERE B.score > 90
  - Effect: Inner joins have more optimization strategies and algorithms available, typically being more efficient than outer joins.
2. Eliminating DISTINCT:
  - Original query: SELECT DISTINCT department_id FROM employees
  - Analysis: If department_id is the primary key of the departments table and is a foreign key in the employees table, but the query originates from the employees table.
  - DISTINCT cannot be directly eliminated here. However, if the query were SELECT DISTINCT d.id FROM departments d JOIN ... and the join conditions guarantee uniqueness, elimination might be possible.
  - A more typical example is when a query contains a GROUP BY clause and the grouping columns are a primary or unique key; the DISTINCT in the SELECT clause might be redundant.
Effect: Converts complex operations into simpler ones or directly eliminates unnecessary operations.

Step 6: Subquery Optimization
Subqueries (especially correlated subqueries) often have poor performance. The rewriting goal is to transform them into more efficient joins (JOIN) or semi-joins (SEMI-JOIN).

Rule: Expand specific forms of subqueries into join operations.
Process:
1. IN Subquery to Semi-Join:
  - Original query: SELECT * FROM products p WHERE p.category_id IN (SELECT id FROM categories WHERE name = 'Electronics')
  - Rewritten (logically equivalent): SELECT p.* FROM products p SEMI JOIN categories c ON p.category_id = c.id WHERE c.name = 'Electronics'
  - Explanation: A semi-join returns only rows from the left table (products) that have matching rows in the right table (categories), and for each row in the left table, it is returned only once even if there are multiple matches in the right table. Modern database optimizers can perform this transformation automatically.
2. EXISTS Subqueries: Optimization is similar to IN subqueries.
3. Scalar Subquery De-correlation:
  - Original query (correlated subquery): SELECT id, name, (SELECT COUNT(*) FROM orders o WHERE o.cid = c.id) AS order_count FROM customers c
  - Rewriting idea: Transform it into an outer join and a grouped query.
  - Rewritten: SELECT c.id, c.name, COALESCE(o_cnt.count, 0) AS order_count FROM customers c LEFT JOIN (SELECT cid, COUNT(*) AS count FROM orders GROUP BY cid) o_cnt ON c.id = o_cnt.cid
  - Effect: Avoids executing the inner query once for each row of the outer query (Nested Loops), instead using more efficient grouping and joins.
Effect: Dramatically improves subquery performance, representing a key area in query optimization.

Summary
Query rewriting is an optimization performed by the database optimizer at the logical plan stage, before generating a physical execution plan. It relies on a solid foundation in relational algebra and knowledge of database constraints. By applying a series of equivalence transformation rules (such as simplification, pushdown, expansion, conversion, etc.), it transforms the original SQL statement into a semantically identical but more efficient form. This process is typically automatic and rule-based, laying a solid foundation for subsequent cost-based optimization. Understanding these techniques helps DBAs and developers write more optimized SQL statements that are better "understood" by the optimizer.