Working Principles of Database Query Optimizers and Execution Plan Analysis

Working Principles of Database Query Optimizers and Execution Plan Analysis

Problem Description
The database query optimizer is a core component of relational databases, responsible for converting user-submitted SQL statements into efficient execution plans. Interviewers typically examine: 1. The basic working principles of the optimizer; 2. Methods for interpreting execution plans; 3. The practical application of common optimization strategies. You need to master key concepts such as cost estimation, index selection, and join algorithms.

Knowledge Explanation

  1. Core Tasks of the Optimizer

    • Input: Syntax tree after SQL parsing
    • Output: The lowest-cost execution plan (a collection of query execution steps)
    • Core challenge: Selecting the optimal plan from a vast number of possible execution plans within a limited time
  2. Step-by-Step Analysis of the Optimization Process
    Step 1: Logical Optimization (Rule-Based)

    • Rewrite the query statement, preserving semantics but altering the execution structure
    • Typical operations:
      • Predicate Pushdown: Push filter conditions as close to the data source as possible
        -- Original query: Join first, then filter
        SELECT * FROM A JOIN B ON A.id=B.id WHERE A.name='test';
        -- After optimization: Filter table A first, then join
        
      • Constant Propagation: Simplify expressions using equality conditions
      • Redundant Computation Elimination: Merge duplicate subqueries

    Step 2: Physical Optimization (Cost-Based)

    • Select a physical implementation algorithm for each logical operation:
      • Table Scan Methods: Full table scan vs. Index scan
        • The optimizer estimates costs through data statistics (e.g., cardinality, histograms)
        • Example: WHERE age>30. If 30% of data satisfies, index scan might be better; if 80% satisfies, full table scan is more efficient.
      • Join Algorithm Selection:
        • Nested Loop Join: Suitable for small tables driving large tables, with indexes on the inner table.
        • Hash Join: Suitable for large table equality joins without indexes.
        • Sort-Merge Join: Suitable for already sorted data or when ordered output is required.

    Step 3: Execution Plan Generation

    • Search for the optimal plan tree using dynamic programming or greedy algorithms.
    • Cost model calculation: I/O cost (disk access) + CPU cost (computation processing).
  3. Practical Interpretation of Execution Plans
    Taking MySQL's EXPLAIN output as an example:

    EXPLAIN SELECT o.order_id, c.name 
    FROM orders o JOIN customers c ON o.cust_id=c.id 
    WHERE o.amount > 100;
    
    • type column: Access type (key to performance)
      • const/system: Constant-level optimization
      • ref/range: Index range scan
      • ALL: Full table scan (requires caution)
    • key column: The index actually used
    • rows column: Estimated number of rows to scan (compare with actual row count to judge accuracy)
    • Extra column: Additional information (e.g., "Using where" indicates server-level filtering)
  4. Classic Optimization Case
    Scenario: Multi-table Join Order Selection

    • Rule: Using small tables as the driving table can reduce intermediate result sets.
    • The optimizer needs to calculate:
      • Table size (obtained via STATISTICS table)
      • Selectivity of join conditions (cardinality estimation)
    • Example: A table (1000 rows) JOIN B table (100,000 rows). Prioritizing table A as the driver reduces memory usage.
  5. Advanced Optimization Strategies

    • Adaptive Optimization: e.g., MySQL 8.0 using hash joins instead of nested loops.
    • Parameter Tuning: Adjusting cost_model parameters to control algorithm preference.
    • Hint Statements: Using /*+ INDEX(...) */ to forcibly specify indexes (use with caution).

Summary
The essence of the optimizer is to trade off between accuracy and efficiency. It is necessary to identify bottlenecks by analyzing execution plans and ensure the rationality of decisions by combining statistical information updates (e.g., ANALYZE TABLE). In practical optimization, system-level factors such as lock contention and cache hit rate must also be considered.