Principles of Statistics and Cost Estimation in Database Query Optimization
Problem Description
During the database query optimization process, the optimizer needs to select the optimal plan from multiple possible execution plans. This selection relies on Statistics and Cost Estimation. This article details how statistics are collected and stored, how the optimizer uses statistics to estimate query costs, and the principles of common estimation models (such as selectivity and cardinality estimation).
1. The Core Role of Statistics
Statistics are summaries of data distribution within database tables, such as:
- Table-level information: Number of rows, number of data pages;
- Column-level information: Number of distinct values (NDV), proportion of null values, data distribution (histogram);
- Index information: Index levels, number of leaf nodes, etc.
The optimizer uses statistics to estimate the "cost" of a query (e.g., CPU, I/O, memory overhead), thereby comparing the efficiency of different execution plans.
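PostgreSQL, for instance, exposes these summaries through its system catalogs. The following sketch (table and column names are illustrative) shows where the table-level and column-level numbers live:

```sql
-- Table-level statistics: estimated row count and data-page count
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'users';

-- Column-level statistics: NDV, null fraction, and histogram for one column
SELECT attname, n_distinct, null_frac, histogram_bounds
FROM pg_stats
WHERE tablename = 'users' AND attname = 'age';
```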
2. Collection and Storage of Statistics
(1) Collection Methods
- Automatic Collection: Triggered periodically by the database itself (e.g., InnoDB's automatic statistics recalculation in MySQL, PostgreSQL's `autovacuum`/autoanalyze);
- Manual Commands: Such as `ANALYZE TABLE` (MySQL) or `DBMS_STATS.GATHER_TABLE_STATS` (Oracle).
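For reference, the manual commands look roughly like this in each system (schema and table names are placeholders):

```sql
-- MySQL: recompute statistics for a table
ANALYZE TABLE users;

-- PostgreSQL: collect statistics for a table
ANALYZE users;

-- Oracle (SQL*Plus): gather table statistics via the DBMS_STATS package
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP', tabname => 'USERS');
```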
(2) Key Statistical Metrics
- Cardinality: The number of distinct values (NDV) in a column. For example, the NDV for a gender column is typically 2;
- Histogram: Describes data distribution, especially crucial for non-uniform data (e.g., age, income):
- Equal-width histogram: Divides data into buckets by value range, recording the frequency of each bucket;
- Equal-height histogram: Divides data into buckets by frequency, so each bucket contains approximately the same number of rows.
- Multi-column Correlation: Captures dependencies between columns, which is needed to estimate the selectivity of multi-column conditions (e.g., `WHERE city='Beijing' AND salary>10000`).
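The two histogram types can be illustrated directly in SQL. A sketch in PostgreSQL syntax, reusing the `employees`/`salary` names from the example in Section 4:

```sql
-- Equal-width: split the value range [3000, 80000] into 5 equal intervals
-- and count how many rows land in each (bucket sizes will vary)
SELECT width_bucket(salary, 3000, 80000, 5) AS bucket,
       count(*) AS rows_in_bucket
FROM employees
GROUP BY bucket
ORDER BY bucket;

-- Equal-height: split the rows into 5 equally sized groups and report the
-- value range each group covers (boundaries adapt to the data)
SELECT bucket, min(salary) AS lo, max(salary) AS hi, count(*) AS rows_in_bucket
FROM (SELECT salary, ntile(5) OVER (ORDER BY salary) AS bucket
      FROM employees) t
GROUP BY bucket
ORDER BY bucket;
```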
3. Basic Principles of Cost Estimation
(1) Selectivity
Selectivity refers to the proportion of rows remaining after a condition is applied, relative to the total number of rows. For example:
- `WHERE id=10`: If `id` is unique, selectivity ≈ 1/NDV;
- `WHERE age>30`: Determined from the histogram as the proportion of values above 30;
- Multi-condition selectivity: Assuming the conditions are independent, joint selectivity = product of the individual selectivities (actual values may deviate due to column correlations).
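Estimated selectivities can be sanity-checked by measuring the real proportions. A sketch in PostgreSQL syntax (table and column names are illustrative):

```sql
-- Measured selectivity of a single predicate: matching rows / total rows
SELECT count(*) FILTER (WHERE age > 30)::numeric / count(*) AS selectivity
FROM users;

-- Measured joint selectivity, for comparison with the independence assumption
SELECT count(*) FILTER (WHERE city = 'Beijing' AND salary > 10000)::numeric
       / count(*) AS joint_selectivity
FROM users;
```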
(2) Cardinality Estimation
Estimated number of rows in query result = Total table rows × Selectivity. For example:
- Table `users` has 10,000 rows and column `status` has NDV = 4 with values uniformly distributed, so the estimated cardinality for `WHERE status=1` ≈ 10,000 × 1/4 = 2,500 rows.
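The same arithmetic can be reproduced from the catalog statistics. A sketch in PostgreSQL syntax, assuming the `users`/`status` names above:

```sql
-- Estimated rows for WHERE status = 1: total rows × 1/NDV.
-- Caveat: pg_stats may store a large NDV as a negative fraction of the row
-- count; this sketch assumes the simple positive form.
SELECT c.reltuples / s.n_distinct AS estimated_rows
FROM pg_class c
JOIN pg_stats s ON s.tablename = c.relname
WHERE c.relname = 'users' AND s.attname = 'status';
```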
(3) Cost Model
The optimizer converts operations (e.g., index scan, full table scan) into cost units:
- I/O Cost: The cost of reading data pages (related to the number of pages);
- CPU Cost: The cost of processing rows (e.g., comparisons, sorting).
Total Cost = I/O Cost + CPU Cost.
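PostgreSQL's sequential-scan cost has exactly this shape. A simplified reconstruction (1.0 and 0.01 are the default values of `seq_page_cost` and `cpu_tuple_cost`; per-predicate CPU costs are omitted):

```sql
-- Full table scan cost ≈ pages × seq_page_cost + rows × cpu_tuple_cost
SELECT relname,
       relpages * 1.0   AS io_cost,
       reltuples * 0.01 AS cpu_cost,
       relpages * 1.0 + reltuples * 0.01 AS total_cost
FROM pg_class
WHERE relname = 'users';
```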
4. Practical Example Analysis
Assume the table `employees` has 10,000 rows, with the following statistics:
- Column `department` has NDV = 10, and its histogram shows a uniform distribution;
- Column `salary` ranges from a minimum of 3,000 to a maximum of 80,000, with its histogram divided into 5 buckets.
Query:

```sql
SELECT * FROM employees
WHERE department = 'HR' AND salary > 50000;
```
Step 1: Calculate Single-Condition Selectivity
- `department='HR'`: Selectivity ≈ 1/10 = 0.1;
- `salary>50000`: The histogram shows that 20% of the data is above 50,000, so selectivity = 0.2.
Step 2: Joint Selectivity
Assuming conditions are independent, joint selectivity ≈ 0.1 × 0.2 = 0.02.
Step 3: Cardinality Estimation
Estimated row count = 10,000 × 0.02 = 200 rows.
Step 4: Compare Execution Plan Costs
- Full Table Scan: Cost ≈ I/O for reading 10,000 rows + CPU for filtering 10,000 rows;
- Index Scan (assuming a composite index on `(department, salary)`): Cost ≈ I/O for the index range scan + I/O for the table lookups of the ~200 matching rows.
The optimizer will select the plan with the lower cost.
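Whether the optimizer's choice and its estimate of roughly 200 rows hold up can be checked with `EXPLAIN` (PostgreSQL syntax shown; MySQL and Oracle have equivalents):

```sql
-- Show the chosen plan with its estimated cost and row count
EXPLAIN SELECT * FROM employees
WHERE department = 'HR' AND salary > 50000;

-- EXPLAIN ANALYZE also executes the query and reports actual row counts,
-- making estimation errors visible
EXPLAIN ANALYZE SELECT * FROM employees
WHERE department = 'HR' AND salary > 50000;
```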
5. Common Issues and Optimization of Statistics
- Outdated Statistics: Statistics are not refreshed promptly after frequent data updates, leading to estimation deviations (e.g., actual row count is much higher than estimated);
- Data Correlation Bias: For example, `country='China' AND city='Beijing'` are strongly correlated, so the independence assumption underestimates the joint selectivity;
- Solutions:
- Regularly update statistics (especially after large-scale data changes);
- Use dynamic sampling (e.g., Oracle's dynamic statistics) to supplement statistics in real-time;
- Create extended statistics (e.g., Oracle's column groups) to capture multi-column correlations.
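As an example of the last point, PostgreSQL offers extended statistics for the same purpose as Oracle's column groups. A sketch with illustrative names:

```sql
-- Capture the functional dependency between country and city so the
-- optimizer stops multiplying their selectivities as if independent
CREATE STATISTICS addr_country_city (dependencies) ON country, city
FROM addresses;
ANALYZE addresses;  -- extended statistics are populated on the next ANALYZE
```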
Summary
Statistics and cost estimation are the "eyes" of the query optimizer. Their accuracy directly determines the quality of execution plans. Understanding principles such as histograms, selectivity, and cardinality estimation helps improve query performance by optimizing statistics collection strategies or using hints to influence optimizer decisions.