Database Statistics and Query Optimization
Topic Description: Database statistics are the key basis on which the query optimizer performs cost estimation and generates efficient execution plans. They are systematically collected and stored metrics about data distribution, data volume, and other characteristics of the data in the database, and they help the optimizer predict the cost of different query operations. This topic explains in detail the core content of statistics, how they are collected and maintained, and how they are applied in query optimization.
1. The Importance and Core Content of Statistics
- Why are Statistics Needed?
  The goal of the query optimizer is to find the execution plan with the lowest "cost," which primarily means I/O, CPU, and memory consumption. To estimate the cost of different operations (such as full table scans, index scans, and joins), the optimizer must know the characteristics of the data. Without accurate statistics, the optimizer is essentially guessing blindly and may choose inefficient plans.
- What Core Content Do Statistics Include?
  - Table-Level Statistics:
    - Row Count (Cardinality): The total number of records in the table.
    - Page Count: The number of disk pages occupied by the table data.
  - Column-Level Statistics:
    - Number of Distinct Values (NDV): How many unique values the column contains. This is crucial for judging selectivity.
    - Number of NULLs: The count of records where the column is NULL.
    - Data Histogram: The most important tool for describing data distribution. It divides the range of column values into a series of "buckets" and records the value range and frequency within each bucket.
      - Example: For an age column with values from 0 to 100, a histogram might divide it into 10 buckets (0-10, 11-20, ...). From the histogram the optimizer can read off the actual proportion of records with "age BETWEEN 25 AND 35", instead of assuming a uniform distribution and guessing (35 - 25) / (100 - 0) = 10%, an assumption that leads to large errors when the data is skewed.
    - Most Common Values (MCV) and Their Frequencies: For columns with severe data skew (e.g., a status column where 90% of values are "active"), directly recording the most frequently occurring values and their frequencies is more precise than a histogram alone (see the example query after this list).
  - Index Statistics:
    - Index Leaf Page Count.
    - Index Height (B-Tree height).
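To make these items concrete, PostgreSQL exposes most of the column-level statistics described above through its pg_stats view. A minimal sketch, assuming a hypothetical employees table that has already been analyzed:

```sql
-- NDV, NULL fraction, MCV list, and histogram bucket boundaries
-- for the salary column of a hypothetical employees table
SELECT attname,
       n_distinct,          -- NDV (negative values mean a fraction of the row count)
       null_frac,           -- fraction of rows where the column is NULL
       most_common_vals,    -- most common values (MCV list)
       most_common_freqs,   -- frequency of each MCV
       histogram_bounds     -- bucket boundaries of the equi-depth histogram
FROM pg_stats
WHERE tablename = 'employees'
  AND attname   = 'salary';
```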
2. Collection and Maintenance of Statistics
- How to Collect Statistics?
  Collection is typically triggered by specific commands or scheduled tasks, for example (see the command sketch after this list):
  - Manual Collection: Execute commands such as ANALYZE TABLE table_name (MySQL), EXEC sp_updatestats (SQL Server), DBMS_STATS.GATHER_TABLE_STATS (Oracle), or ANALYZE (PostgreSQL).
  - Automatic Collection: Most database systems provide background jobs that update statistics automatically, typically during off-peak hours (e.g., at night) or after a threshold of data changes.
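As a rough sketch, the manual-collection commands look like this in several common systems (schema and table names are illustrative):

```sql
-- MySQL: refresh the table's statistics
ANALYZE TABLE employees;

-- PostgreSQL: collect statistics for one table (omit the name to analyze the whole database)
ANALYZE employees;

-- SQL Server: update statistics for one table, or for all tables in the database
UPDATE STATISTICS dbo.employees;
EXEC sp_updatestats;

-- Oracle: gather statistics via the DBMS_STATS package
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'HR', tabname => 'EMPLOYEES');
```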
- When Should Statistics Be Updated?
  Outdated statistics can cause the optimizer to make incorrect cost estimates when the data distribution changes significantly. Key scenarios include:
  - After Significant Data Modification: Such as after performing large batches of INSERT, UPDATE, or DELETE operations.
  - After Data Loading: Following an ETL process that loads a large amount of new data (see the sketch after this list).
  - After Index Creation or Rebuilding.
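For example, a minimal load-then-analyze sketch in PostgreSQL syntax (the file path and table name are hypothetical):

```sql
-- Bulk-load a large batch of new rows from a hypothetical staging file
COPY employees FROM '/tmp/employees_batch.csv' WITH (FORMAT csv, HEADER true);

-- Refresh statistics immediately so the optimizer sees the new data distribution
ANALYZE employees;
```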
- Sampling Strategy for Statistics:
  Performing a full table scan on very large tables to compute precise statistics can be costly. Therefore, databases typically support sampling: only a portion of the data (e.g., 10% of the data blocks) is analyzed to estimate the overall data distribution, striking a balance between accuracy and performance.
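How the sample size is controlled is vendor-specific; a hedged sketch of a few common knobs (the values shown are arbitrary examples, not recommendations):

```sql
-- Oracle: sample roughly 10% of the rows instead of scanning the whole table
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'HR', tabname => 'EMPLOYEES', estimate_percent => 10);

-- PostgreSQL: raise the per-column statistics target (larger sample, more histogram buckets)
ALTER TABLE employees ALTER COLUMN salary SET STATISTICS 500;
ANALYZE employees;

-- MySQL (InnoDB): number of index pages sampled when persistent statistics are recalculated
SET GLOBAL innodb_stats_persistent_sample_pages = 64;
```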
3. Workflow of Statistics in Query Optimization
Let's see how the optimizer uses statistics through a concrete example.
Scenario: Query employees with department ID (dept_id) = 10 and salary > 5000.
```sql
SELECT * FROM employees WHERE dept_id = 10 AND salary > 5000;
```
Assume there are indexes on both dept_id and salary.
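For reference, a minimal sketch of the schema this scenario assumes (table, column, and index names are illustrative):

```sql
-- Hypothetical table used throughout the example
CREATE TABLE employees (
    emp_id   INT PRIMARY KEY,
    dept_id  INT,
    salary   NUMERIC(10, 2),
    name     VARCHAR(100)
);

-- The two single-column indexes the optimizer can choose between
CREATE INDEX idx_employees_dept_id ON employees (dept_id);
CREATE INDEX idx_employees_salary  ON employees (salary);
```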
Optimizer's Decision Process:
- Estimate the Selectivity of Each Condition:
  - dept_id = 10:
    - The optimizer consults the statistics for the dept_id column. Assume the table has 10,000 rows and the NDV of dept_id is 10.
    - Base selectivity = 1 / NDV = 1/10 = 0.1. This means approximately 10,000 * 0.1 = 1,000 rows satisfy dept_id = 10.
    - If an MCV list exists, the optimizer looks up the frequency of dept_id = 10 directly, which can be more accurate.
  - salary > 5000:
    - The optimizer consults the histogram for the salary column.
    - It finds which bucket the value 5000 falls into. Suppose 5000 falls into the third bucket, which covers the range 4000-6000 and contains 20% of the data.
    - The optimizer estimates the proportion of rows with salary > 5000 inside that bucket as (6000 - 5000) / (6000 - 4000) * 20% = 10%. A real optimizer would also add the fraction of data in the buckets above 6000; for simplicity, assume those buckets are empty here. (This is a simplified model; actual algorithms are more complex.)
    - Therefore, the estimated number of rows satisfying this condition is 10,000 * 10% = 1,000 rows.
- Estimate the Combined Condition Selectivity:
  - The optimizer assumes the two conditions are statistically independent (another assumption that can be inaccurate).
  - Combined selectivity = 0.1 * 0.1 = 0.01.
  - Final estimated result set size = 10,000 * 0.01 = 100 rows.
- Compare the Costs of Different Execution Plans:
  - Plan A: Use the dept_id index
    - Cost ≈ cost of finding the ~1,000 dept_id = 10 entries via the index + cost of 1,000 table lookups to retrieve the full rows + cost of applying the salary > 5000 filter in memory to those 1,000 rows.
  - Plan B: Use the salary index
    - Cost ≈ cost of finding the ~1,000 salary > 5000 entries via the index + cost of 1,000 table lookups + cost of filtering for dept_id = 10.
  - Plan C: Full table scan
    - Cost ≈ I/O cost of sequentially reading all data pages of the table.
  - Using statistics such as the table's page count and the index height, the optimizer computes a numerical cost for each plan and picks the cheapest. If the table is small, a full table scan might have the lowest cost; if rows with dept_id = 10 are very few, Plan A might win. The optimizer's row estimates and chosen plan can be inspected with EXPLAIN, as shown below.
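To see which plan the optimizer actually chose and how many rows it estimated, inspect the execution plan. A sketch in PostgreSQL syntax (the same idea exists in other systems):

```sql
-- Show the chosen plan and the optimizer's estimated row counts
EXPLAIN
SELECT * FROM employees WHERE dept_id = 10 AND salary > 5000;

-- Additionally execute the query and compare estimated vs. actual row counts;
-- a large gap is a strong hint that statistics are stale or too coarse
EXPLAIN ANALYZE
SELECT * FROM employees WHERE dept_id = 10 AND salary > 5000;
```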
4. Common Issues and Best Practices
- Symptoms of Inaccurate Statistics:
  - Unstable query performance: the same query is sometimes fast and sometimes slow.
  - Obviously unreasonable execution plans (e.g., a query on a large table that should use an index chooses a full table scan instead, or vice versa).
- Best Practices:
  - Regular Updates: Ensure statistics are refreshed after significant data changes; most systems record when a table was last analyzed, which helps spot stale statistics (see the sketch below).
  - Choose an Appropriate Sampling Rate: For large tables, use a sampling rate high enough to ensure accuracy without excessively impacting performance.
  - Understand Data Characteristics: For columns with severe data skew, make sure the database collects finer-grained statistics such as MCV lists.
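For example, PostgreSQL tracks when each table was last analyzed and how many rows have been modified since, which helps decide whether a manual ANALYZE is due. A sketch (column availability may vary by version):

```sql
-- Tables with the most un-analyzed churn first
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze   -- rows inserted/updated/deleted since the last analyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC;
```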
Through the above steps, we can see that statistics are the "eyes" that let the query optimizer make informed decisions. Keeping statistics accurate and up to date is one of the most fundamental and important aspects of database performance tuning.