Database Statistics and Query Optimization
Topic Description: Database statistics are the key basis on which the query optimizer performs cost estimation and generates efficient execution plans. They are systematically collected and stored metrics about data distribution, data volume, and other characteristics of the data in the database, and they help the optimizer predict the cost of different query operations. This topic explains in detail the core content of statistics, how they are collected and maintained, and how they are applied in query optimization.
1. The Importance and Core Content of Statistics
- Why are Statistics Needed?
  The goal of the query optimizer is to find the execution plan with the lowest "cost," which primarily means I/O, CPU, and memory consumption. To estimate the cost of different operations (such as full table scans, index scans, and joins), the optimizer must know the characteristics of the data. Without accurate statistics, the optimizer is essentially guessing blindly and may choose inefficient plans.
- What Core Content Do Statistics Include?
  - Table-Level Statistics:
    - Row Count (Cardinality): The total number of records in the table.
    - Page Count: The number of disk pages occupied by the table data.
  - Column-Level Statistics:
    - Number of Distinct Values (NDV): How many unique values the column contains. This is crucial for judging selectivity.
    - Number of NULLs: The count of records where the column is NULL.
    - Data Histogram: The most important tool for describing data distribution. It divides the range of column values into a series of "buckets" and records the value range and frequency within each bucket.
      - Example: For an age column with values from 0 to 100, a histogram might divide it into 10 buckets (0-10, 11-20, ...). From the histogram the optimizer can read off the actual proportion of records with "age BETWEEN 25 AND 35", instead of assuming a uniform distribution and guessing (35 - 25) / (100 - 0) = 10%, an assumption that leads to large errors when the data is skewed.
    - Most Common Values (MCV) and Their Frequencies: For columns with severe data skew (e.g., a status column where 90% of values are "active"), directly recording the most frequently occurring values and their frequencies is more precise than a histogram alone (see the example query after this list).
  - Index Statistics:
    - Index Leaf Page Count.
    - Index Height (B-Tree height).
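To make these items concrete, PostgreSQL exposes most of the column-level statistics described above through its pg_stats view. A minimal sketch, assuming a hypothetical employees table that has already been analyzed:

```sql
-- NDV, NULL fraction, MCV list, and histogram bucket boundaries
-- for the salary column of a hypothetical employees table
SELECT attname,
       n_distinct,          -- NDV (negative values mean a fraction of the row count)
       null_frac,           -- fraction of rows where the column is NULL
       most_common_vals,    -- most common values (MCV list)
       most_common_freqs,   -- frequency of each MCV
       histogram_bounds     -- bucket boundaries of the equi-depth histogram
FROM pg_stats
WHERE tablename = 'employees'
  AND attname   = 'salary';
```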
2. Collection and Maintenance of Statistics
- How to Collect Statistics?
  Collection is typically triggered by specific commands or scheduled tasks, for example (see the command sketch after this list):
  - Manual Collection: Execute commands such as ANALYZE TABLE table_name (MySQL), EXEC sp_updatestats (SQL Server), DBMS_STATS.GATHER_TABLE_STATS (Oracle), or ANALYZE (PostgreSQL).
  - Automatic Collection: Most database systems provide background jobs that update statistics automatically, typically during off-peak hours (e.g., at night) or after a threshold of data changes.
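As a rough sketch, the manual-collection commands look like this in several common systems (schema and table names are illustrative):

```sql
-- MySQL: refresh the table's statistics
ANALYZE TABLE employees;

-- PostgreSQL: collect statistics for one table (omit the name to analyze the whole database)
ANALYZE employees;

-- SQL Server: update statistics for one table, or for all tables in the database
UPDATE STATISTICS dbo.employees;
EXEC sp_updatestats;

-- Oracle: gather statistics via the DBMS_STATS package
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'HR', tabname => 'EMPLOYEES');
```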
- When Should Statistics Be Updated?
  Outdated statistics can cause the optimizer to make incorrect cost estimates when the data distribution changes significantly. Key scenarios include:
  - After Significant Data Modification: Such as after performing large batches of INSERT, UPDATE, or DELETE operations.
  - After Data Loading: Following an ETL process that loads a large amount of new data (see the sketch after this list).
  - After Index Creation or Rebuilding.
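For example, a minimal load-then-analyze sketch in PostgreSQL syntax (the file path and table name are hypothetical):

```sql
-- Bulk-load a large batch of new rows from a hypothetical staging file
COPY employees FROM '/tmp/employees_batch.csv' WITH (FORMAT csv, HEADER true);

-- Refresh statistics immediately so the optimizer sees the new data distribution
ANALYZE employees;
```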
- Sampling Strategy for Statistics:
  Performing a full table scan on very large tables to compute precise statistics can be costly. Therefore, databases typically support sampling: only a portion of the data (e.g., 10% of the data blocks) is analyzed to estimate the overall data distribution, striking a balance between accuracy and performance.
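How the sample size is controlled is vendor-specific; a hedged sketch of a few common knobs (the values shown are arbitrary examples, not recommendations):

```sql
-- Oracle: sample roughly 10% of the rows instead of scanning the whole table
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'HR', tabname => 'EMPLOYEES', estimate_percent => 10);

-- PostgreSQL: raise the per-column statistics target (larger sample, more histogram buckets)
ALTER TABLE employees ALTER COLUMN salary SET STATISTICS 500;
ANALYZE employees;

-- MySQL (InnoDB): number of index pages sampled when persistent statistics are recalculated
SET GLOBAL innodb_stats_persistent_sample_pages = 64;
```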
3. Workflow of Statistics in Query Optimization
Let's see how the optimizer uses statistics through a concrete example.
Scenario: Query employees with department ID (dept_id) = 10 and salary > 5000.
```sql
SELECT * FROM employees WHERE dept_id = 10 AND salary > 5000;
```
Assume there are indexes on both dept_id and salary.
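For reference, a minimal sketch of the schema this scenario assumes (table, column, and index names are illustrative):

```sql
-- Hypothetical table used throughout the example
CREATE TABLE employees (
    emp_id   INT PRIMARY KEY,
    dept_id  INT,
    salary   NUMERIC(10, 2),
    name     VARCHAR(100)
);

-- The two single-column indexes the optimizer can choose between
CREATE INDEX idx_employees_dept_id ON employees (dept_id);
CREATE INDEX idx_employees_salary  ON employees (salary);
```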
Optimizer's Decision Process:
- Estimate the Selectivity of Each Condition:
  - dept_id = 10:
    - The optimizer consults the statistics for the dept_id column. Assume the table has 10,000 rows and the NDV of dept_id is 10.
    - Base selectivity = 1 / NDV = 1/10 = 0.1. This means approximately 10,000 * 0.1 = 1,000 rows satisfy dept_id = 10.
    - If an MCV list exists, the optimizer looks up the frequency of dept_id = 10 directly, which can be more accurate.
  - salary > 5000:
    - The optimizer consults the histogram for the salary column.
    - It finds which bucket the value 5000 falls into. Suppose 5000 falls into the third bucket, which covers the range 4000-6000 and contains 20% of the data.
    - The optimizer estimates the proportion of rows with salary > 5000 inside that bucket as (6000 - 5000) / (6000 - 4000) * 20% = 10%. A real optimizer would also add the fraction of data in the buckets above 6000; for simplicity, assume those buckets are empty here. (This is a simplified model; actual algorithms are more complex.)
    - Therefore, the estimated number of rows satisfying this condition is 10,000 * 10% = 1,000 rows.
- Estimate the Combined Condition Selectivity:
  - The optimizer assumes the two conditions are statistically independent (another assumption that can be inaccurate).
  - Combined selectivity = 0.1 * 0.1 = 0.01.
  - Final estimated result set size = 10,000 * 0.01 = 100 rows.
- Compare the Costs of Different Execution Plans:
  - Plan A: Use the dept_id index
    - Cost ≈ cost of finding the ~1,000 dept_id = 10 entries via the index + cost of 1,000 table lookups to retrieve the full rows + cost of applying the salary > 5000 filter in memory to those 1,000 rows.
  - Plan B: Use the salary index
    - Cost ≈ cost of finding the ~1,000 salary > 5000 entries via the index + cost of 1,000 table lookups + cost of filtering for dept_id = 10.
  - Plan C: Full table scan
    - Cost ≈ I/O cost of sequentially reading all data pages of the table.
  - Using statistics such as the table's page count and the index height, the optimizer computes a numerical cost for each plan and picks the cheapest. If the table is small, a full table scan might have the lowest cost; if rows with dept_id = 10 are very few, Plan A might win. The optimizer's row estimates and chosen plan can be inspected with EXPLAIN, as shown below.
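To see which plan the optimizer actually chose and how many rows it estimated, inspect the execution plan. A sketch in PostgreSQL syntax (the same idea exists in other systems):

```sql
-- Show the chosen plan and the optimizer's estimated row counts
EXPLAIN
SELECT * FROM employees WHERE dept_id = 10 AND salary > 5000;

-- Additionally execute the query and compare estimated vs. actual row counts;
-- a large gap is a strong hint that statistics are stale or too coarse
EXPLAIN ANALYZE
SELECT * FROM employees WHERE dept_id = 10 AND salary > 5000;
```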
4. Common Issues and Best Practices
- Symptoms of Inaccurate Statistics:
  - Unstable query performance: the same query is sometimes fast and sometimes slow.
  - Obviously unreasonable execution plans (e.g., a query on a large table that should use an index chooses a full table scan instead, or vice versa).
- Best Practices:
  - Regular Updates: Ensure statistics are refreshed after significant data changes; most systems record when a table was last analyzed, which helps spot stale statistics (see the sketch below).
  - Choose an Appropriate Sampling Rate: For large tables, use a sampling rate high enough to ensure accuracy without excessively impacting performance.
  - Understand Data Characteristics: For columns with severe data skew, make sure the database collects finer-grained statistics such as MCV lists.
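For example, PostgreSQL tracks when each table was last analyzed and how many rows have been modified since, which helps decide whether a manual ANALYZE is due. A sketch (column availability may vary by version):

```sql
-- Tables with the most un-analyzed churn first
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze   -- rows inserted/updated/deleted since the last analyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC;
```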
Through the above steps, we can see that statistics are the "eyes" that let the query optimizer make informed decisions. Keeping statistics accurate and up to date is one of the most fundamental and important aspects of database performance tuning.