Database Statistics and Query Optimization
Description
Statistics are the foundation of the database optimizer: they are used to estimate query cost and select efficient execution plans. They include table row counts, column cardinality (number of distinct values), data-distribution histograms, and more. If statistics are inaccurate, the optimizer may choose a full table scan where an index scan would be cheaper, degrading performance.
Problem-Solving Process
1. Types and Functions of Statistics
- Table Statistics:
  - Row Count (`n_rows`): Affects join order and access path selection.
  - Number of Data Blocks: Used to estimate I/O cost.
- Column Statistics:
  - Cardinality (`n_distinct`): The number of distinct values, used to estimate selectivity.
  - NULL Ratio: Affects the accuracy of conditional filtering.
  - Histogram: Describes the data distribution, addressing data skew (e.g., an age column concentrated between 20 and 30).
- Index Statistics:
  - Index Levels and Leaf Block Count: Determine index scan cost.
Example: For `WHERE age > 30`, if the histogram shows that rows with age over 30 account for only 10% of the table, the optimizer is likely to choose an index scan.
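To see what the optimizer is working from, here is a minimal sketch of inspecting these statistics in MySQL/InnoDB (assuming persistent statistics are enabled and a `users` table exists; other databases expose similar views, e.g., PostgreSQL's `pg_stats`):

```sql
-- Table-level statistics: row count and clustered index size in pages
SELECT table_name, n_rows, clustered_index_size
FROM mysql.innodb_table_stats
WHERE table_name = 'users';

-- Per-index statistics; the n_diff_pfxNN entries hold the estimated
-- number of distinct values (cardinality) for each key prefix
SELECT index_name, stat_name, stat_value, stat_description
FROM mysql.innodb_index_stats
WHERE table_name = 'users';
```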
2. Methods for Collecting Statistics
- Automatic Collection: The database triggers collection periodically (e.g., Oracle's `gather_stats_job`, MySQL's `innodb_stats_auto_recalc`).
- Manual Collection: Update statistics explicitly with a command, for example:
  ```sql
  -- MySQL
  ANALYZE TABLE users;
  -- PostgreSQL
  ANALYZE users;
  -- Oracle
  EXEC DBMS_STATS.GATHER_TABLE_STATS('SCHEMA','USERS');
  ```
- Key Parameters:
  - Sampling Ratio: A higher ratio is more accurate but consumes more resources (e.g., `ESTIMATE_PERCENT=>100`).
  - Histogram Bucket Count: More buckets capture the distribution in finer detail.
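Building on the parameters above, here is a sketch of explicitly creating a histogram and raising the sampling target (the column name and bucket/target values are illustrative):

```sql
-- MySQL 8.0+: build an equi-height histogram on a skewed column
ANALYZE TABLE users UPDATE HISTOGRAM ON age WITH 32 BUCKETS;

-- PostgreSQL: raise the per-column statistics target, then re-analyze
ALTER TABLE users ALTER COLUMN age SET STATISTICS 500;
ANALYZE users;
```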
3. How Statistics Affect Query Optimization
The optimizer uses statistics to estimate Selectivity, i.e., the proportion of rows remaining after a filter condition is applied:
- Equality Query: `WHERE id = 100` → Selectivity ≈ 1 / `n_distinct` (assuming a uniform data distribution).
- Range Query: `WHERE age > 30` → the proportion is read from the histogram; if rows above 30 account for 20%, Selectivity = 0.2.
- Combined Conditions: Under the independence assumption, Selectivity = Selectivity1 × Selectivity2 (a worked example follows below).
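A quick worked pass with assumed numbers: on a 1,000,000-row table where `id` has `n_distinct` = 50,000, `WHERE id = 100` has selectivity 1/50,000 = 0.00002, i.e., about 20 matching rows. Combining `age > 30` (selectivity 0.2) with a second, independent predicate of selectivity 0.1 gives 0.2 × 0.1 = 0.02, or roughly 20,000 rows.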
Cost Calculation Example:
- Full Table Scan Cost = Number of Table Blocks × Single Block I/O Cost
- Index Scan Cost = Index Level Traversal + Leaf Block Reads + Table Lookup Cost
If selectivity is low (e.g., 0.1), the index scan may be the cheaper plan.
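Plugging in illustrative numbers: a 10,000-block table costs about 10,000 single-block reads to scan fully. With selectivity 0.1 on 1,000,000 rows, an index scan might traverse 3 index levels, read ~200 leaf blocks, and then fetch ~100,000 rows; if those rows cluster into ~1,000 table blocks, the total (~1,200 reads) easily beats the full scan, but if they are scattered across 100,000 blocks it is far worse. This is why optimizers also track how well index order matches table order (e.g., Oracle's clustering factor).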
4. Common Scenarios of Inaccurate Statistics and Solutions
- Scenario 1: Statistics Not Updated After Drastic Data Changes
  - Symptom: A table grows by 1 million rows, but the statistics still report a row count of 10k, so the optimizer incorrectly chooses an index scan.
  - Solution: Update statistics periodically or in near real time (e.g., in MySQL, raise `innodb_stats_persistent_sample_pages` to sample more pages; see the sketch below).
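A minimal sketch of that MySQL tuning (the value 64 is an assumption; the default is 20):

```sql
-- Sample more pages whenever persistent statistics are recalculated
SET GLOBAL innodb_stats_persistent_sample_pages = 64;

-- Or override per table, then refresh the statistics immediately
ALTER TABLE users STATS_SAMPLE_PAGES = 64;
ANALYZE TABLE users;
```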
- Scenario 2: Histogram Invalidation Due to Data Skew
  - Symptom: 90% of order statuses are 'Completed'. `WHERE status = 'In Progress'` should use an index, but the optimizer misjudges and chooses a full table scan.
  - Solution:
    - Increase the histogram bucket count to capture the distribution more precisely.
    - Use hints to force index usage (e.g., `/*+ INDEX(table_name index_name) */`; see the sketch below).
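For example, forcing the index in two common dialects (the `orders` table and `idx_status` index are hypothetical names):

```sql
-- Oracle: optimizer hint
SELECT /*+ INDEX(o idx_status) */ *
FROM orders o
WHERE status = 'In Progress';

-- MySQL: index hint syntax
SELECT *
FROM orders FORCE INDEX (idx_status)
WHERE status = 'In Progress';
```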
- Scenario 3: Incorrect Independence Assumption for Multi-Column Correlations
  - Symptom: `WHERE country = 'China' AND city = 'Beijing'`. If city is strongly correlated with country, calculating the selectivities independently underestimates the result set.
  - Solution:
    - Create extended statistics (e.g., Oracle's column group statistics; a sketch follows below).
    - Use dynamic sampling (e.g., Oracle's `OPTIMIZER_DYNAMIC_SAMPLING`).
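A sketch of creating such multi-column statistics (table and statistics names are illustrative):

```sql
-- PostgreSQL 10+: record functional dependencies between the columns
CREATE STATISTICS users_country_city (dependencies) ON country, city FROM users;
ANALYZE users;

-- Oracle: create a column group, then regather table statistics
SELECT DBMS_STATS.CREATE_EXTENDED_STATS('SCHEMA', 'USERS', '(COUNTRY,CITY)') FROM dual;
EXEC DBMS_STATS.GATHER_TABLE_STATS('SCHEMA', 'USERS');
```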
5. Practical Recommendations
- Frequently Updated Tables: Enable automatic statistics collection and adjust sampling rate to balance efficiency and accuracy.
- Complex Queries: Examine execution plans and compare actual vs. estimated row counts (one way to do this is sketched after this list); update statistics if the deviation is large.
- Data-Skewed Columns: Prioritize creating histograms to avoid misjudgments from the uniform distribution assumption.
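A quick way to check the estimate-vs-actual gap (works in PostgreSQL and MySQL 8.0.18+; the query is illustrative):

```sql
-- Reports estimated rows alongside the rows actually returned;
-- a large gap suggests stale or missing statistics.
EXPLAIN ANALYZE
SELECT * FROM users WHERE age > 30;
```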
Through the above steps, statistics become the "eyes" of the optimizer, ensuring query plans closely align with real data characteristics, thereby improving performance.