How Database Indexes Work and Optimization Strategies

Problem Description:
Explain how database indexes (such as B+ tree indexes) improve query efficiency. Illustrate index optimization strategies and their potential issues using specific scenarios.

Step-by-step Explanation:


1. The Basic Role of Indexes

  • Problem Background: When a database table stores a large amount of data, queries without an index require a full table scan (checking each row), resulting in O(N) time complexity and low efficiency.
  • Index Definition: An index is a data structure maintained alongside the data table. It accelerates queries by keeping the values of specific fields in sorted order and mapping each value to the location of its row.
  • Analogy: An index works like a book's table of contents: you look up a chapter's page number instead of flipping through every page.
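
To make the contrast concrete, here is a minimal sketch that compares a full scan with a lookup through a sorted copy of one field. It is a toy model, not how any particular database engine implements indexes; the names rows, email_index, full_scan, and index_lookup are made up for illustration.

```python
from bisect import bisect_left

# A toy "table": each row is (id, email). All names here are hypothetical.
rows = [(i, f"user{i}@example.com") for i in range(1000)]

# A toy "index": sorted (email, row_position) pairs kept apart from the table.
email_index = sorted((email, pos) for pos, (_, email) in enumerate(rows))


def full_scan(email):
    """Check rows one by one until a match is found: O(N) comparisons."""
    for row in rows:
        if row[1] == email:
            return row
    return None


def index_lookup(email):
    """Binary-search the sorted copy, then jump to the row: O(log N) comparisons."""
    i = bisect_left(email_index, (email, -1))
    if i < len(email_index) and email_index[i][0] == email:
        return rows[email_index[i][1]]
    return None


print(full_scan("user742@example.com"))
print(index_lookup("user742@example.com"))
```

Both calls return the same row; the difference is that the indexed lookup inspects about ten entries of the sorted copy, while the scan may inspect all 1000 rows.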

2. The Core Structure: B+ Tree

The most common index structure is the B+ tree. Its design goal is to reduce disk I/O operations, because database data is usually stored on disk, where reads and writes are slow.

B+ Tree Characteristics:

  1. Multi-way Balanced Tree: Each node holds many keys (e.g., 100~1000), so the tree stays shallow: with a fan-out near the upper end of that range, 3~4 levels are enough to index on the order of a billion records.
  2. Data Stored Only in Leaf Nodes: Internal nodes store only keys and child node pointers, so each internal node fits more keys, and every lookup ends at a leaf with the same path length.
  3. Leaf Nodes Connected by a Doubly Linked List: Supports efficient range queries (e.g., WHERE id BETWEEN 100 AND 200), because a scan can walk from leaf to leaf without returning to the root.
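
The relationship between fan-out and tree height can be checked with a few lines of arithmetic. The sketch below makes a simplifying assumption that every node, leaves included, holds exactly fanout entries; real engines differ (leaf pages usually hold fewer, wider entries), so treat the numbers as rough estimates. The function name min_levels is hypothetical.

```python
def min_levels(num_records, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) covers num_records.

    Simplifying assumption: every node, including leaves, holds `fanout` entries.
    """
    levels, capacity = 1, fanout
    while capacity < num_records:
        capacity *= fanout
        levels += 1
    return levels


for fanout in (100, 500, 1000):
    print(f"fanout={fanout}: {min_levels(1_000_000_000, fanout)} levels for 1e9 records")
```

Under this assumption, one billion records need 5 levels at a fan-out of 100, 4 levels at 500, and 3 levels at 1000, which is why a large fan-out matters so much.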

Query Process Example (Assuming query id=150):

  1. Start from the root node and choose the child node according to key order (e.g., if the root node keys are [50, 100, 200], then 150 falls into the 100~200 interval).
  2. Traverse downwards level by level until reaching a leaf node.
  3. Locate id=150 within the leaf node and obtain its corresponding data row address (or the data itself if stored directly).
  4. Read the disk data via the address and return the result.

Key Advantage: A lookup among billions of records needs only about 3~4 disk I/Os (one per tree level), whereas a full table scan might require millions.
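
The lookup steps above can be sketched with a tiny in-memory tree. This is an illustrative model only: the Leaf and Internal classes, the "row@N" address strings, and the two-level layout are assumptions made for the example, and a real engine works on disk pages rather than Python objects. Only the forward sibling link is modeled here.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list                     # sorted keys stored in this leaf
    values: list                   # row address (or row data) for each key
    next: Optional["Leaf"] = None  # link to the right sibling, used by range scans


@dataclass
class Internal:
    keys: list      # separator keys
    children: list  # len(children) == len(keys) + 1


def search(node, key):
    """Walk from the root to a leaf; return (value or None, nodes visited)."""
    visited = 1
    while isinstance(node, Internal):
        # Child i holds keys k with keys[i-1] <= k < keys[i].
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    i = bisect_left(node.keys, key)
    found = i < len(node.keys) and node.keys[i] == key
    return (node.values[i] if found else None), visited


def range_scan(node, lo, hi):
    """Descend once to the leaf for `lo`, then follow sibling links up to `hi`."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, lo)]
    out = []
    while node is not None:
        for k, v in zip(node.keys, node.values):
            if k > hi:
                return out
            if k >= lo:
                out.append(v)
        node = node.next
    return out


# Build a two-level tree whose root keys match the example: [50, 100, 200].
leaves = [Leaf(ks, [f"row@{k}" for k in ks])
          for ks in ([10, 20], [50, 75], [100, 150], [200, 250])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal([50, 100, 200], leaves)

print(search(root, 150))
print(range_scan(root, 75, 200))
```

Each node visited corresponds to one page read in a real engine, so the visit count returned by search is the analogue of the 3~4 disk I/Os mentioned above.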

3. Index Optimization Strategies

(1) Index Selection Principles

  • High-Frequency Query Fields: Create indexes for fields used in WHERE, JOIN, ORDER BY clauses.
  • High Selectivity Fields: Fields with low duplication rates (e.g., ID number) are more suitable for indexing than those with high duplication (e.g., gender).
  • Composite Index Leftmost Prefix Matching:
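    • A composite index (A, B, C) is sorted by A first, then B, then C, so it is fully effective only for queries that constrain a leading prefix of its columns:
      • WHERE A=1
      • WHERE A=1 AND B=2
      • WHERE A=1 AND B=2 AND C=3
    • It generally cannot accelerate WHERE B=2 or WHERE C=3, which skip the leading column A and so violate the leftmost prefix rule. A query such as WHERE A=1 AND C=3 can use the index only for the A=1 part.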
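
One way to observe the leftmost prefix rule is to ask an engine for its query plan. The sketch below uses SQLite through Python's standard sqlite3 module purely because it runs anywhere; the table t, the index idx_abc, and the helper plan are hypothetical names, and other engines (and their optimizers) may report or behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `payload` keeps the index from covering SELECT *, so unusable predicates show a table scan.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INT, b INT, c INT, payload TEXT)")
conn.execute("CREATE INDEX idx_abc ON t (a, b, c)")


def plan(where):
    """Return SQLite's query plan text for SELECT * FROM t WHERE <where>."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN SELECT * FROM t WHERE {where}").fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan detail


for where in ("a = 1", "a = 1 AND b = 2", "a = 1 AND b = 2 AND c = 3", "b = 2", "c = 3"):
    print(f"{where:28} -> {plan(where)}")
```

In SQLite, the first three predicates are reported as a search using idx_abc, while the last two fall back to scanning the table. The exact plan wording varies between SQLite versions.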

(2) Avoiding Common Scenarios of Index Invalidation

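  • Operations or Functions on Indexed Columns: E.g., WHERE YEAR(create_time)=2023 cannot use the create_time index, because the index stores the raw column values, not the function results. Change it to a range query: WHERE create_time >= '2023-01-01' AND create_time < '2024-01-01'.
  • Wildcard at the Beginning of a LIKE Pattern: LIKE '%abc' cannot use an index because the leading characters are unknown, while LIKE 'abc%' can.
  • Implicit Type Conversion: Comparing a string field with a numeric literal (e.g., writing WHERE id=123 instead of WHERE id='123' when id is a string field) may trigger an implicit conversion on the column and bypass the index.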
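
The function-on-column case can be reproduced with a query plan comparison. This sketch again uses SQLite via Python's sqlite3 module as a stand-in engine; SQLite has no YEAR() function, so strftime('%Y', ...) plays that role here. The orders table, idx_create_time index, and explain helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, create_time TEXT, note TEXT)")
conn.execute("CREATE INDEX idx_create_time ON orders (create_time)")


def explain(sql):
    """Return SQLite's query plan text for the given statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan detail


# Wrapping the column in a function hides it from the index.
by_function = explain(
    "SELECT * FROM orders WHERE strftime('%Y', create_time) = '2023'"
)
# The equivalent half-open range compares the raw column, so the index applies.
by_range = explain(
    "SELECT * FROM orders "
    "WHERE create_time >= '2023-01-01' AND create_time < '2024-01-01'"
)

print("function:", by_function)
print("range:   ", by_range)
```

In SQLite the function form is planned as a table scan and the range form as an index search; the same rewrite is the usual fix in other engines, though their plan output looks different.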

4. Potential Issues with Indexes

  1. Slower Write Operations:
    • Each INSERT/UPDATE/DELETE requires synchronously updating the index, increasing overhead.
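    • Suggestion: On tables with frequent writes, keep only the indexes that queries actually need, weighing read speedups against write cost.
  2. Space Overhead: Indexes require additional storage space (often cited as roughly 10%~30% of the data size, though this varies widely with the number and width of the indexes).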
  3. Redundant Indexes:
    • For example, creating an (A) index when (A, B) already exists is usually redundant, because the composite index can already serve queries on A alone through its leftmost prefix.
    • Tool for checking: In MySQL, query the sys.schema_redundant_indexes view to identify redundant indexes.
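
The prefix-based notion of redundancy described above is simple enough to sketch directly. The function below is a hypothetical helper, not the logic MySQL's view uses: it only flags an index whose column list is a strict leftmost prefix of another index, and it ignores uniqueness constraints, sort direction, and exact duplicates.

```python
def find_redundant(indexes):
    """Return (redundant_index, covering_index) pairs.

    `indexes` maps an index name to its tuple of columns. An index is flagged
    when its columns are a strict leftmost prefix of another index's columns.
    """
    pairs = []
    for name, cols in indexes.items():
        for other, other_cols in indexes.items():
            is_prefix = len(cols) < len(other_cols) and other_cols[:len(cols)] == cols
            if name != other and is_prefix:
                pairs.append((name, other))
    return pairs


# Hypothetical index definitions for illustration.
indexes = {
    "idx_a": ("A",),
    "idx_a_b": ("A", "B"),
    "idx_b": ("B",),
}
print(find_redundant(indexes))
```

Here idx_a is flagged because (A, B) already starts with A, while idx_b is kept because B is not a leading column of any other index.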

5. Practical Scenario Analysis

Scenario: A users table has country, city, and age fields. Frequent queries are needed for "adult users in a specific country and city".

  • Optimization Plan:
    1. Create a composite index (country, city, age).
    2. Write the query as:
      SELECT * FROM users
      WHERE country='China' AND city='Beijing' AND age >= 18;
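    3. The equality conditions on country and city narrow the index to one contiguous range, and age >= 18 is then a range scan within it, so the matching rows are located without a full table scan. Placing the range column (age) last lets all three index columns take part.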
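
The scenario can be tried end to end with a small script. This is a sketch under stated assumptions: it uses SQLite through Python's sqlite3 module rather than a production engine, the sample rows and the name column are invented, and idx_country_city_age is a hypothetical index name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT, city TEXT, age INT)"
)
# Equality columns first, range column last.
conn.execute("CREATE INDEX idx_country_city_age ON users (country, city, age)")
conn.executemany(
    "INSERT INTO users (name, country, city, age) VALUES (?, ?, ?, ?)",
    [
        ("Li", "China", "Beijing", 30),
        ("Wang", "China", "Beijing", 16),
        ("Zhao", "China", "Shanghai", 25),
        ("Ann", "USA", "Boston", 40),
    ],
)

query = "SELECT * FROM users WHERE country = ? AND city = ? AND age >= ?"
params = ("China", "Beijing", 18)

# Column 1 of each row is `name`.
adults = [row[1] for row in conn.execute(query, params)]
plan_text = " | ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)
)

print(adults)
print(plan_text)
```

The plan reports a search on idx_country_city_age that uses all three columns, and only the adult user in Beijing is returned.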

Summary: Indexes reduce disk I/O through data structures like B+ trees, but they require careful design to avoid redundancy and inefficiency. In practice, index strategies need to be adjusted over time based on query patterns and data distribution.