How Database Indexes Work and Optimization Strategies

How Database Indexes Work and Optimization Strategies

Problem Description
A database index is a data structure that improves data retrieval efficiency, similar to a book's table of contents. It establishes a mapping between key values and data locations, reducing the number of disk I/O operations during queries. This topic will delve into the underlying workings of indexes (such as B+ tree structure), the applicable scenarios for different types of indexes, how to optimize query performance through indexes, and how to avoid common pitfalls that cause index inefficiency.

Solution Process and Knowledge Explanation

1. The Core Role of Indexes

  • Problem Scenario: Assuming there are 100,000 records in a table, a query like SELECT * FROM users WHERE age = 25 requires row-by-row scanning (full table scan), which is inefficient.
  • Index Solution: An index creates an independent data structure for the age field, storing age values and pointers (e.g., physical addresses) to the corresponding data rows. The query directly locates rows that meet the condition, avoiding scanning all data.
  • Analogy: A book's table of contents quickly locates chapters via page numbers. An index is like a directory, field values are like keywords, and pointers are like page numbers.

2. The Underlying Structure of Indexes (Taking B+ Tree as an Example)

  • B+ Tree Characteristics:
    • Multi-way Balanced Tree: Each node can contain multiple key values (e.g., [10, 20, 30]), maintaining balanced levels, with stable query time complexity of O(log n).
    • Leaf Nodes Store Data: The leaf layer stores all key values and their pointers, connected into an ordered linked list via pointers (supporting range queries).
    • Non-leaf Nodes Store Only Indexes: Upper-level nodes store only key values and child node pointers, reducing tree height.
  • Query Process Example:
    Assuming a 3-level index tree, querying age=25:
    1. Compare from the root node (e.g., storing [20, 40]): 25 falls within the 20-40 range, proceed to the second-level middle child node.
    2. The second-level node (e.g., [20, 25, 30]) locates the pointer to the leaf node corresponding to 25.
    3. The leaf node directly obtains the data row address, completing the query.
  • Advantage: With 100,000 data entries, a B+ tree requires only 3-4 disk I/O operations (assuming each node stores 1,000 key values), while a full table scan might require hundreds of I/O operations.

3. Index Types and Applicable Scenarios

  • Clustered Index (e.g., InnoDB Primary Key Index):
    • Leaf nodes directly store row data (not pointers), and table data is physically sorted by the primary key.
    • Applicable Scenarios: Primary key queries, range queries (e.g., WHERE id BETWEEN 100 AND 200).
  • Non-clustered Index (Secondary Index):
    • Leaf nodes store primary key values, requiring a back-to-table query: first find the primary key via the secondary index, then retrieve data via the primary key index.
    • Applicable Scenarios: Frequently queried non-primary key fields (e.g., creating an index for the age field).
  • Composite Index (Multi-column Index):
    • The index key includes multiple fields (e.g., (age, name)), sorted according to the leftmost prefix principle (first by age, then by name when ages are equal).
    • Leftmost Prefix Matching Principle: The query condition must include the leftmost column of the composite index (e.g., WHERE age=25 AND name='Alice' can use the index, but WHERE name='Alice' alone cannot).

4. Index Optimization Strategies

  • Guidelines for Creating Indexes:
    • Prioritize High Selectivity Fields: Fields with high distinctiveness (e.g., ID numbers) filter out more data.
    • Avoid Over-indexing: Indexes consume storage space, and insert/update/delete operations require index maintenance, affecting write performance.
  • Common Index Inefficiency Scenarios:
    • Violating the Leftmost Prefix Principle: Skipping the leftmost column in a composite index query.
    • Applying Operations or Functions to Indexed Columns: E.g., WHERE age+1=26 or WHERE UPPER(name)='ALICE'.
    • Leading Wildcard in Fuzzy Queries: E.g., WHERE name LIKE '%abc' (LIKE 'abc%' can use the index).
    • Type Conversion: Using numeric queries on string fields (e.g., WHERE phone=123 where phone is of varchar type).

5. Practical Case Analysis

  • Problem Query: SELECT * FROM orders WHERE user_id=100 AND order_date > '2023-01-01'.
  • Optimization Steps:
    1. Analyze data distribution: user_id has high selectivity, order_date is a range query.
    2. Create a composite index (user_id, order_date):
      • First, exactly match all relevant records via user_id, then filter by order_date within the results.
    3. Avoid Back-to-Table Queries: If only specific fields are needed (e.g., order_id), create a covering index (user_id, order_date, order_id) to retrieve data directly from the index without back-to-table queries.

Through the above steps, indexes optimize queries from full table scans to minimal disk access, significantly improving performance. In practice, it's essential to combine execution plan analysis (e.g., the EXPLAIN command) to verify index usage.