Principles and Optimization of Database Indexes

Principles and Optimization of Database Indexes

I. Concept and Purpose of Indexes
An index is a data structure used in databases to quickly locate data, similar to a book's table of contents. Its core function is to reduce the number of disk I/O operations by pre-organizing data (e.g., sorting) to speed up queries. For example, querying a record in a table without an index may require a full table scan (checking each row), while an index can directly locate the data's position using specific algorithms (such as B+ trees).

II. Underlying Data Structures of Indexes

  1. B+ Tree (Most Commonly Used):

    • Structural Features:
      • All data is stored in leaf nodes; non-leaf nodes only store key values (indexes) and pointers.
      • Leaf nodes are connected via pointers to form an ordered linked list, supporting range queries.
    • Advantages:
      • Low tree height (typically 3-4 layers can store billions of records), reducing disk access frequency.
      • Leaf nodes are ordered, making them suitable for sorting, grouping, and other operations.
  2. Hash Index:

    • Maps key values to storage locations via a hash function, suitable for equality queries (e.g., =), but does not support range queries (e.g., >).

III. Principles for Creating and Using Indexes

  1. Scenarios Suitable for Creating Indexes:

    • Fields frequently used as query conditions (e.g., WHERE user_id = 1001).
    • Fields requiring sorting or grouping (e.g., ORDER BY create_time).
    • Foreign key fields (to ensure efficiency in join queries).
  2. Cases Where Indexes Are Not Suitable:

    • Fields with high data duplication (e.g., gender), where indexes are less effective.
    • Fields updated frequently, as maintaining indexes incurs high costs (e.g., rebalancing B+ trees).

IV. Index Optimization Strategies

  1. Covering Index:

    • The index includes all fields required for the query, eliminating the need for table lookups (e.g., an index (user_id, name) covers the query SELECT name FROM users WHERE user_id=1).
  2. Leftmost Prefix Principle:

    • A composite index (A, B, C) is only effective for queries such as:
      • WHERE A=1
      • WHERE A=1 AND B=2
      • However, it cannot optimize WHERE B=2 (since the leftmost field A is not used).
  3. Common Causes of Index Invalidation:

    • Using functions on indexed fields (e.g., WHERE UPPER(name)='ABC').
    • Fuzzy queries starting with a wildcard (e.g., LIKE '%abc').
    • Implicit data type conversion (e.g., querying a string field with a number).

V. Practical Examples
Assume the table orders has the following structure:

CREATE TABLE orders (  
  order_id INT PRIMARY KEY,  
  user_id INT,  
  status VARCHAR(10),  
  amount DECIMAL(10,2),  
  create_time DATETIME  
);  
  • Scenario 1: Frequently querying orders by user_id:
    CREATE INDEX idx_user_id ON orders(user_id);  
    
  • Scenario 2: Needing to sort by status and create_time:
    CREATE INDEX idx_status_time ON orders(status, create_time);  
    
    The query SELECT * FROM orders WHERE status='paid' ORDER BY create_time DESC can directly utilize the index.

VI. Summary
Indexes improve query efficiency by trading space for time but require a balance between read and write performance. When designing indexes, consider business query patterns to avoid over-indexing (which affects write speed) or missing critical indexes (leading to slow queries).