Principles and Optimization of Database Indexes

Principles and Optimization of Database Indexes

I. Concept and Purpose of Indexes
An index is a data structure used in databases to quickly locate data, similar to a book's table of contents. Its core function is to reduce the number of disk I/O operations by pre-organizing data (e.g., sorting) to speed up queries. For example, querying a record in a table without an index may require a full table scan (checking each row), while an index can directly locate the data's position using specific algorithms (such as B+ trees).

II. Underlying Data Structures of Indexes

B+ Tree (Most Commonly Used):
- Structural Features:
  - All data is stored in leaf nodes; non-leaf nodes only store key values (indexes) and pointers.
  - Leaf nodes are connected via pointers to form an ordered linked list, supporting range queries.
- Advantages:
  - Low tree height (typically 3-4 layers can store billions of records), reducing disk access frequency.
  - Leaf nodes are ordered, making them suitable for sorting, grouping, and other operations.
Hash Index:
- Maps key values to storage locations via a hash function, suitable for equality queries (e.g., =), but does not support range queries (e.g., >).

III. Principles for Creating and Using Indexes

Scenarios Suitable for Creating Indexes:
- Fields frequently used as query conditions (e.g., WHERE user_id = 1001).
- Fields requiring sorting or grouping (e.g., ORDER BY create_time).
- Foreign key fields (to ensure efficiency in join queries).
Cases Where Indexes Are Not Suitable:
- Fields with high data duplication (e.g., gender), where indexes are less effective.
- Fields updated frequently, as maintaining indexes incurs high costs (e.g., rebalancing B+ trees).

IV. Index Optimization Strategies

Covering Index:
- The index includes all fields required for the query, eliminating the need for table lookups (e.g., an index (user_id, name) covers the query SELECT name FROM users WHERE user_id=1).
Leftmost Prefix Principle:
- A composite index (A, B, C) is only effective for queries such as:
  - WHERE A=1
  - WHERE A=1 AND B=2
  - However, it cannot optimize WHERE B=2 (since the leftmost field A is not used).
Common Causes of Index Invalidation:
- Using functions on indexed fields (e.g., WHERE UPPER(name)='ABC').
- Fuzzy queries starting with a wildcard (e.g., LIKE '%abc').
- Implicit data type conversion (e.g., querying a string field with a number).

V. Practical Examples
Assume the table orders has the following structure:

CREATE TABLE orders (  
  order_id INT PRIMARY KEY,  
  user_id INT,  
  status VARCHAR(10),  
  amount DECIMAL(10,2),  
  create_time DATETIME  
);

Scenario 1: Frequently querying orders by user_id:

CREATE INDEX idx_user_id ON orders(user_id);

Scenario 2: Needing to sort by status and create_time:
```
CREATE INDEX idx_status_time ON orders(status, create_time);  
```
The query SELECT * FROM orders WHERE status='paid' ORDER BY create_time DESC can directly utilize the index.

VI. Summary
Indexes improve query efficiency by trading space for time but require a balance between read and write performance. When designing indexes, consider business query patterns to avoid over-indexing (which affects write speed) or missing critical indexes (leading to slow queries).