How Database Indexes Work and Optimization Strategies

How Database Indexes Work and Optimization Strategies

Problem Description
Indexes are a core technology in databases for improving query performance. Their essence is an ordered data structure used to quickly locate data. Interviews often examine the working mechanism of indexes (such as B+ tree structure), index types (clustered/non-clustered indexes, covering indexes, etc.), and how to design and optimize indexes for practical scenarios. A deep understanding of the costs of indexes (degraded write performance, space usage) and optimization principles is required.

Step-by-Step Explanation

1. The Core Role and Costs of Indexes

Role: Similar to a book's table of contents, avoids full table scans, reducing query time complexity from O(n) to O(log n).
Costs:
- Slower Write Operations: Each INSERT/UPDATE/DELETE requires updating the index to maintain data structure order.
- Space Usage: Indexes require additional storage space, e.g., B+ trees contain non-leaf nodes.
Balancing Principle: Index design must weigh the trade-off between query acceleration and write performance/storage costs.

2. Index Data Structures (Using B+ Tree as an Example)

B+ Tree Characteristics:
- A multi-way balanced search tree where all data is stored in leaf nodes, and non-leaf nodes only store key values (reducing tree depth).
- Leaf nodes are linked via pointers, supporting efficient range queries (e.g., WHERE id BETWEEN 10 AND 100).
Search Process (Using equality query id=25 as an example):
1. Start from the root node, compare with key values in order to determine the next child node (e.g., if root keys are [10,30,50], then 25 falls into the 10~30 interval).
2. Traverse downwards level by level until reaching a leaf node, then use binary search within the leaf node to locate the target data or its address.
3. If the target value is not in the leaf node, the data does not exist.

3. Clustered Index vs. Non-Clustered Index

Clustered Index (e.g., InnoDB's primary key index):
- Leaf nodes directly store row data; table data is physically stored in the order of the index.
- Only one clustered index per table (usually the primary key).
Non-Clustered Index (Secondary index):
- Leaf nodes store primary key values (not the actual data). Queries require a table lookup: first query the secondary index to get the primary key, then use the primary key index to fetch the data.
- Example: Table users(id PK, name INDEX, age), query SELECT * FROM users WHERE name='Alice':
  ① Find the primary key value id=5 corresponding to name='Alice' in the name index tree;
  ② Use id=5 to perform a table lookup via the clustered index to get the row data.

4. Covering Index Optimization

Definition: An index contains all fields required by the query, avoiding table lookups.
Example: Query SELECT name FROM users WHERE name='Alice'. If the leaf node of the (name) index already stores the name value, the result can be returned directly.
Optimization Suggestion: Frequently queried fields can be included in the index (e.g., a composite index (name, age) covering SELECT name, age FROM users).

5. Index Invalidation Scenarios and Optimization Strategies

Common Causes of Invalidation:
- Using functions or operations on indexed columns (e.g., WHERE UPPER(name)='ALICE').
- Implicit type conversion (e.g., querying a string column with a number WHERE id='100').
- Leading wildcard in LIKE patterns (LIKE '%abc' cannot leverage index ordering).
- Composite index not adhering to the leftmost prefix principle (for index (a,b,c), if query condition lacks a, the index is ineffective).
Optimization Strategies:
1. Prefix Indexes: Create an index on the first N characters of a long string, balancing selectivity and space (e.g., ALTER TABLE users ADD INDEX (name(10))).
2. Index Condition Pushdown: A feature since MySQL 5.6. It filters by non-indexed conditions earlier during index traversal, reducing table lookups.
3. Index Merge: Scanning multiple single-column indexes separately and merging results (e.g., WHERE a=1 OR b=2), but efficiency is often lower than using a composite index.

6. Practical Principles for Index Design

Prioritize High-Frequency Queries: Create indexes for fields in WHERE, JOIN, and ORDER BY clauses.
Selectivity Principle: Choose columns with high selectivity (high ratio of distinct values). For example, a gender field has low selectivity, making indexes less effective.
Prefer Shorter Fields: Smaller key lengths allow B+ tree nodes to store more keys, reducing tree height.
Avoid Over-Indexing: Numerous indexes increase optimizer selection overhead and hurt write performance.

Through the above steps, one can systematically understand how indexes accelerate queries, common optimization techniques, and design trade-offs, enabling efficient application of indexes in real-world databases.