Principles and Optimization of Database Indexes

Principles and Optimization of Database Indexes

1. Basic Concepts of Indexes
An index is a pre-sorted data structure for one or more columns in a database, similar to a book's table of contents. Its core purpose is to speed up data retrieval, but it increases storage overhead and write operation (insert, delete, update) costs. Common index types include:

  • B+ Tree Index (Most commonly used, suitable for range queries and equality queries)
  • Hash Index (Suitable for equality queries, but does not support range queries)
  • Full-Text Index (Used for keyword search in text)

2. Underlying Structure of Indexes (Using B+ Tree as an Example)
Step 1: Why use B+ tree instead of binary tree?

  • A binary tree can degenerate into a linked list under extreme circumstances (e.g., inserting ordered data), causing query efficiency to degrade from O(log n) to O(n).
  • B+ tree is a multi-way balanced search tree where each node can store multiple key values and pointers. The tree height is lower (typically 3-4 layers can store billions of records), reducing disk I/O operations.

Step 2: Characteristics of B+ Tree

  • Non-leaf nodes only store key values and child node pointers, not actual data;
  • Leaf nodes store all data (or primary key pointers) and are linked in a list, facilitating range queries;
  • All data is stored in leaf nodes, ensuring stable query efficiency (every query must traverse to a leaf node).

3. Index Optimization Strategies
Step 1: Index Selectivity

  • Columns with high selectivity (e.g., unique IDs) are more suitable for indexing. Columns with few duplicate values can filter data faster.
  • Formula: Selectivity = Number of distinct values / Total number of rows. The closer to 1, the better the performance.

Step 2: Leftmost Prefix Principle (For Composite Indexes)

  • A composite index (e.g., (name, age)) can only be matched from left to right.
  • Valid use cases: WHERE name='Alice', WHERE name='Alice' AND age=20;
  • Invalid use case: WHERE age=20 (cannot skip name to use the index directly).

Step 3: Avoiding Common Scenarios of Index Invalidation

  • Applying functions or calculations on indexed columns (e.g., WHERE UPPER(name)='ALICE');
  • Using !=, NOT IN (some databases may perform full table scans);
  • Fuzzy queries starting with a wildcard (e.g., LIKE '%abc').

4. Costs and Use Cases of Indexes

  • Advantages: Significantly speeds up queries, supports sorting and grouping (e.g., ORDER BY on indexed columns can avoid extra sorting).
  • Disadvantages:
    • Occupies disk space;
    • Write operations require updating indexes, slowing down insert/update speeds;
    • Too many indexes increase the query optimizer's selection cost.

5. Practical Recommendations

  • Prioritize creating indexes for frequently queried columns with high filtering capability;
  • Use EXPLAIN to analyze SQL execution plans and observe whether indexes are hit;
  • Regularly clean up redundant indexes (e.g., in MySQL, unused indexes can be viewed via sys.schema_unused_indexes).

By following these steps, you can understand how indexes balance query efficiency and storage overhead, and design indexes appropriately in practical scenarios.