Principles and Optimization of Database Indexes

Principles and Optimization of Database Indexes

1. Basic Concepts of Indexes
An index is a pre-sorted data structure for one or more columns in a database, similar to a book's table of contents. Its core purpose is to speed up data retrieval, but it increases storage overhead and write operation (insert, delete, update) costs. Common index types include:

B+ Tree Index (Most commonly used, suitable for range queries and equality queries)
Hash Index (Suitable for equality queries, but does not support range queries)
Full-Text Index (Used for keyword search in text)

2. Underlying Structure of Indexes (Using B+ Tree as an Example)
Step 1: Why use B+ tree instead of binary tree?

A binary tree can degenerate into a linked list under extreme circumstances (e.g., inserting ordered data), causing query efficiency to degrade from O(log n) to O(n).
B+ tree is a multi-way balanced search tree where each node can store multiple key values and pointers. The tree height is lower (typically 3-4 layers can store billions of records), reducing disk I/O operations.

Step 2: Characteristics of B+ Tree

Non-leaf nodes only store key values and child node pointers, not actual data;
Leaf nodes store all data (or primary key pointers) and are linked in a list, facilitating range queries;
All data is stored in leaf nodes, ensuring stable query efficiency (every query must traverse to a leaf node).

3. Index Optimization Strategies
Step 1: Index Selectivity

Columns with high selectivity (e.g., unique IDs) are more suitable for indexing. Columns with few duplicate values can filter data faster.
Formula: Selectivity = Number of distinct values / Total number of rows. The closer to 1, the better the performance.

Step 2: Leftmost Prefix Principle (For Composite Indexes)

A composite index (e.g., (name, age)) can only be matched from left to right.
Valid use cases: WHERE name='Alice', WHERE name='Alice' AND age=20;
Invalid use case: WHERE age=20 (cannot skip name to use the index directly).

Step 3: Avoiding Common Scenarios of Index Invalidation

Applying functions or calculations on indexed columns (e.g., WHERE UPPER(name)='ALICE');
Using !=, NOT IN (some databases may perform full table scans);
Fuzzy queries starting with a wildcard (e.g., LIKE '%abc').

4. Costs and Use Cases of Indexes

Advantages: Significantly speeds up queries, supports sorting and grouping (e.g., ORDER BY on indexed columns can avoid extra sorting).
Disadvantages:
- Occupies disk space;
- Write operations require updating indexes, slowing down insert/update speeds;
- Too many indexes increase the query optimizer's selection cost.

5. Practical Recommendations

Prioritize creating indexes for frequently queried columns with high filtering capability;
Use EXPLAIN to analyze SQL execution plans and observe whether indexes are hit;
Regularly clean up redundant indexes (e.g., in MySQL, unused indexes can be viewed via sys.schema_unused_indexes).

By following these steps, you can understand how indexes balance query efficiency and storage overhead, and design indexes appropriately in practical scenarios.