Principles and Optimization of Database Indexes
1. Basic Concepts of Indexes
An index is a pre-sorted data structure for one or more columns in a database, similar to a book's table of contents. Its core purpose is to speed up data retrieval, but it increases storage overhead and write operation (insert, delete, update) costs. Common index types include:
- B+ Tree Index (Most commonly used, suitable for range queries and equality queries)
- Hash Index (Suitable for equality queries, but does not support range queries)
- Full-Text Index (Used for keyword search in text)
2. Underlying Structure of Indexes (Using B+ Tree as an Example)
Step 1: Why use B+ tree instead of binary tree?
- A binary tree can degenerate into a linked list under extreme circumstances (e.g., inserting ordered data), causing query efficiency to degrade from O(log n) to O(n).
- B+ tree is a multi-way balanced search tree where each node can store multiple key values and pointers. The tree height is lower (typically 3-4 layers can store billions of records), reducing disk I/O operations.
Step 2: Characteristics of B+ Tree
- Non-leaf nodes only store key values and child node pointers, not actual data;
- Leaf nodes store all data (or primary key pointers) and are linked in a list, facilitating range queries;
- All data is stored in leaf nodes, ensuring stable query efficiency (every query must traverse to a leaf node).
3. Index Optimization Strategies
Step 1: Index Selectivity
- Columns with high selectivity (e.g., unique IDs) are more suitable for indexing. Columns with few duplicate values can filter data faster.
- Formula:
Selectivity = Number of distinct values / Total number of rows. The closer to 1, the better the performance.
Step 2: Leftmost Prefix Principle (For Composite Indexes)
- A composite index (e.g.,
(name, age)) can only be matched from left to right. - Valid use cases:
WHERE name='Alice',WHERE name='Alice' AND age=20; - Invalid use case:
WHERE age=20(cannot skipnameto use the index directly).
Step 3: Avoiding Common Scenarios of Index Invalidation
- Applying functions or calculations on indexed columns (e.g.,
WHERE UPPER(name)='ALICE'); - Using
!=,NOT IN(some databases may perform full table scans); - Fuzzy queries starting with a wildcard (e.g.,
LIKE '%abc').
4. Costs and Use Cases of Indexes
- Advantages: Significantly speeds up queries, supports sorting and grouping (e.g.,
ORDER BYon indexed columns can avoid extra sorting). - Disadvantages:
- Occupies disk space;
- Write operations require updating indexes, slowing down insert/update speeds;
- Too many indexes increase the query optimizer's selection cost.
5. Practical Recommendations
- Prioritize creating indexes for frequently queried columns with high filtering capability;
- Use
EXPLAINto analyze SQL execution plans and observe whether indexes are hit; - Regularly clean up redundant indexes (e.g., in MySQL, unused indexes can be viewed via
sys.schema_unused_indexes).
By following these steps, you can understand how indexes balance query efficiency and storage overhead, and design indexes appropriately in practical scenarios.