Principle of B+ Tree Index and Its Application in Databases

Principle of B+ Tree Index and Its Application in Databases

Problem Description: B+ tree is the most commonly used data structure for database indexing. Please explain the basic structure and characteristics of the B+ tree, and explain why database indexes generally adopt B+ trees instead of binary search trees or B-trees.

Solution Process:

Step 1: Understand the basic requirements of indexing
The core goal of a database index is to achieve fast data retrieval, requiring support for:

Efficient point queries (equality queries)
Efficient range queries (e.g., WHERE id > 100)
Ordered storage of data
Adaptation to disk I/O characteristics (reducing disk access times)

Step 2: Limitations of binary search trees
Binary search trees are highly efficient in memory (O(log n)), but pose problems as disk indexes:

Each node stores only one key-value pair, resulting in greater tree height
Range queries require multiple in-order traversals, generating significant random I/O
Node storage is not compact, unable to fully utilize disk block space

Step 3: Basic structure of B-tree
B-tree is a multi-way balanced search tree that addresses the height issue of binary search trees:

Each node can contain multiple key-values and child node pointers
An m-order B-tree node can have up to m child nodes
All leaf nodes are at the same level, maintaining balance

Step 4: Improved features of B+ tree
B+ tree introduces key optimizations based on the B-tree:

Data is stored only in leaf nodes: Internal nodes store only key-values and child node pointers
Leaf nodes are linked via pointers: Forming an ordered linked list structure
Non-leaf node key-values reappear: They appear again in the leaf nodes

Step 5: Detailed structure analysis of B+ tree
Taking a 3-order B+ tree as an example:

Internal nodes have at most 2 key-values and 3 child node pointers
Leaf nodes store actual data records or pointers to data
Leaf nodes are connected via bidirectional pointers

Example structure:

        [50]
       /     \
  [30,40]   [60,70]
  /  |  \   /  |  \
[10,20]->[30,40]->[50]->[60,70]->[80,90]

Step 6: Detailed advantages of B+ tree

Better suited for disk I/O:
- Each node size is set to the disk page size (e.g., 4KB)
- One I/O operation can read more key-values, reducing tree height
- A 3-level B+ tree can support millions of records (assuming 100 key-values per node)
Extremely efficient for range queries:
- After finding the starting point, scan sequentially along the leaf node linked list
- Avoids backtracking to parent nodes, reducing random I/O
Stable query performance:
- All queries must reach leaf nodes, resulting in equal path lengths
- More consistent performance

Step 7: Comparison with B-tree
Main advantages of B+ tree over B-tree:

Smaller internal nodes: More internal nodes can be cached in memory
Better for range queries: B-trees require in-order traversal, while B+ trees allow direct linked list traversal
More efficient full table scans: B+ trees only need to traverse the leaf node linked list

Step 8: Application in actual databases
In MySQL's InnoDB storage engine:

Clustered index leaf nodes store complete data rows
Secondary index leaf nodes store primary key values
Complete data is retrieved via primary key value lookups (table lookups)

This design ensures ordered data storage and efficient queries, serving as the core implementation method for relational database indexing.