Principle of B+ Tree Index and Its Application in Databases

Principle of B+ Tree Index and Its Application in Databases

Problem Description: B+ tree is the most commonly used data structure for database indexing. Please explain the basic structure and characteristics of the B+ tree, and explain why database indexes generally adopt B+ trees instead of binary search trees or B-trees.

Solution Process:

Step 1: Understand the basic requirements of indexing
The core goal of a database index is to achieve fast data retrieval, requiring support for:

  • Efficient point queries (equality queries)
  • Efficient range queries (e.g., WHERE id > 100)
  • Ordered storage of data
  • Adaptation to disk I/O characteristics (reducing disk access times)

Step 2: Limitations of binary search trees
Binary search trees are highly efficient in memory (O(log n)), but pose problems as disk indexes:

  • Each node stores only one key-value pair, resulting in greater tree height
  • Range queries require multiple in-order traversals, generating significant random I/O
  • Node storage is not compact, unable to fully utilize disk block space

Step 3: Basic structure of B-tree
B-tree is a multi-way balanced search tree that addresses the height issue of binary search trees:

  • Each node can contain multiple key-values and child node pointers
  • An m-order B-tree node can have up to m child nodes
  • All leaf nodes are at the same level, maintaining balance

Step 4: Improved features of B+ tree
B+ tree introduces key optimizations based on the B-tree:

  1. Data is stored only in leaf nodes: Internal nodes store only key-values and child node pointers
  2. Leaf nodes are linked via pointers: Forming an ordered linked list structure
  3. Non-leaf node key-values reappear: They appear again in the leaf nodes

Step 5: Detailed structure analysis of B+ tree
Taking a 3-order B+ tree as an example:

  • Internal nodes have at most 2 key-values and 3 child node pointers
  • Leaf nodes store actual data records or pointers to data
  • Leaf nodes are connected via bidirectional pointers

Example structure:

        [50]
       /     \
  [30,40]   [60,70]
  /  |  \   /  |  \
[10,20]->[30,40]->[50]->[60,70]->[80,90]

Step 6: Detailed advantages of B+ tree

  1. Better suited for disk I/O:

    • Each node size is set to the disk page size (e.g., 4KB)
    • One I/O operation can read more key-values, reducing tree height
    • A 3-level B+ tree can support millions of records (assuming 100 key-values per node)
  2. Extremely efficient for range queries:

    • After finding the starting point, scan sequentially along the leaf node linked list
    • Avoids backtracking to parent nodes, reducing random I/O
  3. Stable query performance:

    • All queries must reach leaf nodes, resulting in equal path lengths
    • More consistent performance

Step 7: Comparison with B-tree
Main advantages of B+ tree over B-tree:

  • Smaller internal nodes: More internal nodes can be cached in memory
  • Better for range queries: B-trees require in-order traversal, while B+ trees allow direct linked list traversal
  • More efficient full table scans: B+ trees only need to traverse the leaf node linked list

Step 8: Application in actual databases
In MySQL's InnoDB storage engine:

  • Clustered index leaf nodes store complete data rows
  • Secondary index leaf nodes store primary key values
  • Complete data is retrieved via primary key value lookups (table lookups)

This design ensures ordered data storage and efficient queries, serving as the core implementation method for relational database indexing.