Principles and Applications of B-tree and B+ Tree

Principles and Applications of B-tree and B+ Tree

Problem Description
B-tree and B+ tree are balanced multi-way search trees specifically designed for external storage devices like disks. They are widely used as index structures in database systems and file systems. The problem requires an understanding of the core characteristics, operational logic (insertion, deletion, query), and their optimization significance in practical systems.

Knowledge Point Explanation

Background and Design Motivation
- Traditional binary search trees (e.g., AVL trees) are efficient in memory, but when data volume is too large, it needs to be stored on disk. Disk I/O speed is much slower than memory. Each read of a node may trigger one disk access.
- If the tree height is high, queries require multiple disk I/Os (e.g., a tree of height 10 requires 10 I/Os), leading to low efficiency.
- B-tree reduces the tree height by increasing the number of child nodes per node (multi-way branching), thereby decreasing the number of disk I/Os. For example, a B-tree of degree 100 can store 1 million data items with only 2-3 levels.
Core Characteristics of B-tree
- Definition: A B-tree is a self-balancing m-way search tree (m≥2) that satisfies the following properties:
  1. Each node has at most m child nodes.
  2. Non-root nodes have at least ⌈m/2⌉ child nodes (except the root).
  3. The number of keys in a node satisfies: when the number of child nodes is k, the number of keys is k-1, and the keys are arranged in ascending order.
  4. All leaf nodes are on the same level, ensuring balance.
- Example: A B-tree of order 3 (m=3), where each node has at most 2 keys and 3 child nodes, and non-root nodes have at least 1 key.
Detailed Operations of B-tree
- Query Operation:
  1. Start from the root node, sequentially search for the key within the node (binary search can be used for acceleration).
  2. If found, return; otherwise, proceed to the corresponding child node based on the key range.
  3. Repeat until reaching a leaf node. If not found, the key does not exist.
  - Advantage: Low tree height, query complexity is O(log_m n), and each node access corresponds to one disk I/O.
- Insertion Operation:
  1. First, find the leaf node where the key should be inserted. After insertion, check if the number of keys in the node exceeds m-1.
  2. If not exceeded, simply finish.
  3. If exceeded (node is full), perform splitting:
    - Promote the middle key to the parent node, and split the node into two new nodes (left and right parts).
    - If the parent node consequently exceeds its limit, continue splitting upward, potentially increasing the tree height.
  - Example: Insert key 5 into a 3-order B-tree. If a leaf node already contains [3,4,6], split it into [3], [6], and promote 4 to the parent node.
- Deletion Operation:
  1. Find the target key. If it's in a non-leaf node, replace it with its predecessor or successor (located in a leaf node), then proceed to delete the key from the leaf node.
  2. After deletion, check if the number of keys in the node falls below ⌈m/2⌉-1.
  3. If insufficient, first try to borrow a key from a sibling node:
    - If a sibling node has surplus keys, transfer one key via the parent node.
  4. If sibling nodes are insufficient, perform merging: Merge a key from the parent node with two sibling nodes, which may trigger cascading adjustments in the parent node.
Improvements and Advantages of B+ Tree
- Structural Differences:
  1. Non-leaf nodes only store keys and child node pointers, not actual data records.
  2. All data records are stored in leaf nodes, and leaf nodes are linked into an ordered linked list via pointers.
- Advantages:
  1. More Stable Query Efficiency: Any query must reach a leaf node, ensuring the same path length.
  2. Efficient Range Queries: Direct traversal via the leaf node linked list without backtracking to upper-level nodes.
  3. Non-leaf Nodes Can Accommodate More Keys: With fixed node size, removing data records makes the structure "leaner and taller," further reducing I/O.
Practical Application Scenarios
- Database Indexing: MySQL's InnoDB engine uses B+ tree as its index structure. Primary key index leaf nodes directly store row data, while secondary indexes store primary key values.
- File Systems: NTFS, HFS+, etc., use B+ trees to manage file block location information.
- Comparison with B-tree: B-tree is more suitable for non-clustered indexes or scenarios where data is stored directly in nodes, but B+ tree has obvious advantages in range queries and disk scans.

Summary
B-tree and B+ tree optimize disk I/O through multi-way balanced design. On this basis, B+ tree further enhances range query and scan efficiency by separating indexes from data and linking leaf nodes. Understanding operational details like splitting and merging helps in rationally selecting index structures when designing storage systems.