Application and Performance Comparison of B-trees and B+ Trees in Database Indexing

Application and Performance Comparison of B-trees and B+ Trees in Database Indexing

Problem Description
B-trees and B+ trees are the two most fundamental data structures in database indexing. They significantly enhance data query efficiency by optimizing disk I/O. Interviews often require a comparison of their differences and an explanation of why modern databases (such as MySQL's InnoDB engine) widely adopt B+ trees. Understanding their design principles, node structure, query performance differences, and applicable scenarios is key to mastering the underlying mechanisms of databases.

Solution Process

1. Starting from the Practical Problem: Why B-trees/B+ Trees?

Background: Database data is stored on disk, and disk I/O is orders of magnitude slower than memory access. Using a binary search tree for indexing could result in a very tall tree, leading to multiple disk accesses.
Core Objective: Reduce the number of disk I/Os → Lower the tree height → Allow each node to store more keys and load more data per disk read.

2. Core Design of B-trees (Balanced Trees)

Definition: A multi-way balanced search tree where each node can contain multiple keys and child pointers.
Key Characteristics (taking an m-order B-tree as an example):
1. Each node has at most m children (m≥2).
2. Non-root nodes have at least ⌈m/2⌉ children (except the root).
3. All leaf nodes are on the same level (perfectly balanced).
Node Structure:
```
Node Example: | Key1 | Key2 | ... | Keyk | Child Pointer0 | Child Pointer1 | ... | Child Pointerk |
```
- Keys are sorted in ascending order. Child Pointer i points to the subtree containing keys between Key i and Key i+1.
- Data can be stored directly within the node (keys and data records may coexist).

3. Improved Design of B+ Trees

Core Differences from B-trees:
1. Data stored only in leaf nodes: Internal nodes store only keys and child pointers, not actual data records.
2. Leaf nodes linked via pointers: All leaf nodes form an ordered linked list, supporting range queries efficiently.
Node Structure Comparison:
- B-tree node: Mixed storage of [Key, Data Pointer].
- B+ tree internal node: Stores only Key + Child Pointer; Leaf node: Stores Key + Data Pointer and includes a pointer to the next leaf node.

4. Step-by-Step Comparison: Query, Insert, Delete, and I/O Efficiency

Scenario 1: Equality Query (e.g., `SELECT * FROM users WHERE id=100`)

B-tree:
- If the key is found in an internal node, data can be returned directly (possible early termination).
- However, storing data in internal nodes reduces the number of keys per page → potentially slightly taller tree.
B+ Tree:
- Must traverse to a leaf node to retrieve data, path length is fixed.
- But internal nodes store no data, allowing more keys per node → lower tree height, fewer I/Os.
Example: Assuming a page size of 4KB, key size 8B, pointer size 6B.
B-tree node: Key + Data Pointer (e.g., 8B+8B=16B) → ~250 entries per page (4000/16≈250).
B+ tree internal node: Only Key + Child Pointer (8B+6B=14B) → ~285 entries per page (4000/14≈285).
Impact on Tree Height: For 1 billion records, a B+ tree might be 1 level shorter than a B-tree (assuming nodes are full).

Scenario 2: Range Query (e.g., `SELECT * WHERE id BETWEEN 100 AND 200`)

B-tree:
- Requires repeated backtracking during in-order traversal, potentially involving accesses to nodes across multiple levels, resulting in many I/Os.
B+ Tree:
- After finding the starting key in a leaf node, sequential scan along the linked list suffices → no need to go back to upper levels, I/O is nearly sequential, extremely efficient.

Scenario 3: Insertion and Deletion

Commonality: Both may cause recursive adjustments due to node splits/merges.
B+ Tree Advantage:
- All data resides at the leaf level, insertions/deletions only affect leaf nodes, making adjustments more localized.
- The leaf node linked list structure simplifies rebalancing with neighboring nodes.

5. Why Do Databases Choose B+ Trees? — Quantitative Analysis

Higher Fan-out: Internal nodes store no data → shorter tree → fewer disk I/Os for queries.
Better Suited for Disk Prefetching: The leaf node linked list aligns with disk sequential access patterns, significantly boosting range query efficiency.
Stability: All queries must go to leaf nodes, guaranteeing stable O(log n) time complexity.
Example Calculation (same assumptions as above):
- B+ Tree: A 3-level tree can store entries ≈ 285^2 * (# entries per leaf node, which stores more data pointers).
- For the same data volume, a B-tree might require 4 levels → one extra disk I/O (mechanical disk I/O ~10ms, a significant impact).

6. Interview Extension: Specific Implementation of B+ Trees in MySQL InnoDB

Clustered Index: Leaf nodes directly store the complete row data (primary key index).
Non-clustered Index (Secondary Index): Leaf nodes store primary key values; a 'bookmark lookup' (回表查询) requires a second access.
Adaptive Hash Index: InnoDB builds hash indexes in memory for hot B+ tree pages to accelerate queries.

Summary
Through two key designs—"internal nodes store only keys" and "leaf node linked list"—B+ trees comprehensively outperform B-trees in range queries, tree height control, and disk I/O optimization. Although B-trees might return early for equality queries, databases prioritize the prevalence of range queries and stability. Therefore, B+ trees have become the de facto standard for database indexes. In your answer, you can support your points with disk principles (like page-based storage, sequential access advantages) and specific database implementations (like MySQL).