Data Structures and Selection of Database Indexes

Data Structures and Selection of Database Indexes

Problem Description:
Database indexes are a key technology for improving query performance. They are essentially efficient data structures used to quickly locate and access specific data in database tables. Different index data structures (such as B+ trees, hash indexes, bitmap indexes, etc.) are suitable for different scenarios. Interview questions often require explaining the principles, advantages, and disadvantages of common index structures, as well as how to select the appropriate index type based on actual business requirements (such as query type, data distribution, and frequency of write operations).

Knowledge Explanation:

Core Function of Indexes:
Indexes are similar to a book's table of contents. By maintaining an additional data structure (index files), they avoid full table scans, reducing query time complexity from O(n) to near O(log n) or O(1). However, indexes occupy storage space and add maintenance overhead during data writes (such as balanced tree adjustments and page splits).
B+ Tree Index (Most Common):
- Structural Principle:
  A B+ tree is a multi-way balanced search tree with the following characteristics:
  - All data records are stored in leaf nodes, while non-leaf nodes only store key values (index keys) and child node pointers.
  - Leaf nodes are linked in an ordered list via pointers, supporting efficient range queries (e.g., WHERE age BETWEEN 20 AND 30).
- Applicable Scenarios:
  - Suitable for range queries, sorting (ORDER BY), and fuzzy matching (LIKE 'abc%').
  - Ideal for OLTP systems with balanced read and write operations (e.g., MySQL's InnoDB default uses B+ trees).
- Disadvantages:
  - Less efficient than hash indexes for purely random key value queries (e.g., WHERE id = 123).
Hash Index:
- Structural Principle:
  Uses a hash function to map index keys to fixed-size hash buckets, where each bucket stores pointers to data rows. Ideally, query time complexity is O(1).
- Applicable Scenarios:
  - Only supports equality queries (=, IN), not range queries or sorting.
  - Suitable for in-memory databases (e.g., Redis) or scenarios with frequent equality queries.
- Disadvantages:
  - Hash collisions may affect performance; dynamic resizing is required for uneven data distribution.
Bitmap Index:
- Structural Principle:
  Creates a bitmap for each index key value, where each bit indicates whether a row in the table contains that key value. For example, a bitmap for a gender field: Male: 1010, Female: 0101.
- Applicable Scenarios:
  - Suitable for low-cardinality columns (with many repeated values), such as gender or status flags.
  - Efficiently handles multi-condition combined queries (via bitwise operations AND/OR).
- Disadvantages:
  - Coarse-grained locking during high-concurrency write operations (may lock large portions of the bitmap), making it unsuitable for frequently updated OLTP systems.
Practical Principles for Index Selection:
- Prioritize Query Type:
  - Primarily range queries → B+ tree.
  - Primarily equality queries without sorting → Hash index (if supported by the database, e.g., MySQL's Memory engine).
- Consider Data Characteristics:
  - Low-cardinality fields with multi-condition queries → Bitmap index (e.g., data warehousing scenarios).
  - High-cardinality fields (e.g., user ID) → B+ tree.
- Balance Maintenance Costs:
  - Exercise caution in write-intensive scenarios: B+ tree update costs are lower than bitmap indexes, while hash indexes may briefly impact performance during resizing.

Summary:
Index selection is a process of balancing query efficiency, storage overhead, and maintenance costs. B+ trees have become mainstream due to their versatility; hash indexes are suitable for specific high-speed queries; bitmap indexes are significant in analytical systems. In practice, design must also consider database engine features (e.g., MySQL's adaptive hash index) and index optimization strategies (e.g., covering indexes, the leftmost matching principle).