Index Selection and Index Failure Scenarios in Database Query Optimization

Problem Description
Indexes are one of the core means of optimizing database query performance. Used incorrectly, however, an index can fail to be applied, and the query falls back to a slower access path. Interviewers often examine three questions: How do you choose an appropriate index for a given set of query conditions? Which scenarios cause index failure? How do you judge index usage from an execution plan? Understanding these is crucial for SQL optimization in practical work.

1. Basic Functions of Indexes and Selection Principles
The essence of an index is to reduce the amount of data scanned by locating rows quickly through structures such as B+ trees. Index selection should follow these principles:

Prioritize high-selectivity columns: The more unique the values of the indexed column (e.g., ID number), the more data is filtered, and the more effective the index.
Covering indexes reduce table lookups: If the index contains all columns required by the query (e.g., SELECT name FROM users WHERE age=20, with an index on (age, name)), there is no need to access the data rows, and results are returned directly.
Leftmost prefix matching: A composite index (e.g., (a, b, c)) is sorted by a, then b, then c, so it can only be used from left to right. The query condition must include the leftmost column (e.g., a=1 or a=1 AND b=2); otherwise the index cannot be used to locate rows.
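
The covering-index principle can be observed in a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in, since it runs without a server; SQLite's planner and plan wording differ from MySQL's, and the email column and the index name idx_age_name are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the email column exists only to have a non-indexed column.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, email TEXT)")
conn.execute("CREATE INDEX idx_age_name ON users (age, name)")

def plan(sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Every requested column lives in the index, so the table itself is not read.
covered = plan("SELECT name FROM users WHERE age = 20")
# email is not in the index, so each match needs a lookup in the table.
not_covered = plan("SELECT name, email FROM users WHERE age = 20")

print(covered)
print(not_covered)
```

In SQLite the first plan is reported as a covering-index search; in MySQL the corresponding signal is Using index in the Extra column of EXPLAIN.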

2. Common Scenarios and Principles of Index Failure
In the following scenarios the optimizer typically cannot use the index to locate rows and falls back to a full table scan (or a full index scan):

Applying operations or functions to indexed columns
- Example: WHERE YEAR(create_time) = 2023
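- Reason for failure: The index stores the original column values, not the result of YEAR(create_time), so the function has to be evaluated for every row before the comparison can be made.
- Optimization: Rewrite the condition as a range on the bare column. A half-open interval avoids missing timestamps late on December 31: WHERE create_time >= '2023-01-01' AND create_time < '2024-01-01'.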
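
A minimal sketch of this scenario, again using Python's sqlite3 as a stand-in for MySQL. SQLite has no YEAR() function, so strftime('%Y', ...) plays that role here; the table layout and index name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, create_time TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_create_time ON orders (create_time)")

def plan(sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Function applied to the indexed column: the planner has to scan every row.
wrapped = plan("SELECT * FROM orders WHERE strftime('%Y', create_time) = '2023'")

# Range on the bare column: the planner can seek into the index.
ranged = plan(
    "SELECT * FROM orders "
    "WHERE create_time >= '2023-01-01' AND create_time < '2024-01-01'"
)

print(wrapped)
print(ranged)
```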
Implicit type conversion
- Example: The phone column in the table is of string type, but the query is written as WHERE phone = 13800138000 (numeric type).
- Reason for failure: The database must convert the phone column row by row to numeric for comparison, making index usage impossible.
- Optimization: Ensure consistent types: WHERE phone = '13800138000'.
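
The effect of a per-row conversion can be imitated as follows. This is only an analogy: SQLite resolves a text-versus-number comparison differently from MySQL, so the conversion is written out as an explicit CAST to mimic what the text above describes. The schema and index name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, phone TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_phone ON users (phone)")

def plan(sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# The column is converted row by row before comparing, so the index on the
# stored strings cannot be used to locate the value.
converted = plan("SELECT * FROM users WHERE CAST(phone AS INTEGER) = 13800138000")

# The literal has the same type as the column: a direct index lookup.
matched = plan("SELECT * FROM users WHERE phone = '13800138000'")

print(converted)
print(matched)
```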
Fuzzy queries starting with a wildcard
- Example: WHERE name LIKE '%小明'
- Reason for failure: B+ tree indexes are sorted by prefix and cannot match suffixes. LIKE '小明%' can still use the index.
- Optimization: Consider full-text indexes or inverted indexes.
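
Why a sorted structure helps with prefixes but not suffixes can be shown without a database. The sketch below uses a sorted Python list as a simplified stand-in for the ordered leaf level of a B+ tree; the sample names and both function names are made up for illustration.

```python
from bisect import bisect_left

# A sorted list stands in for the ordered entries of an index on name.
names = sorted(["小明", "小明明", "小红", "大明", "王小明", "李小明"])

def prefix_search(sorted_names, prefix):
    """Binary-search to the first candidate, then read the contiguous run."""
    matches = []
    i = bisect_left(sorted_names, prefix)
    while i < len(sorted_names) and sorted_names[i].startswith(prefix):
        matches.append(sorted_names[i])
        i += 1  # entries sharing a prefix are adjacent in sorted order
    return matches

def suffix_search(sorted_names, suffix):
    """Sort order says nothing about endings, so every entry is inspected."""
    return [name for name in sorted_names if name.endswith(suffix)]

print(prefix_search(names, "小明"))  # like LIKE '小明%'
print(suffix_search(names, "小明"))  # like LIKE '%小明'
```

The prefix search touches only the matching run plus one extra entry, while the suffix search always walks the whole list, which mirrors the difference between an index range scan and a full scan.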
OR connecting conditions on non-indexed columns
- Example: WHERE indexed_column = 1 OR non_indexed_column = 2
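- Reason for failure: OR is satisfied by either condition, so rows matching only the non-indexed condition must also be found. Even though one side could use an index, the other side still forces a full table scan.
- Optimization: Add an index on the second column as well, then either rely on the optimizer merging the two indexes or split the query into a UNION so that each branch uses its own index. Without that second index, the second branch below still scans the whole table. UNION also removes duplicate rows, which matches the OR semantics: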
```
SELECT * FROM t WHERE indexed_column = 1
UNION
SELECT * FROM t WHERE non_indexed_column = 2;
```
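
Whether the UNION rewrite pays off depends on each branch having an index to use. The following sketch checks this with Python's sqlite3 as a stand-in for MySQL; the payload column and the index names are assumptions, and the column names are kept from the example above even though the second one gets an index midway.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, indexed_column INTEGER, "
    "non_indexed_column INTEGER, payload TEXT)"
)
conn.execute("CREATE INDEX idx_first ON t (indexed_column)")

def plan_rows(sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1].strip() for row in rows]

union_sql = (
    "SELECT * FROM t WHERE indexed_column = 1 "
    "UNION "
    "SELECT * FROM t WHERE non_indexed_column = 2"
)

# Only the first column is indexed: the second branch is still a scan.
before = plan_rows(union_sql)

# Index the second column too: now both branches can seek.
conn.execute("CREATE INDEX idx_second ON t (non_indexed_column)")
after = plan_rows(union_sql)

print(before)
print(after)
```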
Composite index not satisfying the leftmost prefix
- Example: Index is (a, b), query condition is WHERE b=2.
- Reason for failure: The index is sorted by a. When a is unknown, b cannot be quickly located.
- Optimization: Adjust the index order or add a separate index.
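
The leftmost-prefix rule can be checked the same way. This sketch uses Python's sqlite3 as a stand-in for MySQL; the table, the extra columns, and the index name idx_a_b are assumptions. No statistics are collected, so SQLite's own skip-scan optimization does not come into play.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_a_b ON t (a, b)")

def plan(sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Compare three conditions against the same composite index (a, b).
conditions = ("a = 1", "a = 1 AND b = 2", "b = 2")
plans = {cond: plan("SELECT * FROM t WHERE " + cond) for cond in conditions}

for cond, detail in plans.items():
    print(cond, "->", detail)
```

The first two conditions start from column a and can seek into the index; the last one skips a and is reported as a scan.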

3. Analyzing Index Usage via Execution Plans
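Taking MySQL's EXPLAIN command as an example, the key fields are interpreted as follows:

type: The access method. From most to least efficient, the common values are system > const > eq_ref > ref > range > index > ALL. ALL indicates a full table scan, and index indicates a scan of the entire index.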
key: The actual name of the index used. If it is NULL, no index is used.
Extra:
- Using index: A covering index is used, no table lookup is required.
- Using where: The server layer filters rows after the storage engine returns them; the index, if one is used, satisfies only part of the conditions.
- Using filesort: Additional sorting is required, consider adding an index to optimize sorting.
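
The reading rules above can be collected into a small checklist function. This is a hypothetical helper, not part of MySQL or any driver: it assumes one EXPLAIN row has already been fetched as a dictionary keyed by column name, and that Extra holds its entries separated by semicolons.

```python
def diagnose(row):
    """Summarize one MySQL EXPLAIN row (a dict) using the rules listed above."""
    findings = []
    access_type = row.get("type")
    extra = row.get("Extra") or ""
    # Split on ';' and match whole entries, so "Using index condition"
    # is not mistaken for "Using index".
    flags = [part.strip() for part in extra.split(";") if part.strip()]

    if access_type == "ALL":
        findings.append("full table scan")
    elif access_type == "index":
        findings.append("full index scan")
    if row.get("key") is None:
        findings.append("no index used")
    if "Using index" in flags:
        findings.append("covering index, no table lookup")
    if "Using where" in flags:
        findings.append("rows filtered at the server layer")
    if "Using filesort" in flags:
        findings.append("extra sort pass, consider an index matching ORDER BY")
    return findings

# Example rows written by hand for illustration, not captured from a server.
bad = {"type": "ALL", "key": None, "Extra": "Using where; Using filesort"}
good = {"type": "ref", "key": "idx_age_name", "Extra": "Using index"}
print(diagnose(bad))
print(diagnose(good))
```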

4. Practical Case: Index Selection Strategy
Assume the table orders has the fields user_id (high selectivity), status (low selectivity, only 3 states), and create_time. Consider the query:

```
SELECT * FROM orders
WHERE user_id = 1001 AND status = 'shipped'
ORDER BY create_time DESC;
```

Option 1: Separate index (user_id)
- Advantage: Quickly locates data for user_id=1001.
- Disadvantage: Requires filtering status and sorting in memory, which is inefficient with large data volumes.
Option 2: Composite index (user_id, status, create_time)
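- Advantage: The index matches both equality conditions and already holds the matching rows in create_time order, so no separate sort is needed. Because the query uses SELECT *, table lookups are still required for any columns outside the index.
- Verification: The EXPLAIN result should show type: ref, key naming the composite index, and no Using filesort in Extra. Using index appears only if the select list is limited to columns contained in the index.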
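
The two options can be compared side by side. The sketch below uses Python's sqlite3 as a stand-in for MySQL, so the plan wording differs: SQLite reports an extra sort as USE TEMP B-TREE FOR ORDER BY where MySQL would show Using filesort. The amount column and both index names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "status TEXT, create_time TEXT, amount REAL)"
)

def plan_rows(sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1].strip() for row in rows]

query = (
    "SELECT * FROM orders "
    "WHERE user_id = 1001 AND status = 'shipped' "
    "ORDER BY create_time DESC"
)

# Option 1: single-column index. Rows are found by user_id, then sorted.
conn.execute("CREATE INDEX idx_user ON orders (user_id)")
option1 = plan_rows(query)

# Option 2: composite index. The index order already satisfies ORDER BY.
conn.execute("DROP INDEX idx_user")
conn.execute("CREATE INDEX idx_user_status_time ON orders (user_id, status, create_time)")
option2 = plan_rows(query)

print(option1)
print(option2)
```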

Summary
Index selection must take into account the query's filter conditions, sorting, and grouping, while avoiding the failure scenarios above. Verify index effectiveness through execution plans, and prefer covering and composite indexes where they reduce disk I/O. In practical work, the read benefit of each index must be weighed against the write and storage overhead it adds, to avoid over-indexing.