In-depth Optimization of Database Pagination Queries for Backend Performance


I. Problem Description and Background

In backend services that handle large volumes of data, pagination queries are among the most common operations: user order lists, product lists, system log queries, and similar scenarios. However, once the data volume reaches millions or tens of millions of rows, the performance of a simple LIMIT offset, size query degrades sharply, especially for deep pagination (i.e., when the offset value is very large). This not only slows down the database response but also drives up memory consumption and latency on the application server, making pagination a bottleneck for overall system performance.

Core Issue:
With traditional LIMIT pagination, the database must scan past the first offset matching records before it can return the target rows. For a large offset this "skipping" is extremely expensive: the engine reads a huge number of rows it does not need and then discards them.


II. Evolution and In-depth Analysis of Solutions

Let's start with the most basic method and gradually delve deeper, explaining optimization strategies for different scenarios.

Step 1: Basic Review and Problem Reproduction

First, let's examine the most common pagination SQL writing method and its issues.

-- Traditional pagination query
SELECT * FROM `orders` ORDER BY `create_time` DESC LIMIT 100000, 20;

Execution Process Analysis:

  1. The database must sort the result set according to ORDER BY (a filesort is performed if no suitable index exists).
  2. The database scans the data that meets the conditions, counting from the first record until it reaches the 100,000th record.
  3. Only then does it return the next 20 records.
  4. In this process, the database engine actually processes 100,020 rows of data but returns only the last 20 rows. This is a significant waste of resources.

Performance Bottleneck: As offset increases, the cost of ORDER BY and the "skipping" operation grows linearly, leading to progressively longer query times.
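
To see this cost for yourself, you can inspect the execution plan; below is a minimal sketch assuming MySQL and the `orders` table from the example. The plan's rows estimate, and the Handler_read% session counters after actually running the query, give a sense of how many rows the engine examines compared with the 20 it returns.

-- Inspect the execution plan of the deep-pagination query (MySQL)
EXPLAIN
SELECT * FROM `orders` ORDER BY `create_time` DESC LIMIT 100000, 20;

-- After running the query itself, the handler counters show how much was read
SHOW SESSION STATUS LIKE 'Handler_read%';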

Step 2: Primary Optimization - Utilizing Covering Indexes and Deferred Joins

The optimization idea is to reduce the amount of data that the database needs to scan and process.

  1. Covering Index Optimization:
    First, ensure that the fields in ORDER BY and WHERE have appropriate indexes. However, having an index alone is not enough; the "skipping" operation in deep pagination still occurs on the index, incurring costs.

    -- Assuming an index has been established on (create_time)
    -- The query still scans 100,020 entries on the index
    SELECT * FROM `orders` ORDER BY `create_time` DESC LIMIT 100000, 20;
    
  2. Deferred Join Optimization:
    This is a more advanced technique. The idea is to first use a covering index to quickly locate the primary key IDs of the target rows, and only then join back to the original table to fetch the complete row data for those IDs. The sort-and-skip work now happens over narrow index entries rather than full rows, which drastically reduces the amount of data read.

    SELECT main.*  -- return only the base table's columns; the subquery supplies just the ids
    FROM `orders` AS main
    INNER JOIN (
        -- Subquery: Utilize the covering index to select only the primary key IDs
        SELECT `id`
        FROM `orders`
        -- WHERE conditions can be added here
        ORDER BY `create_time` DESC
        LIMIT 100000, 20
    ) AS sub ON main.id = sub.id
    ORDER BY main.create_time DESC; -- Outer sorting ensures consistent order
    

    Why is this better?

    • The subquery SELECT id FROM orders ... can be satisfied from the index alone: in InnoDB, a secondary index on create_time implicitly stores the primary key id, so no table rows need to be touched. The index tree is much smaller than the table, so scanning it is fast (a sketch of such an index follows this list).
    • Once the 20 target id values are located on the index, the database performs just 20 efficient primary key lookups against the base table, which is far cheaper than reading and discarding 100,000 complete rows.
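
For reference, below is a minimal sketch of the index that makes the subquery covering, assuming MySQL/InnoDB and the `orders` table used above; the index names and the `user_id` filter column are illustrative rather than part of the original schema.

-- Index backing the deferred join (MySQL syntax). InnoDB secondary indexes
-- store the primary key, so an index on create_time alone already covers
-- the subquery's SELECT id ... ORDER BY create_time.
ALTER TABLE `orders` ADD INDEX `idx_create_time` (`create_time`);

-- If the list is filtered (e.g., per user), lead with the filter column so
-- one index serves both the WHERE clause and the sort:
ALTER TABLE `orders` ADD INDEX `idx_user_create_time` (`user_id`, `create_time`);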

Step 3: Ultimate Optimization - Cursor-based Pagination (Seek Method)

This is the "silver bullet" for solving deep pagination problems, especially suitable for infinite scroll scenarios. Its core idea is: Do not record how many rows have been skipped, but record "where we last saw," and then continue from there.

  1. Principle:
    Assume the data list is sorted in descending order by create_time. When returning the first page, we not only return the data but also the create_time and id of the last record. When fetching the next page, the client passes these values to the backend.

  2. Implementation Method:

    -- First page query
    SELECT * FROM `orders` ORDER BY `create_time` DESC, `id` DESC LIMIT 20;
    -- Assume the last record is (create_time='2023-10-10 10:00:00', id=12345)
    
    -- Second page query: Start after the last record of the previous page
    SELECT * FROM `orders`
    WHERE (`create_time` < '2023-10-10 10:00:00')
       OR (`create_time` = '2023-10-10 10:00:00' AND `id` < 12345)
    ORDER BY `create_time` DESC, `id` DESC
    LIMIT 20;
    

    Key Points:

    • The sort key must be deterministic (totally ordered). ORDER BY create_time DESC, id DESC appends the unique id as a tie-breaker, so rows with identical timestamps are neither skipped nor repeated across pages.
    • The compound WHERE condition lets the database use a composite index on (create_time, id) as a range scan: it seeks directly to the starting point instead of reading and discarding the first N rows (the index and an equivalent row-constructor form are sketched after this list).
  3. Advantages:

    • Query time is constant, independent of page depth, offering excellent performance.
    • Avoids the issues of "duplicate data" or "missing data" that occur with traditional pagination when data is frequently added or deleted.
  4. Limitations:

    • Cannot directly jump to a specified page (e.g., page 100); only supports continuous pagination like "previous page/next page."
    • Requires client-side cooperation to maintain cursor state.
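
To complement the key points above, here is a sketch of the supporting composite index together with an equivalent "row constructor" form of the seek condition (MySQL syntax shown; the index name is illustrative). How well the optimizer turns the row comparison into an index range scan varies by database and version, so verify with EXPLAIN on your own data.

-- Composite index that serves both the seek condition and the sort order
ALTER TABLE `orders` ADD INDEX `idx_create_time_id` (`create_time`, `id`);

-- The two-branch OR condition above can also be written as a row constructor
-- comparison, which MySQL and PostgreSQL both accept:
SELECT * FROM `orders`
WHERE (`create_time`, `id`) < ('2023-10-10 10:00:00', 12345)
ORDER BY `create_time` DESC, `id` DESC
LIMIT 20;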

Step 4: Compromise Solution - Primary Key-based Range Pagination

If the business indeed requires page jumping but can accept approximate pagination, this method can be used.

-- Assume the primary key ID of the last record on the previous page is known as last_id
SELECT * FROM `orders`
WHERE `id` < last_id  -- Seek on the primary key instead of using an offset
ORDER BY `id` DESC
LIMIT 20;

This method performs extremely well but requires the primary key to be sequentially auto-incremented and the list order to match the primary key order. It is not suitable for scenarios with complex sorting or filtering conditions.
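
If an approximate jump to page N is acceptable, the starting point can even be estimated from the primary key itself. Below is a rough sketch assuming a gap-free auto-increment id, a fixed page size of 20, and a placeholder page_no supplied by the application; rows deleted over time make the result only approximate.

-- Approximate jump to page `page_no` (1-based). Gaps left by deleted rows
-- shift the page boundaries, so this is an estimate, not an exact page.
SELECT * FROM `orders`
WHERE `id` <= (SELECT MAX(`id`) FROM `orders`) - (page_no - 1) * 20
ORDER BY `id` DESC
LIMIT 20;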


III. Summary and Selection Recommendations

  1. Small Data Volume (Thousands/Tens of Thousands): Directly use LIMIT offset, size; it's simple and effective.
  2. Moderate Data Volume, Requiring Precise Pagination: Prefer deferred join optimization. Ensure the ORDER BY and WHERE columns are properly indexed, and rewrite the SQL so the pagination runs on a primary-key-only subquery before joining back for the full rows.
  3. Large Data Volume, Infinite Scroll/Continuous Pagination Scenarios: Strongly recommend using cursor-based pagination. This is the best practice for performance and consistency.
  4. Large Data Volume, Must Support Page Jumping:
    • If the sorting rules are simple (e.g., by primary key or time), consider range pagination.
    • If not feasible, use deferred joins and enforce a reasonable maximum page depth (e.g., only the first 1,000 pages can be queried), communicating the limit clearly to users.
    • The ultimate solution is to introduce a search engine (e.g., Elasticsearch), which is specifically optimized for complex searches and pagination.

Additional Considerations:

  • Graceful Degradation: When a deep pagination request is detected, consider returning less data or guiding the user to add more filter conditions instead.
  • Cache Warming: For particularly hot data, precompute and cache the first N pages of results in the application-layer cache.

Understanding and flexibly applying these pagination optimization patterns is a key skill for building high-performance, scalable backend services.