Cross-Database Queries and Data Aggregation Challenges After Database and Table Sharding

Problem Description

In a sharded database architecture, data is distributed across multiple databases or tables. When query conditions do not include the shard key (e.g., querying by order creation time when sharding is based on user ID), or when performing statistics across shards (e.g., calculating total platform sales), we face challenges related to cross-database queries and data aggregation. How can these problems be solved efficiently?


1. Problem Background and Core Difficulties

Principle of Sharding: Data is typically distributed across different nodes based on a shard key (e.g., user_id) to ensure data belonging to the same user resides on the same shard, avoiding cross-shard operations.
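
For illustration, a minimal sketch of hash-based routing in Python (the modulo strategy and shard count are assumptions for this example; range- and lookup-table strategies are also common):

```python
NUM_SHARDS = 4  # illustrative shard count

def route(user_id: int) -> int:
    # Hash-style routing: the shard key alone determines the shard,
    # so any query that includes user_id touches exactly one node.
    return user_id % NUM_SHARDS

print(route(42))  # 42 % 4 == 2: all of user 42's rows live on shard 2
```
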
Cross-Database Query Scenarios:

  • Non-Shard Key Queries: For example, querying orders by order_time, where matching rows may be spread across all shards.
  • Aggregation Queries: Such as SUM(amount) or COUNT(*), which require merging results from multiple shards.

Core Difficulties:

  • Performance Bottlenecks: Cross-shard queries involve multiple network round trips, significantly increasing latency.
  • Data Consistency: Aggregation results may be inaccurate under concurrent updates (e.g., double counting).
  • Scalability Limitations: A single node that merges all cross-shard data can become a system bottleneck.

2. Evolution of Solution Approaches

Solution 1: Client-Side Aggregation (Simple but Inefficient)

Steps:

  1. Parse the Query: Receive the request and extract the conditions (e.g., WHERE order_time BETWEEN '2023-01-01' AND '2023-01-31').
  2. Broadcast Query: Send the same SQL statement to all shards.
  3. Collect Partial Results: Each shard returns data matching the conditions (e.g., the list of orders within that shard).
  4. In-Memory Aggregation: Aggregate results from all shards at the application layer, performing sorting, grouping, or calculations.
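
A minimal sketch of these steps, using in-memory SQLite databases as stand-ins for physical shards (the schema, data, and serial fan-out are illustrative assumptions, not a specific client API):

```python
import sqlite3

# Stand-ins for two physical shards, each holding part of the order data.
shards = [sqlite3.connect(":memory:") for _ in range(2)]
for i, conn in enumerate(shards):
    conn.execute("CREATE TABLE orders (user_id INT, order_time TEXT, amount REAL)")
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (i, "2023-01-15", 100.0 * (i + 1)))

sql = ("SELECT amount FROM orders "
       "WHERE order_time BETWEEN '2023-01-01' AND '2023-01-31'")

# Broadcast: send the same SQL to every shard (a real client would fan
# these calls out in parallel rather than serially).
partials = [conn.execute(sql).fetchall() for conn in shards]

# In-memory aggregation: merge all partial results at the application
# layer; this is exactly where memory pressure builds up.
total = sum(amount for rows in partials for (amount,) in rows)
print(total)  # 300.0
```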

Disadvantages:

  • High network overhead (more pronounced with larger data volumes).
  • The application layer must handle large amounts of data, risking out-of-memory errors.
  • Poorly suited to pagination, because global sorting across all shards is required.

Solution 2: Middleware Layer Optimization (e.g., ShardingSphere, MyCat)

Core Idea: Intercept SQL through a proxy layer to automatically handle shard routing and result merging.
Steps:

  1. SQL Parsing and Rewriting: Split the original SQL into multiple sub-SQLs based on sharding rules (e.g., SELECT * FROM order_0 WHERE order_time BETWEEN ...).
  2. Parallel Execution: Send sub-queries to relevant shards simultaneously to reduce serial delay.
  3. Streaming Aggregation: Merge results row by row at the middleware layer (e.g., using merge sort) to avoid memory explosion.
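
A minimal sketch of the streaming-merge step, with generators standing in for per-shard cursors (the data is illustrative; middleware such as ShardingSphere implements the same idea internally in Java):

```python
import heapq

def shard_cursor(rows):
    # Simulates a shard cursor yielding rows already sorted by order_time
    # (the middleware rewrites each sub-SQL to include ORDER BY).
    yield from rows

shard_0 = shard_cursor([("2023-01-02", 50.0), ("2023-01-10", 80.0)])
shard_1 = shard_cursor([("2023-01-05", 30.0), ("2023-01-20", 40.0)])

# heapq.merge performs the merge-sort step lazily, holding only one
# pending row per shard instead of buffering whole result sets.
for order_time, amount in heapq.merge(shard_0, shard_1):
    print(order_time, amount)
```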

Optimization Techniques:

  • Push-Down Computation: Let shards execute WHERE filtering and GROUP BY grouping first to reduce data transfer volume.
  • Pagination Optimization: Each shard first returns the top N rows of data, which are then merged, sorted, and the top M rows are taken (beware of deep pagination issues).
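
A minimal sketch of this top-N merge for "page 2, 10 rows per page" (the per-shard data is simulated); it also shows why deep pages are costly: each shard must return its own first offset + limit rows:

```python
import heapq

# Simulated, pre-sorted order_time values on two shards (illustrative).
shard_rows = [
    [f"2023-01-{d:02d}" for d in range(1, 31, 2)],  # shard 0: odd days
    [f"2023-01-{d:02d}" for d in range(2, 31, 2)],  # shard 1: even days
]

offset, limit = 10, 10  # page 2 with 10 rows per page

def fetch_prefix(rows, n):
    # Stand-in for "SELECT ... ORDER BY order_time LIMIT ?" on one shard:
    # every shard has to return its first offset + limit rows, which is
    # exactly why deep pagination gets expensive.
    return rows[:n]

partials = [fetch_prefix(rows, offset + limit) for rows in shard_rows]

# Merge the sorted per-shard prefixes, then apply the global offset/limit.
page = list(heapq.merge(*partials))[offset:offset + limit]
print(page)  # days 11 through 20 of January
```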

Solution 3: Redundant Storage and Heterogeneous Indexing

Applicable Scenarios: High-frequency non-shard key queries (e.g., querying orders by time range).
Method:

  • Dual-Write Mechanism: When writing data, in addition to storing it by the shard key (user_id), also save it to another storage engine (e.g., Elasticsearch) based on the query dimension (e.g., order_time).
  • Query Routing: Directly query target data IDs through secondary indexes, then fetch details via primary table lookups.
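
A minimal sketch of dual-write plus query routing, with a plain dict standing in for the heterogeneous index (in production this would be Elasticsearch or a dedicated index table; all names here are illustrative):

```python
NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]  # order_id -> order row, per shard
time_index = []                           # (order_time, user_id, order_id)

def write_order(order_id, user_id, order_time, amount):
    # 1. Primary write: route by the shard key (user_id).
    shards[user_id % NUM_SHARDS][order_id] = {
        "user_id": user_id, "order_time": order_time, "amount": amount,
    }
    # 2. Redundant write: index by the query dimension (order_time).
    time_index.append((order_time, user_id, order_id))

def query_by_time(start, end):
    # Query routing: resolve matching IDs via the secondary index, then
    # fetch full rows from the owning shards (no full shard scan).
    return [
        shards[user_id % NUM_SHARDS][order_id]
        for order_time, user_id, order_id in time_index
        if start <= order_time <= end
    ]

write_order(order_id=1, user_id=42, order_time="2023-01-10", amount=99.0)
print(query_by_time("2023-01-01", "2023-01-31"))
```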

Advantages:

  • Avoids full shard scans, query performance approaches that of a single table.
  • Can combine with search engines for complex conditional filtering.

Challenges:

  • Data synchronization delays may cause short-term inconsistencies.
  • Increased storage costs, requiring a trade-off based on business importance.

Solution 4: Pre-Computation and Materialized Views

For Aggregation Queries: Such as daily sales statistics.
Principle:

  • Pre-Compute: Update aggregation results synchronously during data writes (e.g., accumulating sales_amount by day).
  • Store Results: Save aggregation results to a separate table (e.g., daily_sales), and read directly during queries.

Implementation Methods:

  • Triggers or Binlog Listeners: Update summary tables in near real time as data changes.
  • Batch Processing: Compute historical data offline (e.g., Spark jobs).

Notes:

  • Pre-computation dimensions need to be planned in advance, making it difficult to handle ad-hoc queries.
  • Must handle atomicity during concurrent updates (e.g., using distributed locks or CAS operations).
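
A minimal sketch of the pre-compute-and-store idea, using SQLite's upsert (3.24+) as the atomic accumulate step; the table and column names follow the text above, and a production system might instead rely on the locks or CAS just mentioned:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, sales_amount REAL)")

def record_sale(day, amount):
    # Atomic accumulate on the summary row: the increment happens inside
    # the database, so concurrent writers cannot double count.
    conn.execute(
        "INSERT INTO daily_sales (day, sales_amount) VALUES (?, ?) "
        "ON CONFLICT(day) DO UPDATE "
        "SET sales_amount = sales_amount + excluded.sales_amount",
        (day, amount),
    )

record_sale("2023-01-15", 100.0)
record_sale("2023-01-15", 50.0)

# The statistics query now reads one pre-computed row instead of
# scanning every shard.
print(conn.execute("SELECT sales_amount FROM daily_sales WHERE day = ?",
                   ("2023-01-15",)).fetchone())  # (150.0,)
```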

3. Trade-offs and Selection Recommendations in Practice

| Solution | Applicable Scenarios | Key Constraints |
| --- | --- | --- |
| Client-Side Aggregation | Small data volume, low-frequency queries | Result set must be manageable |
| Middleware Layer Optimization | Routine pagination/sorting needs | Relies on middleware capabilities |
| Redundant Storage | High-frequency non-shard-key queries | Requires ensuring data synchronization |
| Pre-Computation | Fixed-dimension statistics | Business logic must be stable |

Supplementary Strategies:

  • Limit Cross-Shard Queries: Enforce inclusion of the shard key in business design (e.g., resolve the user ID first, then query that user's orders on its own shard).
  • Eventually Consistent Aggregation: Allow statistical results to be temporarily lagged, corrected through asynchronous compensation (e.g., reconciliation programs).
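
A minimal sketch of such an asynchronous reconciliation pass, reusing the illustrative in-memory structures from the earlier sketches (assumptions, not a product API):

```python
def reconcile_daily_sales(day, shards, summary):
    # Recompute the authoritative total by scanning the shards offline
    # (e.g., in a nightly job), then overwrite the summary value.
    true_total = sum(
        row["amount"]
        for shard in shards
        for row in shard.values()
        if row["order_time"] == day
    )
    # Readers may see a stale value between runs; that lag is the
    # "eventually consistent" trade-off the business accepts.
    summary[day] = true_total

# Example: the summary drifted (say, a lost increment) and is corrected.
shards = [{1: {"user_id": 42, "order_time": "2023-01-15", "amount": 99.0}}]
summary = {"2023-01-15": 0.0}
reconcile_daily_sales("2023-01-15", shards, summary)
print(summary)  # {'2023-01-15': 99.0}
```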

Summary

The essence of handling cross-database queries is trading storage for computation, or redundancy for performance. In practical architectures, multiple solutions must be combined according to business characteristics (e.g., redundant indexes for hotspot queries, middleware for occasional aggregations). The key is to reduce cross-shard operations through sound shard design, rather than chasing a one-size-fits-all solution.