Multi-Site Active-Active Architecture Design in Distributed Systems

Description
Multi-site active-active architecture is a distributed system design pattern that deploys multiple data centers in different geographic regions, with every data center serving reads and writes at the same time. Its core objectives are to maintain business continuity when a data center fails or a regional disaster occurs, and to improve user experience by serving users from a nearby site. Unlike traditional disaster recovery setups (cold/hot standby), in which standby sites take over only after a failure, multi-site active-active keeps every data center active, so writes are accepted at multiple locations concurrently.

Design Process

1. Understanding Core Challenges

Data Consistency: How to resolve conflicts when multiple data centers write to the same data concurrently?
Network Latency: How to ensure a good user experience given high cross-regional network latency?
Traffic Routing: How to intelligently route user requests to the nearest or most appropriate data center?

2. Design Principles

Service Tiering: Not all services are suitable for active-active. Prioritize implementing active-active for core services (e.g., login, transactions). Non-core services (e.g., reporting) can be centrally deployed.
Data Partitioning: Limit data reads and writes to specific data centers by partitioning data based on criteria like user ID hashing, reducing the need for cross-center data synchronization.
Eventual Consistency: Allow temporary inconsistencies while data is synchronized across centers, but ensure that all centers eventually converge on the same data through conflict resolution mechanisms.
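
The data partitioning principle above can be sketched as a stable hash from user ID to a "home" data center. This is a minimal illustration, not a prescribed implementation: the site names and the function name home_data_center are assumptions made for the example.

```python
import hashlib

# Hypothetical site names, used only for illustration.
DATA_CENTERS = ["beijing", "shanghai"]


def home_data_center(user_id: str, centers=DATA_CENTERS) -> str:
    """Map a user ID to its home data center using a stable hash.

    hashlib is used instead of the built-in hash() because hash() of a
    string is randomized per process and would not be stable across hosts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(centers)
    return centers[bucket]


if __name__ == "__main__":
    for uid in ("user-1001", "user-1002", "user-1003"):
        print(uid, "->", home_data_center(uid))
```

Simple modulo hashing reassigns many users when the number of centers changes, so a real deployment would likely need a lookup table or consistent hashing plus a data migration plan.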

3. Key Technical Solutions

Routing Layer Design:
1. Use DNS resolution or HTTP redirection to direct user requests to the nearest data center.
2. Dynamically adjust routing policies based on user location information (e.g., IP address).
  Example: When a user accesses from Beijing, DNS returns the IP address of the Beijing data center.
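
The routing decision above can be sketched as a nearest-site lookup. The sketch assumes that the user's IP address has already been resolved to approximate coordinates by some geolocation service (not shown), and the site coordinates are rough values chosen for illustration.

```python
import math

# Approximate (latitude, longitude) of each site; illustrative values only.
DATA_CENTERS = {
    "beijing": (39.90, 116.40),
    "shanghai": (31.23, 121.47),
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_data_center(user_location, centers=DATA_CENTERS):
    """Return the name of the site geographically closest to the user."""
    return min(centers, key=lambda name: haversine_km(user_location, centers[name]))


if __name__ == "__main__":
    print(nearest_data_center((39.13, 117.20)))  # a user near Tianjin
    print(nearest_data_center((30.27, 120.15)))  # a user near Hangzhou
```

Geographic distance is only a proxy for network latency; routing on measured latency is a common alternative when such measurements are available.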
Data Synchronization Mechanisms:
1. Asynchronous Replication: Use message queues or database log synchronization tools (e.g., Canal, Debezium) to asynchronously transmit data changes, avoiding performance bottlenecks from cross-center write operations.
2. Conflict Resolution:
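  - Timestamp Priority: Prefer the write operation with the latest timestamp (last-write-wins). This requires clock synchronization across centers (e.g., via NTP), because clock skew can cause a newer write to be discarded.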
  - Business Rule Priority: For example, in e-commerce inventory conflicts, prioritize the operation that successfully deducted stock.
  - Manual Intervention: Complex conflicts are logged and escalated for manual handling.
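
As a rough sketch of the timestamp-priority rule, the function below picks the write with the later timestamp. The Write record and the tie-break on data center name are assumptions added for the example; the tie-break only exists so that every center reaches the same decision when timestamps are equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """A replicated write (hypothetical record shape for this sketch)."""
    key: str
    value: str
    timestamp_ms: int
    data_center: str


def resolve_lww(a: Write, b: Write) -> Write:
    """Last-write-wins: keep the write with the later timestamp.

    Equal timestamps are broken by data center name so the result does
    not depend on the order in which the two writes arrive.
    """
    return max(a, b, key=lambda w: (w.timestamp_ms, w.data_center))


if __name__ == "__main__":
    old = Write("user:42:email", "a@example.com", 1000, "beijing")
    new = Write("user:42:email", "b@example.com", 2000, "shanghai")
    print(resolve_lww(old, new).value)
```

Last-write-wins silently drops the losing write, which is why the business-rule and manual-intervention options remain necessary for data where a lost update is unacceptable.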
Globally Unique ID Generation:
Use algorithms like Snowflake or distributed ID generators incorporating data center identifiers to avoid ID conflicts across centers.
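
The following is a minimal sketch of a Snowflake-style generator that embeds a data center identifier. It uses the commonly cited bit layout (timestamp, then 5 bits for the data center, 5 bits for the worker, and 12 bits for a sequence number); the class name, the custom epoch value, and the injectable clock parameter are assumptions made for the example.

```python
import threading
import time


class SnowflakeGenerator:
    """Snowflake-style ID: timestamp | data center | worker | sequence."""

    DATACENTER_BITS = 5
    WORKER_BITS = 5
    SEQUENCE_BITS = 12

    def __init__(self, datacenter_id, worker_id,
                 epoch_ms=1_600_000_000_000, clock=None):
        if not 0 <= datacenter_id < (1 << self.DATACENTER_BITS):
            raise ValueError("datacenter_id out of range")
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms  # arbitrary custom epoch (assumption)
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to generate IDs")
            if now == self._last_ms:
                mask = (1 << self.SEQUENCE_BITS) - 1
                self._sequence = (self._sequence + 1) & mask
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - self.epoch_ms
            return (
                (timestamp << (self.DATACENTER_BITS + self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.datacenter_id << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


if __name__ == "__main__":
    beijing = SnowflakeGenerator(datacenter_id=1, worker_id=0)
    shanghai = SnowflakeGenerator(datacenter_id=2, worker_id=0)
    print(beijing.next_id(), shanghai.next_id())
```

Because the data center bits differ, two centers cannot produce the same ID even if they generate IDs in the same millisecond with the same sequence number.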
Disaster Recovery and Traffic Switching:
1. Monitor the status of each data center (e.g., network latency, failure rate).
2. When a center fails, reroute its traffic to the remaining centers via the routing layer and mark the failed center as "read-only" so that it accepts no new writes until it has been restored.
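
The two steps above can be sketched as a routing plan computed from health metrics. The thresholds, the CenterHealth record, and the function names are hypothetical; a real system would take these values from its monitoring stack.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class CenterHealth:
    """Health metrics for one data center (hypothetical record shape)."""
    name: str
    latency_ms: float
    failure_rate: float  # fraction of failed requests, 0.0 to 1.0


def is_healthy(h: CenterHealth, max_latency_ms=500.0, max_failure_rate=0.05) -> bool:
    """A center is healthy if both metrics stay within the assumed thresholds."""
    return h.latency_ms <= max_latency_ms and h.failure_rate <= max_failure_rate


def plan_routing(preferred: str, health: Dict[str, CenterHealth]) -> Tuple[str, Set[str]]:
    """Return (target center, set of centers to mark read-only).

    The preferred center is kept if it is healthy; otherwise traffic goes
    to the healthy center with the lowest latency.
    """
    read_only = {name for name, h in health.items() if not is_healthy(h)}
    candidates = [h for name, h in health.items() if name not in read_only]
    if not candidates:
        raise RuntimeError("no healthy data center available")
    if preferred in health and preferred not in read_only:
        return preferred, read_only
    return min(candidates, key=lambda h: h.latency_ms).name, read_only


if __name__ == "__main__":
    health = {
        "beijing": CenterHealth("beijing", latency_ms=20, failure_rate=0.90),
        "shanghai": CenterHealth("shanghai", latency_ms=40, failure_rate=0.01),
    }
    print(plan_routing("beijing", health))
```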

4. Practical Case: Multi-Site Active-Active for User Login

Data Partitioning: Users are assigned to Beijing or Shanghai data centers based on ID hashing. Login authentication occurs only at the user's assigned center.
Session Synchronization: After a successful login, session information is asynchronously replicated to the other centers, so users do not need to log in again when they are switched to another center.
Conflict Handling: If a user changes their password in Beijing and Shanghai simultaneously, prioritize the request with the latest timestamp.
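
The session synchronization step can be illustrated with an in-memory stand-in for the replication channel. In this sketch a deque plays the role of the message queue, and replicate_pending is called explicitly where a real system would run a background consumer; the class and method names are hypothetical.

```python
from collections import deque


class DataCenter:
    """Toy model of one site with a local session store and an outbox."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}     # user_id -> session token
        self.outbox = deque()  # pending changes to replicate
        self.peers = []        # other DataCenter instances

    def login(self, user_id, session_token):
        """Write locally and return at once; replication happens later."""
        self.sessions[user_id] = session_token
        self.outbox.append((user_id, session_token))

    def replicate_pending(self):
        """Deliver queued session changes to all peers; returns the count sent."""
        sent = 0
        while self.outbox:
            user_id, token = self.outbox.popleft()
            for peer in self.peers:
                peer.sessions[user_id] = token
            sent += 1
        return sent


if __name__ == "__main__":
    beijing, shanghai = DataCenter("beijing"), DataCenter("shanghai")
    beijing.peers.append(shanghai)
    beijing.login("user-42", "token-abc")
    print("before replication:", shanghai.sessions.get("user-42"))
    beijing.replicate_pending()
    print("after replication:", shanghai.sessions.get("user-42"))
```

The window between login and replicate_pending is the temporary inconsistency that the design accepts: a user rerouted during that window would have to log in again.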

5. Validation and Testing

Chaos Engineering: Simulate data center network outages or latency to verify the system's automatic failover and data recovery capabilities.
Consistency Verification: Periodically compare data across centers to ensure synchronization mechanisms are reliable.
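
A periodic consistency check can be sketched as a digest comparison between two centers. The example assumes each center can export its records as a dictionary keyed by primary key, which is a simplification; for large tables, comparing digests per key range rather than per record would likely be needed.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Stable digest of a record; keys are sorted so field order is irrelevant."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_centers(a: dict, b: dict) -> dict:
    """Compare two {primary_key: record} snapshots and report differences."""
    keys_a, keys_b = set(a), set(b)
    mismatched = sorted(
        k for k in keys_a & keys_b if record_digest(a[k]) != record_digest(b[k])
    )
    return {
        "missing_in_a": sorted(keys_b - keys_a),
        "missing_in_b": sorted(keys_a - keys_b),
        "mismatched": mismatched,
    }


if __name__ == "__main__":
    beijing = {"u1": {"name": "Li", "level": 3}, "u2": {"name": "Wang", "level": 1}}
    shanghai = {"u1": {"level": 3, "name": "Li"}, "u2": {"name": "Wang", "level": 2}}
    print(diff_centers(beijing, shanghai))
```

Records that are still in flight through asynchronous replication will show up as differences, so a check like this is usually rerun on the flagged keys after a delay before raising an alert.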

Summary
The core of multi-site active-active architecture is to tolerate temporary data inconsistency while ensuring availability through data partitioning, asynchronous replication, and intelligent routing. Design requires balancing business needs with complexity to avoid over-engineering.