Data Consistency Solutions in Distributed Systems

Problem Description: In distributed systems, data is dispersed across different nodes, and ensuring consistency among those nodes is a core challenge. Please explain in detail the main solutions for distributed data consistency and the principles behind them.

Problem-Solving Process:

  1. Understanding the Core Issue

    • In distributed systems, data typically has multiple replicas stored on different nodes to provide high availability and fault tolerance.
    • When data is updated, it needs to be synchronized to all replicas. However, due to factors like network latency and node failures, synchronization may fail or be delayed, leading to inconsistencies among data replicas on different nodes.
    • Core problem: In a distributed environment, after a data update operation, how can all replicas reach a consistent state?
  2. CAP Theorem: The Foundation of Consistency

    • C (Consistency): All nodes see the same data at the same time.
    • A (Availability): Every request receives a response (not guaranteed to be the latest data).
    • P (Partition Tolerance): The system continues to operate even when network partitions occur (some nodes cannot communicate).
    • The CAP theorem states that a distributed system can satisfy at most two of these three properties simultaneously. Since real networks inevitably partition, P is effectively mandatory, so the practical trade-off is between consistency (C) and availability (A).
  3. Strong Consistency Solution: Distributed ACID Transactions

    • Suitable for single databases or databases supporting distributed transactions (e.g., Google Spanner, TiDB).
    • Uses the Two-Phase Commit (2PC) protocol to ensure ACID properties for cross-node transactions (a minimal coordinator sketch follows this list):
      • Phase One (Prepare): The coordinator asks all participants whether they can commit. Each participant writes the pending change to its write-ahead log and locks the affected resources before voting.
      • Phase Two (Commit/Rollback): If all participants vote yes, the coordinator sends a commit command; otherwise, it sends a rollback command.
    • Advantages: Strong data consistency, simple business logic.
    • Disadvantages: Synchronous blocking lowers throughput; the coordinator is a single point of failure, and participants remain blocked holding locks if it crashes after the prepare phase.
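
The control flow of 2PC fits in a few lines. The sketch below is a minimal in-process illustration, assuming a hypothetical Participant class with an in-memory log standing in for a real write-ahead log; timeouts, coordinator recovery, and durable decision logging are omitted.

```python
from enum import Enum
from typing import List


class Vote(Enum):
    COMMIT = "commit"
    ABORT = "abort"


class Participant:
    """In-memory stand-in for a node holding part of the transaction's data."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.log: List[str] = []  # stand-in for a write-ahead log

    def prepare(self, txn_id: str) -> Vote:
        # A real participant persists the pending change and locks resources
        # here, and votes COMMIT only if both steps succeed.
        if not self.healthy:
            return Vote.ABORT
        self.log.append(f"prepared {txn_id}")
        return Vote.COMMIT

    def commit(self, txn_id: str) -> None:
        self.log.append(f"committed {txn_id}")

    def rollback(self, txn_id: str) -> None:
        self.log.append(f"rolled back {txn_id}")


def two_phase_commit(participants: List[Participant], txn_id: str) -> bool:
    # Phase one: ask every participant whether it can commit.
    votes = [p.prepare(txn_id) for p in participants]
    decision = all(v is Vote.COMMIT for v in votes)
    # Phase two: unanimous yes -> commit everywhere; otherwise roll back.
    # (A real coordinator durably logs the decision before broadcasting it.)
    for p in participants:
        p.commit(txn_id) if decision else p.rollback(txn_id)
    return decision


nodes = [Participant("a"), Participant("b"), Participant("c", healthy=False)]
assert two_phase_commit(nodes, "txn-1") is False  # one ABORT rolls back all
```
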
  4. Eventual Consistency Solution: The BASE Model

    • BA (Basically Available): When the system fails, some functions may be unavailable, but core functions remain available.
    • S (Soft State): The system may have intermediate states that do not affect overall availability.
    • E (Eventual Consistency): After a period of time, all replicas will eventually reach a consistent state.
    • Typical implementations:
      • Asynchronous Replication: After the primary node updates, it asynchronously synchronizes to secondary nodes. Read operations during the delay may retrieve stale data.
      • Conflict Resolution Mechanisms: Such as version vectors or Conflict-free Replicated Data Types (CRDTs), which merge concurrent updates automatically (a minimal CRDT is sketched below).
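
To make "automatic conflict resolution" concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs; the node IDs and the merge-on-sync flow are illustrative assumptions rather than any particular library's API.

```python
from collections import defaultdict


class GCounter:
    """Grow-only counter CRDT: replicas accept writes independently and
    converge once they exchange state, whatever the sync order."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = defaultdict(int)  # per-node increment totals

    def increment(self, amount: int = 1) -> None:
        # Each node only ever increments its own slot, so updates commute.
        self.counts[self.node_id] += amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is associative, commutative, and idempotent,
        # which is what guarantees eventual convergence.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts[node], count)


# Two replicas take writes concurrently, then converge after merging.
a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```
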
  5. Quorum Mechanism: Balancing Consistency and Availability

    • Define parameters: N (total number of replicas), W (number of replicas that must succeed for a write operation), R (number of replicas that must be queried for a read operation).
    • Rule: When W + R > N, every read quorum overlaps every write quorum in at least one replica, so a read is guaranteed to see the latest acknowledged write; when W + R ≤ N, quorums may miss each other, and the system offers only eventual consistency.
    • Example: N=3, W=2, R=2 → Reads and writes must succeed on a majority of nodes, giving strong consistency at the cost of higher latency; N=3, W=1, R=1 → Faster reads and writes, but stale data may be returned (a runnable illustration follows).
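
The overlap rule is easy to demonstrate with in-memory stand-ins. In this sketch the Replica class, the per-key version numbers, and the sequential requests are all simplifying assumptions; a real client would contact replicas in parallel over the network.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Versioned:
    version: int
    value: str


class Replica:
    """In-memory stand-in for one of the N replicas of a key-value store."""

    def __init__(self) -> None:
        self.store: Dict[str, Versioned] = {}


def quorum_write(replicas: List[Replica], w: int, key: str, value: str, version: int) -> bool:
    # Stop after W acknowledgements; the remaining replicas stay stale,
    # which is exactly why the read-side overlap rule matters.
    for acks, replica in enumerate(replicas, start=1):
        replica.store[key] = Versioned(version, value)
        if acks >= w:
            return True
    return False


def quorum_read(replicas: List[Replica], r: int, key: str) -> Optional[Versioned]:
    # Deliberately query the *last* R replicas: when W + R > N, this read
    # set still overlaps the first-W write set in at least one replica.
    answers = [rep.store[key] for rep in replicas[-r:] if key in rep.store]
    return max(answers, key=lambda v: v.version, default=None)


cluster = [Replica(), Replica(), Replica()]           # N = 3
quorum_write(cluster, w=2, key="x", value="v1", version=1)
latest = quorum_read(cluster, r=2, key="x")           # W + R = 4 > N
assert latest is not None and latest.value == "v1"    # overlap guarantees this
```
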
  6. Distributed Consensus Algorithms: Paxos/Raft

    • Used to achieve consensus in failure-prone distributed environments (e.g., leader election, log replication).
    • Raft Algorithm Process (taking leader election as an example):
      • Node roles: Leader, Follower, Candidate.
      • Election: If a Follower receives no heartbeat from the Leader within its election timeout, it becomes a Candidate, increments the term, votes for itself, and requests votes from the other nodes. A Candidate that gathers votes from a majority becomes the new Leader (see the simplified sketch after this list).
      • Log Replication: Client write requests go to the Leader, which appends them to its log and replicates them to Followers. An entry is committed once a majority of nodes have persisted it.
    • Advantages: Strong consistency, high fault tolerance.
    • Application Scenarios: Distributed coordination services such as etcd and Consul.
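
The election rules translate into a short sketch. The version below is deliberately simplified and assumes an in-process cluster: real Raft also compares log freshness before granting a vote and drives elections from concurrent timers, both omitted here.

```python
import random
from enum import Enum
from typing import List, Optional


class Role(Enum):
    FOLLOWER = 1
    CANDIDATE = 2
    LEADER = 3


class Node:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.role = Role.FOLLOWER
        self.current_term = 0
        self.voted_for: Optional[int] = None
        # Randomized timeouts make simultaneous candidacies (split votes) rare.
        self.election_timeout_ms = random.uniform(150, 300)

    def request_vote(self, term: int, candidate_id: int) -> bool:
        # Step down to the newer term, then grant at most one vote per term.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def start_election(candidate: Node, cluster: List[Node]) -> bool:
    # On heartbeat timeout, a Follower becomes a Candidate for a new term,
    # votes for itself, and asks every peer for a vote.
    candidate.role = Role.CANDIDATE
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    votes = 1 + sum(
        peer.request_vote(candidate.current_term, candidate.node_id)
        for peer in cluster
        if peer is not candidate
    )
    if votes > len(cluster) // 2:  # majority, counting the self-vote
        candidate.role = Role.LEADER
        return True
    return False


cluster = [Node(i) for i in range(3)]
assert start_election(cluster[0], cluster)  # both peers grant their vote
assert cluster[0].role is Role.LEADER
```
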
  7. Practical Scenario Selection Suggestions

    • Financial Transactions: Prioritize strong consistency (e.g., 2PC), sacrificing some availability.
    • Social Networks: Prioritize availability (e.g., asynchronous queues + eventual consistency), allowing temporary inconsistencies.
    • Distributed Storage: Adjust Quorum parameters (W, R, N) based on requirements to balance consistency and performance.

Through the steps above, you can select an appropriate consistency solution based on the business scenario, finding the optimal balance between performance and correctness.