Consensus Algorithms in Distributed Systems: A Detailed Explanation of the Raft Algorithm

Consensus Algorithms in Distributed Systems: A Detailed Explanation of the Raft Algorithm

Problem Description: Please explain the core goals of consensus algorithms in distributed systems, and elaborate in detail on how the Raft algorithm achieves consensus through its core mechanisms (leader election, log replication, and safety). Please state the main advantages of Raft compared to other consensus algorithms (such as Paxos).

Problem-solving Process / Knowledge Point Explanation:

Consensus algorithms are the cornerstone of distributed systems. Their goal is to allow a group of computers (servers) to reach agreement on a value or a series of operations (i.e., the operation log of a state machine) even when some members may fail or the network experiences delays/partitions. Without it, highly available, strongly consistent systems cannot be built.

The Raft algorithm was designed to be more understandable than traditional algorithms like Paxos. Its core idea is to decompose the consensus problem into three relatively independent sub-problems.

Step 1: Basic Settings of Raft

Before diving into the core mechanisms, let's establish some basic concepts:

Node States: At any given time, each server node is in one of the following three states:
- Leader: Responsible for handling all client requests and managing log replication. There is one and only one valid leader in a Raft cluster.
- Follower: Passively responds to requests from the leader and candidates. They are the "silent majority" of the system.
- Candidate: When a follower does not receive messages from the leader for a certain period, it can transition to the candidate state, initiate an election, and attempt to become the new leader. This is a temporary state.
Term: Raft divides time into arbitrarily long "terms," identified by consecutively increasing numbers (e.g., term 1, term 2...). Each term begins with an election. If a candidate wins the election, it serves as the leader for that term until the term ends. Terms serve as a logical clock to identify stale information.
RPC Communication: Nodes communicate via Remote Procedure Calls (RPC), primarily using two types:
- RequestVote RPC: Initiated by candidates during elections to solicit votes.
- AppendEntries RPC: Periodically initiated by the leader, serving two purposes: a) Heartbeat, to maintain its authority and prevent new elections; b) Replicate log entries, to synchronize new operation commands to followers.

Step 2: Core Mechanism One - Leader Election

This is key to Raft ensuring system availability. The goal is to quickly and safely elect a new leader when the old leader fails.

Election Trigger Condition: Each follower maintains a randomized election timeout timer (e.g., 150-300 milliseconds). This timer is reset whenever it receives a valid RPC from the current leader or a candidate. If the timer expires (meaning it hasn't heard from the leader for a while), the follower assumes the leader has failed and starts a new election.
Election Process:
- The follower increments its current term number and transitions to the candidate state.
- It votes for itself first.
- It then sends RequestVote RPCs to all other servers in the cluster in parallel.
- The RPC contains the candidate's term number and log information.
Voting Rules: Each node can vote at most once per term (first-come, first-served). A node will only vote for a candidate that meets the following conditions:
- The candidate's term number >= its own current term number.
- The candidate's log is at least as up-to-date as its own (a safety rule, explained later).
- It has not voted yet in this term.
Election Outcome:
- Winning the Election: If a candidate receives votes from a majority (N/2 + 1) of the nodes, it successfully becomes the new leader. Upon becoming the leader, it immediately sends heartbeats (empty AppendEntries RPCs) to all nodes to announce its status and prevent new elections.
- Losing the Election: If, while waiting for votes, a candidate receives an AppendEntries RPC from another server claiming to be the leader, and the term number in that RPC >= its own term, the candidate will acknowledge this leader as legitimate and revert to the follower state.
- Election Timeout / Split Vote: If no candidate receives a majority of votes (e.g., multiple followers become candidates simultaneously, splitting the vote), the election yields no result. Each candidate's election timeout timer is randomized again and restarted. Eventually, one candidate will time out and start a new election (with an incremented term number). The randomized timeout times significantly reduce the probability of repeated split votes.

Step 3: Core Mechanism Two - Log Replication

Once a leader is elected, it begins serving client requests. All client requests must be sent to the leader.

Receive Client Command: The leader appends the command contained in the client request as a new log entry to its own log. This entry includes the term number and the command itself.
Replicate Log to Followers: The leader sends this new entry to all followers in parallel via AppendEntries RPCs.
Follower Response: Upon receiving the RPC, each follower performs consistency checks (e.g., whether the previous log's index and term match). If the checks pass, it appends the new entry to its local log and returns a success response to the leader.
Leader Commits Log: When the leader receives successful responses from a majority of followers, it considers that log entry to be committed. A committed log entry is finalized and will never be rolled back.
- The leader applies the command to its own state machine and returns the execution result to the client.
- In subsequent AppendEntries RPCs (including heartbeats), the leader notifies all followers of the latest committed index. Followers apply an entry to their own state machine only after learning it is committed.

Log Composition: Each log entry contains three pieces of information:

Index: The slot number in the log.
Term Number: The term of the leader that created the entry.
Command: The operation for the state machine to execute.

Log Consistency: Raft maintains strong consistency by forcing followers' logs to be identical to the leader's. If a follower's log becomes inconsistent (e.g., due to rejoining after a network partition), the leader uses the consistency check in AppendEntries RPCs to backtrack and retransmit step by step, eventually overwriting the inconsistent logs on the follower.

Step 4: Core Mechanism Three - Safety

The election and replication mechanisms described above are foundational, but additional key safety rules are required to guarantee the correctness of consensus.

Election Restriction: This is the most important safety rule. It requires that a candidate must possess all committed log entries to win an election. This is reflected in the voting rule as "the candidate's log is at least as up-to-date as its own."
- Comparison Method: Compare the last log entry of two nodes. First, compare the term number of the last log entry; the log with the larger term number is more up-to-date. If the term numbers are the same, the log with the larger index is more up-to-date.
- Purpose: This rule ensures that only nodes containing all committed entries can potentially become leaders, thereby preventing the "data loss" problem where committed logs could be overwritten by a new leader.
Committing Log Entries from Previous Terms: A leader can only indirectly commit log entries from previous terms by replicating a log entry from its current term. Specifically, when a leader replicates an entry from its current term to a majority of nodes, all preceding log entries (including those from previous terms) are automatically considered committed due to the "Election Restriction." This prevents a stale leader from "ghost reappearing" and committing old logs, which could cause data inconsistency.

Step 5: Main Advantages of Raft

Compared to the classic Paxos algorithm, Raft's advantages are mainly reflected in:

Understandability: By decomposing the problem (leader election, log replication, safety) and using a stronger concept of leadership, Raft significantly reduces the difficulty of understanding and teaching. Paxos is "notoriously" obscure and difficult to understand.
Ease of Implementation: Clear specifications make Raft's implementation more straightforward and easier to prove correct. There are now high-quality open-source implementations in multiple languages.
Decoupled Design: Its modular design (separating consensus, membership changes, log compaction, etc.) makes the system easier to maintain and extend.

Through the detailed breakdown in the above five steps, we can see how Raft, in a clear and structured manner, robustly achieves consensus in distributed systems.