This article provides a comprehensive guide to the Raft consensus algorithm, as detailed in the canonical "Raft Guide." Raft is designed for understandability, decomposing the complex problem of consensus into three relatively independent subproblems: leader election, log replication, and safety. Its primary goal is to provide a clear and manageable alternative to other consensus algorithms like Paxos, making distributed consensus more accessible to a broader audience of system builders.
Table of Contents
- Core Principles and State
- Leader Election: Establishing Authority
- Log Replication: Ensuring Consistency
- Safety and Commitment Rules
- Cluster Membership Changes
- Log Compaction and Snapshotting
- Conclusion: The Value of Understandability
Core Principles and State
Raft operates on a fundamental principle: a single strong leader is responsible for managing the replicated log. This leader-centric approach simplifies the algorithm's logic, as all client interactions and log management flow through one node at a time. Every server in a Raft cluster exists in one of three states: Leader, Follower, or Candidate. The leader handles all client requests, appending entries to its log and replicating them to followers. Followers are passive, simply responding to requests from the leader and candidates. The candidate state is a transient phase used during leader elections.
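As a minimal sketch (in Go, assuming a hand-rolled implementation rather than any particular library), the three roles can be modeled as a simple enumeration; later sketches in this article build on this package:

```go
package raft

// State is the role a server currently plays in the cluster.
type State int

const (
	Follower  State = iota // passive: responds to leaders and candidates
	Candidate              // transient: campaigning for leadership
	Leader                 // active: owns the log and client traffic
)
```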
Each server maintains several persistent properties that survive crashes, including the current term number, the candidate voted for in that term, and its log entries. Crucial volatile state includes the index of the highest log entry known to be committed and the index of the highest log entry applied to the state machine. The leader also maintains nextIndex and matchIndex for each follower to track replication progress. The term number is a logical clock that increases monotonically and is central to detecting stale leaders and ensuring safety during elections.
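Continuing the Go sketch, the state described above might be grouped as follows. Field names follow the paper's conventions; `id` and `peers` are assumptions added so that later examples can refer to them:

```go
// LogEntry pairs a state machine command with the term of its creation.
type LogEntry struct {
	Term    int
	Command []byte
}

// Server collects the per-server state described above.
type Server struct {
	id    int   // this server's ID (assumption, used by later sketches)
	peers []int // IDs of the other servers (assumption)
	state State

	// Persistent state: written to stable storage before answering RPCs.
	currentTerm int        // latest term seen (logical clock)
	votedFor    int        // candidate voted for in currentTerm; -1 if none
	log         []LogEntry // log entries; logically 1-indexed

	// Volatile state on all servers.
	commitIndex int // highest index known to be committed
	lastApplied int // highest index applied to the state machine

	// Volatile state on leaders, reinitialized after each election.
	nextIndex  map[int]int // next log index to send to each follower
	matchIndex map[int]int // highest index known replicated on each follower
}
```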
Leader Election: Establishing Authority
Raft uses randomized timeouts to initiate leader elections, a key mechanism for robustness. Each follower runs an election timer that is reset whenever it receives communication from a valid leader or candidate. If the timer expires without hearing from a leader, the follower assumes the leader has failed and transitions to the candidate state. The candidate increments its current term, votes for itself, and issues RequestVote RPCs to all other servers.
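A sketch of this timeout-driven transition, reusing the `Server` type above; `persist` and `broadcastRequestVote` are hypothetical helpers standing in for stable storage and RPC plumbing:

```go
// onElectionTimeout fires when no heartbeat arrived within the timeout.
func (s *Server) onElectionTimeout() {
	s.state = Candidate
	s.currentTerm++          // begin a new term
	s.votedFor = s.id        // vote for ourselves
	s.persist()              // hypothetical: flush term and vote to stable storage
	s.broadcastRequestVote() // hypothetical: RequestVote RPC to every peer
}
```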
A candidate wins the election and becomes leader if it receives votes from a majority of the cluster for that term. Because each server grants at most one vote per term, two candidates can never both assemble a majority, so at most one leader can be elected per term, a critical safety property. If the candidate discovers another server with a higher term, it steps down to follower; if the election times out without a winner, it starts a new election. Split votes, where no candidate secures a majority, are resolved by the randomized election timeouts: candidates restart with fresh, staggered timers, which typically leads to a quick resolution.
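Both mechanisms are small enough to sketch directly; the 150–300 ms range below follows the timeout values suggested in the Raft paper, and actual deployments tune it to their network:

```go
import (
	"math/rand"
	"time"
)

// randomElectionTimeout staggers candidates so split votes are rare
// and short-lived.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// wonElection: a strict majority of the full cluster, counting the
// candidate's own vote.
func wonElection(votes, clusterSize int) bool {
	return votes > clusterSize/2
}
```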
Log Replication: Ensuring Consistency
Once elected, the leader begins servicing client requests. Each request contains a command to be executed by the replicated state machine. The leader appends the command as a new entry to its log, then issues AppendEntries RPCs in parallel to each follower to replicate the entry. An entry is considered committed once it is stored on a majority of servers. The leader then applies the committed entry to its state machine and notifies the client of the result. The leader also informs followers of the commit index in subsequent AppendEntries RPCs, allowing them to safely apply entries to their own state machines.
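In the running Go sketch, the leader's client-facing path might look like the following; `replicateTo` is a hypothetical helper that sends AppendEntries RPCs (with retries) to one follower, and the reply to the client is deferred until the entry is committed and applied:

```go
// propose appends a client command to the leader's log and fans out
// replication in parallel; it returns the entry's logical index.
func (s *Server) propose(cmd []byte) int {
	s.log = append(s.log, LogEntry{Term: s.currentTerm, Command: cmd})
	s.persist()           // hypothetical: stable storage before replication
	index := len(s.log)   // logical index of the new entry (1-indexed)
	for _, peer := range s.peers {
		go s.replicateTo(peer) // hypothetical: parallel AppendEntries RPC
	}
	return index
}
```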
The leader enforces a strict log consistency property. Each log entry stores the leader's term number when the entry was created, and the AppendEntries RPC includes the index and term of the entry immediately preceding the new ones. A follower will only accept new entries if this previous log information matches its own log. In cases of inconsistency, the leader will repeatedly decrement the nextIndex for that follower and retry until log matching is achieved, effectively forcing the follower's log to converge with the leader's. This mechanism ensures all logs are eventually identical across the committed prefix.
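The follower-side consistency check reduces to a few lines; with the logically 1-indexed log used above, entry i lives at slice position i-1:

```go
// logMatches reports whether this follower's log contains an entry at
// prevLogIndex whose term is prevLogTerm, i.e. whether the new entries
// may safely be appended after that point.
func (s *Server) logMatches(prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // appending at the very start always matches
	}
	if prevLogIndex > len(s.log) {
		return false // log too short: leader will decrement nextIndex and retry
	}
	return s.log[prevLogIndex-1].Term == prevLogTerm
}
```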
Safety and Commitment Rules
Raft incorporates several critical rules to guarantee safety under all failure scenarios. The Election Restriction is paramount: a candidate can win an election only if its log is at least as up-to-date as the log of each server in the majority that votes for it. "Up-to-date" is determined by comparing the term and then the index of the last log entry. Because any committed entry is stored on a majority of servers, and any two majorities intersect, this rule guarantees that the winner's log contains every committed entry; a candidate with a stale log can never become leader.
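The comparison itself is tiny; in the Go sketch, a voter grants its vote only when this returns true for the candidate's last entry against its own:

```go
// upToDate compares two logs by their final entries: the higher last
// term wins; on equal terms, the longer log (higher last index) wins.
func upToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}
```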
Furthermore, a leader never commits entries from previous terms by counting replicas. It advances the commit index only for entries from its own term once they are stored on a majority; entries from earlier terms then become committed indirectly, because the Log Matching Property guarantees that everything preceding a committed entry is committed as well. This subtle rule prevents an entry that was replicated on a majority under an old term, but never safely committed, from being treated as committed and later overwritten after a leader change. These safety properties collectively ensure State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
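A sketch of this commit rule in the running Go example: scan candidate indexes from the tail of the log and advance commitIndex only past entries from the leader's own term that a majority stores:

```go
// maybeAdvanceCommit moves commitIndex forward when an entry from the
// leader's current term is stored on a majority; earlier entries then
// commit implicitly via the Log Matching Property.
func (s *Server) maybeAdvanceCommit(clusterSize int) {
	for n := len(s.log); n > s.commitIndex; n-- {
		if s.log[n-1].Term != s.currentTerm {
			break // never count replicas for entries from older terms
		}
		count := 1 // the leader stores every entry itself
		for _, m := range s.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > clusterSize/2 {
			s.commitIndex = n
			break
		}
	}
}
```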
Cluster Membership Changes
Changing the set of servers in the cluster, such as adding or removing a node, must be done without compromising consensus safety. Raft addresses this through a joint consensus approach. The cluster first switches to an intermediate configuration that combines the old and new sets of servers; this joint configuration is committed to the log using the standard replication mechanism. Only after the joint configuration is committed does the cluster transition to the new configuration alone. This two-phase change guarantees that at no point during the transition can the old and new configurations form disjoint majorities and make independent, conflicting decisions.
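A sketch of what this means operationally: while the joint configuration is in effect, any quorum check (for elections and for commitment alike) must pass in both the old and the new server sets. The `Config` type here is an assumption for illustration:

```go
// Config describes the cluster membership; New is nil outside a
// joint-consensus transition.
type Config struct {
	Old []int // server IDs in the old configuration
	New []int // server IDs in the new configuration, if transitioning
}

// quorumReached requires a majority of Old, and of New when present.
func (c Config) quorumReached(acks map[int]bool) bool {
	if !majorityOf(c.Old, acks) {
		return false
	}
	return c.New == nil || majorityOf(c.New, acks)
}

func majorityOf(ids []int, acks map[int]bool) bool {
	count := 0
	for _, id := range ids {
		if acks[id] {
			count++
		}
	}
	return count > len(ids)/2
}
```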
Log Compaction and Snapshotting
Left unchecked, the log grows without bound, so storage and replay time become practical concerns. Raft employs snapshotting for log compaction. Periodically, each server takes a snapshot of its current state machine state, recording alongside it the index and term of the last log entry the snapshot covers. The entire log prefix up to that index can then be discarded. When a follower's log is too far behind for the leader to bring it up to date with AppendEntries, the leader sends its snapshot using an InstallSnapshot RPC; the follower discards the log entries the snapshot covers, installs the snapshot as its state machine state, and receives any subsequent entries through normal replication. This mechanism allows Raft to operate efficiently in long-running systems without requiring unbounded storage.
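A sketch of the compaction step in the running Go example; `Snapshot` mirrors the fields carried by the InstallSnapshot RPC, and storing the snapshot itself on the server is elided:

```go
// Snapshot captures the state machine up to a log prefix.
type Snapshot struct {
	LastIncludedIndex int    // last log index the snapshot covers
	LastIncludedTerm  int    // term of that entry
	Data              []byte // serialized state machine state
}

// compact discards the log prefix covered by the snapshot
// (the log is logically 1-indexed, so entry i sits at slice i-1).
func (s *Server) compact(snap Snapshot) {
	if snap.LastIncludedIndex < len(s.log) {
		s.log = append([]LogEntry(nil), s.log[snap.LastIncludedIndex:]...)
	} else {
		s.log = nil
	}
}
```

Real implementations also record the snapshot's last included index as a log offset so that logical entry indexes remain stable after truncation.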
Conclusion: The Value of Understandability
The Raft consensus algorithm, as presented in the Raft Guide, represents a significant achievement in distributed systems design. By prioritizing clarity and decomposability, it demystifies a core component of reliable distributed computing. Its strong leader model, clear separation of concerns into leader election, log replication, and safety, along with its meticulous handling of edge cases through mechanisms like the Election Restriction and joint consensus, provide a robust foundation for building fault-tolerant systems. The algorithm's explicit design for understandability does not come at the cost of performance or correctness; rather, it makes these properties more accessible, verifiable, and teachable. For engineers implementing replicated state machines, from key-value stores to configuration managers, Raft offers a practical, well-specified, and comprehensible path to achieving consensus.