Consensus Protocol
We can observe that the Gossip Protocol essentially does not guarantee Consistency, as the cluster can be divided into independent partitions.
Consensus
To build a CP (Consistency over Availability) system, many architectures adopt a protocol called Consensus. Consensus Protocol enables a group of nodes to agree on a value, even in the presence of failures ( Fault Tolerance ).
Consider a cluster of nodes.
When a new node attempts to join,
in a consensus-based system, it cannot simply ask a random node (e.g., Node A
) for admission:
Instead, the cluster must reach a collective decision to approve the new node.
This typically happens when a majority of nodes agree, for example,
if nodes A
and B
approve D
’s entry, it succeeds even if node C
is down.
The Consensus Protocol is an abstract theoretical concept with strict requirements. We won’t dive into all the theoretical details here. Instead, we’ll focus on a practical implementation: the Raft Consensus Algorithm
Raft Consensus Algorithm
Raft is a consensus protocol designed to manage a replicated log in distributed systems. It manages a cluster using two key principles:
Majority
Raft can tolerate network partitions as long as a majority of nodes remain connected. If the cluster splits into multiple partitions, the partition containing a majority continues to operate, while the minority partitions become unavailable.
Loss Of Quorum
What if the cluster splits into equal partitions (e.g., one node each)? In that case, none of them can achieve majority, resulting in total unavailability for writes. This situation is known as Loss Of Quorum.
CP Design
This is how we achieve CP (Consistency over Availability): A Raft cluster ensures only one partition (the one with a majority) can operate at a time. This guarantees that at any moment, there is a single writer, ensuring consistency.
Raft Cluster
Let’s explore how to build a distributed cluster using the Raft algorithm.
Leader Node
Raft revolves around the concept of electing a temporary leader. At any given time, a Raft cluster has at most one Leader (or none). All other nodes are Followers.
A node becomes leader if it wins a majority vote. It remains leader as long as it is reachable.
Heartbeat
How does the cluster detect that a leader has failed? Each node uses a fixed timeout value. The leader must periodically send heartbeats to followers. If a follower’s heartbeat expires, it assumes the leader is down.
Once a node suspects the leader has failed, it transitions to a Candidate and initiates an election. Any node in the cluster can do this.
For example, Node B
detects leader failure due to a heartbeat timeout,
it then becomes a Candidate.
Election Process
In the Raft election process, nodes vote for a candidate based on its term number, a logical counter representing election rounds.
- For example, a node with
Term = 3
has participated in three election rounds. - Nodes will only vote for candidates with higher terms, which ensures the system can always make progress.
Let’s walk through an example with three nodes and a timeout of 3 seconds. Suppose the leader has just become corrupted.
Step 1: Timeout and Candidacy
Node A
times out, transitions to a Candidate state, increments its term, and requests votes:
Step 2: Voting Rules
Nodes only vote for candidates with higher terms:
- If a node hasn’t voted or receives a higher term than itself, it will vote.
- If another candidate matches or has a lower term, the node ignores the request.
Node B
and Node C
both ignore Node A
’s vote request since their term is the same.
Node A
does not receive a majority and returns to the Follower state.
At time 00:04, Node B
times out, becomes a Candidate, and initiates a new election.
Node B
now has a higher term, so the other nodes vote for it and update their own terms accordingly.
Step 3: Leader Confirmation
A majority vote is confirmed, so Node B
becomes the new Leader.
After becoming leader, it sends regular heartbeat messages to all nodes to show it is alive.
Why is term number a reliable logical counter?
- Nodes increment their term after each timeout, so a node with a lower term must have participated in fewer election cycles than one with a higher term.
- If a node adopts a higher term from another node, that implies the node has fallen behind the higher node.
Split Vote
What if no candidate achieves majority? This results in a Split Vote:
- The cluster may retry with new terms and timeouts.
- Some implementations may select a random leader.
Log Replication
Once a leader is elected, all state changes go through it. The leader writes changes to its log first and then replicates them to followers.
For example, when a new node wants to join, it contacts the leader: The leader logs the change and then replicates the update to others.