Consensus Protocol

We can observe that the Gossip Protocol essentially does not guarantee Consistency, as the cluster can be divided into independent partitions.

Consensus

To build a CP (Consistency over Availability) system, many architectures adopt a protocol called Consensus. Consensus Protocol enables a group of nodes to agree on a value, even in the presence of failures ( Fault Tolerance ).

Consider a cluster of nodes. When a new node attempts to join, in a consensus-based system, it cannot simply ask a random node (e.g., Node A) for admission:

D (New node)ClusterNode ANode BNode CJoin

Instead, the cluster must reach a collective decision to approve the new node. This typically happens when a majority of nodes agree, for example, if nodes A and B approve D’s entry, it succeeds even if node C is down.

D (New node)ClusterNode ANode BNode C (down)ApproveJoin

The Consensus Protocol is an abstract theoretical concept with strict requirements. We won’t dive into all the theoretical details here. Instead, we’ll focus on a practical implementation: the Raft Consensus Algorithm

Raft Consensus Algorithm

Raft is a consensus protocol designed to manage a replicated log in distributed systems. It manages a cluster using two key principles:

Majority

Raft can tolerate network partitions as long as a majority of nodes remain connected. If the cluster splits into multiple partitions, the partition containing a majority continues to operate, while the minority partitions become unavailable.

ClusterNetwork partitionNode ANode BNode CPartition 1 (Available)Partition 2 (Unavailable)Node ANode BNode CDisconnectDisconnect

Loss Of Quorum

What if the cluster splits into equal partitions (e.g., one node each)? In that case, none of them can achieve majority, resulting in total unavailability for writes. This situation is known as Loss Of Quorum.

Network partitionPartition 1 (Unavailable)Partition 2 (Unavailable)Partition 3 (Unavailable)Node ANode BNode C

CP Design

This is how we achieve CP (Consistency over Availability): A Raft cluster ensures only one partition (the one with a majority) can operate at a time. This guarantees that at any moment, there is a single writer, ensuring consistency.

Raft Cluster

Let’s explore how to build a distributed cluster using the Raft algorithm.

Leader Node

Raft revolves around the concept of electing a temporary leader. At any given time, a Raft cluster has at most one Leader (or none). All other nodes are Followers.

A node becomes leader if it wins a majority vote. It remains leader as long as it is reachable.

Heartbeat

How does the cluster detect that a leader has failed? Each node uses a fixed timeout value. The leader must periodically send heartbeats to followers. If a follower’s heartbeat expires, it assumes the leader is down.

Once a node suspects the leader has failed, it transitions to a Candidate and initiates an election. Any node in the cluster can do this.

For example, Node B detects leader failure due to a heartbeat timeout, it then becomes a Candidate.

Cluster (Timeout = 3 seconds, Current time = 00:04)Node ANode BHeartbeat: 00:02Leader: UPState: FollowerHeartbeat: 00:02Leader: UPState: FollowerHeartbeat: 00:00Leader: DOWNState: CandidateHeartbeat: 00:00Leader: DOWNState: Candidate

Election Process

In the Raft election process, nodes vote for a candidate based on its term number, a logical counter representing election rounds.

  • For example, a node with Term = 3 has participated in three election rounds.
  • Nodes will only vote for candidates with higher terms, which ensures the system can always make progress.

Let’s walk through an example with three nodes and a timeout of 3 seconds. Suppose the leader has just become corrupted.

Current time = 00:03, Timeout = 3sNode ANode BNode CState: FollowerTerm: 0Heartbeat: 00:00State: FollowerTerm: 0Heartbeat: 00:00State: FollowerTerm: 1Heartbeat: 00:01State: FollowerTerm: 1Heartbeat: 00:01State: FollowerTerm: 1Heartbeat: 00:02State: FollowerTerm: 1Heartbeat: 00:02

Step 1: Timeout and Candidacy

Node A times out, transitions to a Candidate state, increments its term, and requests votes:

Current time = 00:03, Timeout = 3sNode A times outNode BNode CState: Follower -> CandidateTerm: 0 -> 1Heartbeat: 00:00State: Follower -> CandidateTerm: 0 -> 1Heartbeat: 00:00State: FollowerTerm: 1Heartbeat: 00:01State: FollowerTerm: 1Heartbeat: 00:01State: FollowerTerm: 1Heartbeat: 00:02State: FollowerTerm: 1Heartbeat: 00:02

Step 2: Voting Rules

Nodes only vote for candidates with higher terms:

  • If a node hasn’t voted or receives a higher term than itself, it will vote.
  • If another candidate matches or has a lower term, the node ignores the request.

Node B and Node C both ignore Node A’s vote request since their term is the same. Node A does not receive a majority and returns to the Follower state.

Current time = 00:03, Timeout = 3sNode ANode BNode CState: Candidate -> FollowerTerm: 1Heartbeat: 00:01State: Candidate -> FollowerTerm: 1Heartbeat: 00:01State: FollowerTerm: 1Heartbeat: 00:01State: FollowerTerm: 1Heartbeat: 00:01State: FollowerTerm: 1Heartbeat: 00:02State: FollowerTerm: 1Heartbeat: 00:02Ignores the votingIgnores the voting

At time 00:04, Node B times out, becomes a Candidate, and initiates a new election.

Current time = 00:04, Timeout = 3sNode ANode BNode CState: FollowerTerm: 1Heartbeat: 00:01State: FollowerTerm: 1Heartbeat: 00:01State: Follower -> CandidateTerm: 1 -> 2Heartbeat: 00:01State: Follower -> CandidateTerm: 1 -> 2Heartbeat: 00:01State: FollowerTerm: 1Heartbeat: 00:02State: FollowerTerm: 1Heartbeat: 00:02

Node B now has a higher term, so the other nodes vote for it and update their own terms accordingly.

Current time = 00:04, Timeout = 3sNode ANode CNode BState: FollowerTerm: 1 -> 2Heartbeat: 00:01State: FollowerTerm: 1 -> 2Heartbeat: 00:01State: FollowerTerm: 1 -> 2Heartbeat: 00:02State: FollowerTerm: 1 -> 2Heartbeat: 00:02State: CandidateTerm: 2Heartbeat: 00:01State: CandidateTerm: 2Heartbeat: 00:01Vote forVote for

Step 3: Leader Confirmation

A majority vote is confirmed, so Node B becomes the new Leader. After becoming leader, it sends regular heartbeat messages to all nodes to show it is alive.

Current time = 00:04, Timeout = 3sNode ANode CNode BState: FollowerTerm: 2Heartbeat: 00:01 -> 00:04State: FollowerTerm: 2Heartbeat: 00:01 -> 00:04State: FollowerTerm: 2Heartbeat: 00:02 -> 00:04State: FollowerTerm: 2Heartbeat: 00:02 -> 00:04State: Candidate -> LeaderTerm: 2State: Candidate -> LeaderTerm: 2Heartbeat at 00:04Heartbeat at 00:04

Why is term number a reliable logical counter?

  • Nodes increment their term after each timeout, so a node with a lower term must have participated in fewer election cycles than one with a higher term.
  • If a node adopts a higher term from another node, that implies the node has fallen behind the higher node.

Split Vote

What if no candidate achieves majority? This results in a Split Vote:

  • The cluster may retry with new terms and timeouts.
  • Some implementations may select a random leader.

Log Replication

Once a leader is elected, all state changes go through it. The leader writes changes to its log first and then replicates them to followers.

For example, when a new node wants to join, it contacts the leader: The leader logs the change and then replicates the update to others.

ClusterD (New node)LeaderFollower 1Follower 2JoinReplicateReplicate
Last updated on