Gossip Protocol

We’ll explore the first approach to maintaining a peer-to-peer cluster: the Gossip Protocol.

Eager Reliable Broadcast Protocol

Eager Reliable Broadcast is a communication protocol in distributed systems that ensures messages are broadcast to all participants. In this protocol, each peer continuously exchanges information with every other peer.

[Diagram: a cluster of Server 1 through Server 4, each peer exchanging messages directly with every other peer]

This protocol follows an eager delivery model, where messages are immediately sent to all peers upon generation. While effective, this approach becomes inefficient as the number of peers grows: the redundant message exchanges consume substantial bandwidth and system resources.

To address these inefficiencies, the Gossip Protocol was introduced, reducing resource usage by limiting the number of exchanges.

Gossip Protocol

The Gossip Protocol is akin to how rumors spread in an office. A peer starts by sharing a message with a few randomly selected peers. These peers then forward the message to others, eventually reaching all nodes in the cluster.

For example, Server 1 first informs Server 2 and Server 3 of a piece of information. They then propagate it to Server 4. Eventually, every server in the cluster has received the information.

[Diagram: three snapshots of the cluster as the message spreads from Server 1 to Server 2 and Server 3, and then on to Server 4]
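
In Python, a gossip round is essentially a fan-out to a small random subset of peers. The sketch below is illustrative only; the send() transport and the fanout value are assumptions, not part of any specific implementation:

```python
import random

FANOUT = 2  # how many random peers each node contacts per round (illustrative value)

def send(peer, message):
    # Placeholder transport; a real system would use TCP/UDP here.
    print(f"send {message!r} to {peer}")

def gossip(message, peers):
    """Forward a message to a small random subset of peers."""
    targets = random.sample(peers, min(FANOUT, len(peers)))
    for peer in targets:
        send(peer, message)  # each recipient repeats the procedure with its own peer list

gossip("shard map updated", ["Server 2", "Server 3", "Server 4"])
```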

Maintaining Cluster Membership

Each server maintains information such as addresses, states, and shard allocations of other nodes.

For example:

[Diagram: nodes A (1.1.1.1) and B (2.2.2.2), each storing the other's entry with State UP and its address]
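
Each node's local view can be modeled as a small table keyed by node name. The sketch below mirrors the fields in the example (state and address, plus a placeholder for shard allocations); the exact structure is an assumption for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Member:
    address: str
    state: str = "UP"                           # UP or DOWN
    shards: list = field(default_factory=list)  # shard ranges owned by this node

# A's view of the cluster and B's view of the cluster, matching the example above
membership_a = {"B": Member(address="2.2.2.2")}
membership_b = {"A": Member(address="1.1.1.1")}
```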

Adding a Node

Bootstrap nodes serve as entry points for new nodes and are typically reached via static IPs or DNS.

In this example, A acts as a bootstrap node. When node C joins, it sends its information to A and receives the current cluster metadata.

[Diagram: C (3.3.3.3) asks bootstrap node A (1.1.1.1) to join; A records C as UP (newly added) and replies with the current cluster metadata, so C learns about A and B]

A then informs B of C’s arrival, allowing B to update its view of the cluster.

[Diagram: A informs B of C's arrival; B adds C (State UP, 3.3.3.3) to its membership table]

Now, all members are aware of C, meaning it has successfully joined the cluster. Based on this information, the system can redistribute data if necessary.

[Diagram: A, B, and C each hold a full membership table listing the other two nodes as UP with their addresses]
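
A minimal sketch of the join handshake, assuming an in-process Node class standing in for real RPC calls: the newcomer sends its own entry to the bootstrap node and merges the metadata it receives back.

```python
class Node:
    def __init__(self, name, address):
        self.name = name
        self.address = address
        # Every node keeps its own membership table, including itself.
        self.membership = {name: {"state": "UP", "address": address}}

    def handle_join(self, new_entry):
        """Bootstrap side: record the newcomer and return the full cluster view."""
        self.membership.update(new_entry)
        return dict(self.membership)

    def join(self, bootstrap):
        """Newcomer side: announce ourselves and adopt the returned cluster metadata."""
        view = bootstrap.handle_join({self.name: {"state": "UP", "address": self.address}})
        self.membership.update(view)

a = Node("A", "1.1.1.1")
a.membership["B"] = {"state": "UP", "address": "2.2.2.2"}   # A already knows about B
c = Node("C", "3.3.3.3")
c.join(a)            # C asks bootstrap node A to join
print(c.membership)  # C now knows A and B; gossip rounds spread C's entry to B
```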

Forwarding

With consistent hashing and shared metadata, any node can act as an interface:

  • Serving read requests directly or forwarding them to respective replicas.
  • Routing write requests to the owner node.
[Diagram: a client writes a record to Server 1, which calculates the appropriate node and forwards the request to Server 2]
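
Assuming consistent hashing, routing is a hash-ring lookup followed by either a local write or a forward. The owner_of() helper and the MD5-based ring below are illustrative choices, not prescribed by the protocol:

```python
import bisect
import hashlib

nodes = ["Server 1", "Server 2", "Server 3"]

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Build the ring as a sorted list of (hash, node) entries.
ring = sorted((h(node), node) for node in nodes)

def owner_of(key):
    """Walk clockwise to the first node whose position is at or after the key's hash."""
    idx = bisect.bisect(ring, (h(key),)) % len(ring)
    return ring[idx][1]

def handle_write(receiving_node, key, value):
    owner = owner_of(key)
    if owner == receiving_node:
        return f"{receiving_node} stores {key}={value}"
    return f"{receiving_node} forwards {key}={value} to {owner}"

print(handle_write("Server 1", "record-123", "John"))
```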

Gossip Rounds

Gossip Rounds form the backbone of the protocol. Periodically, each node selects random peers and shares its current state (e.g., heartbeat), progressively spreading updates throughout the cluster.

For example, at time 3, A gossips its state. B receives it and relays it to C.

[Diagram: before 00:03 every node holds heartbeat 1 for its peers; after A gossips at 00:03, the receiving nodes update A's entry to heartbeat 3 (new)]

Later, B also gossips its state to random nodes.

[Diagram: B gossips at 00:03; A and C update B's entry to heartbeat 3 (new)]
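
During a round, merging two views reduces to keeping, for each node, whichever entry carries the higher heartbeat. A sketch under that assumption:

```python
def merge_views(local, received):
    """Adopt any entry whose heartbeat is newer than the one we already hold."""
    for node, entry in received.items():
        if node not in local or entry["heartbeat"] > local[node]["heartbeat"]:
            local[node] = entry

# B's view before the round, and the digest A gossips at 00:03
b_view = {"A": {"state": "UP", "heartbeat": 1}, "C": {"state": "UP", "heartbeat": 1}}
a_digest = {"A": {"state": "UP", "heartbeat": 3}}

merge_views(b_view, a_digest)
print(b_view["A"])  # {'state': 'UP', 'heartbeat': 3}; B relays this newer entry onward
```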

Failure Detection

One key advantage of Gossip Rounds is fault detection. If a node doesn’t receive a heartbeat from another node within a defined time window, it marks that peer as DOWN.

If the heartbeat timeout is 3 seconds, by time 5, both A and B consider C down due to the absence of updates.

[Diagram: with HeartbeatLifetime = 3 and the clock at 5, A and B mark C as DOWN because its last heartbeat (1) has expired]
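
Failure detection is then a periodic sweep over the membership table that compares each peer's last heartbeat against the timeout. A sketch using the 3-second lifetime from the example (field names are illustrative):

```python
HEARTBEAT_LIFETIME = 3  # seconds, as in the example above

def detect_failures(view, now):
    """Mark any peer whose last heartbeat is older than the allowed lifetime as DOWN."""
    for node, entry in view.items():
        if now - entry["heartbeat"] > HEARTBEAT_LIFETIME:
            entry["state"] = "DOWN"

a_view = {"B": {"state": "UP", "heartbeat": 3}, "C": {"state": "UP", "heartbeat": 1}}
detect_failures(a_view, now=5)
print(a_view["C"]["state"])  # DOWN: no update from C within the window
```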

Temporary Promotion

When a node is deemed unreachable, the other nodes stop forwarding data to it and switch to a replica. Once the original node recovers, it synchronizes with the replica to restore the data it missed.

[Diagram: on failure, Server A stops forwarding to Server C (primary) and fails over to Server B (replica); on recovery, Server C pulls data from Server B and traffic returns to the primary]
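
A rough sketch of this failover routing, with a hypothetical replica map; real systems would derive the replica from the shared shard metadata:

```python
def route(primary, view, replicas):
    """Send traffic to the primary unless it is marked DOWN, then use its replica."""
    if view[primary]["state"] == "UP":
        return primary
    return replicas[primary]  # temporary promotion while the primary is unavailable

view = {"Server C": {"state": "DOWN"}, "Server B": {"state": "UP"}}
replicas = {"Server C": "Server B"}
print(route("Server C", view, replicas))  # Server B handles traffic until C recovers
```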

Data Conflicts

This architecture emphasizes Availability over Consistency (AP). During network partitions, replicas can be promoted to serve writes temporarily. However, once partitions heal, conflict resolution becomes necessary.

Two effective approaches to conflict resolution are:

Last Write Wins (LWW)

This method uses timestamps to resolve conflicts. Each record tracks the latest update time, and the most recent version wins during a merge.

[Diagram: Server 1 holds Name John (Updated 00:01), Server 2 holds Name Mike (Updated 00:03); Server 2 wins, so both converge on Mike]

This strategy relies on clock synchronization. If clocks are skewed, incorrect records might be selected.
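Last Write Wins reduces to comparing update timestamps during a merge. A sketch (the HH:MM strings stand in for real timestamps):

```python
def lww_merge(local, remote):
    """Keep whichever version carries the later update timestamp."""
    return remote if remote["updated"] > local["updated"] else local

server1 = {"id": 123, "name": "John", "updated": "00:01"}
server2 = {"id": 123, "name": "Mike", "updated": "00:03"}

print(lww_merge(server1, server2)["name"])  # Mike: Server 2 wins
```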

Vector Clocks

A more robust alternative, Vector Clocks, avoids dependency on synchronized clocks. Instead, each record tracks a version vector in the format [(Server, Version)].

For example, a record initially receives its version number from its owner shard. Other servers then replicate the record from the owner, preserving this version for consistency.

[Diagram: Servers 1, 2, and 3 each hold the record Id 123, Name John with vector clock (Server 1: 3)]

Then a network partition occurs, splitting the cluster into three partitions. Each group may update the record independently, expanding the vector clock:

  • Server 1 updates the record and increments its own version.
  • Server 3 updates the record and adds its own entry.
[Diagram: under the partition, Client 1 sets Name to John Doe on Server 1, raising its clock to (Server 1: 4); Client 2 sets Name to John Wick on Server 3, producing (Server 1: 3, Server 3: 1); Server 2 still holds (Server 1: 3)]

A conflict is identified when vectors cannot be merged deterministically:

  • [(Server 1, 4)] is a clear successor of [(Server 1, 3)].
  • But [(Server 1, 4)] and [(Server 1, 3), (Server 3, 1)] conflict, since neither descends from the other (see the sketch below).
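
The dominance check follows directly from the definition: one clock descends from another only if every counter is greater than or equal. A sketch:

```python
def descends(a, b):
    """True if vector clock `a` is a successor of (or equal to) `b`."""
    return all(a.get(server, 0) >= version for server, version in b.items())

def conflicts(a, b):
    """Neither clock descends from the other: the updates were concurrent."""
    return not descends(a, b) and not descends(b, a)

v1 = {"Server 1": 4}
v2 = {"Server 1": 3, "Server 3": 1}
print(descends(v1, {"Server 1": 3}))  # True: a clear successor
print(conflicts(v1, v2))              # True: both versions must be kept
```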

The system maintains both versions:

[Diagram: after the partition heals, Server 1's version (Server 1: 4) overrides Server 2's stale copy but conflicts with Server 3's version (Server 1: 3, Server 3: 1), so both versions are kept]

Application-Level Resolution

The application is responsible for resolving conflicts. It receives all conflicting versions and decides how to merge them.

For instance, it may choose to accept the first version:

[Diagram: the database returns both conflicting versions to the application, which accepts the first one (Name John Doe, vector clock Server 1: 4)]
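
A sketch of this read-time resolution, assuming the database hands back every conflicting sibling and the application picks one:

```python
def read_resolved(siblings):
    """Resolve conflicts in the application; here we naively accept the first version."""
    resolved = siblings[0]  # business logic could instead merge fields or ask the user
    # The chosen value would then be written back under a merged vector clock.
    return resolved

siblings = [
    {"id": 123, "name": "John Doe",  "vc": {"Server 1": 4}},
    {"id": 123, "name": "John Wick", "vc": {"Server 1": 3, "Server 3": 1}},
]
print(read_resolved(siblings)["name"])  # John Doe
```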

Since the database lacks context about the business logic, deferring resolution to the application ensures safer and more flexible conflict handling. Thus, applications should include additional metadata in their records to assist in this process.
