Delivery Semantics

Numerous challenges arise when committing changes to both the Event Streaming Platform and other data sources simultaneously. The two steps are independent, and a failure between them can leave the systems inconsistent.

For example, a consumer might successfully process an event and apply changes to another data store but crash before committing the event offset. Upon recovery, the consumer may reprocess the same event unexpectedly.

1. The consumer pulls an event from the partition.
2. It makes and commits changes to the external store.
3. It crashes before committing the offset.
4. After recovery, it pulls and processes the same event again.

Delivery semantics define the guarantees provided for event delivery during production and consumption. There are three main types, each offering different trade-offs between latency, durability and reliability.

At-most-once Delivery

This delivery model ensures that an event is delivered zero or one time.

Producer

The producer sends events with ACK=0 to achieve the lowest latency. Even if a request fails, it won’t be retried to avoid duplication.

For example, if the producer doesn’t receive a response from the broker due to a network error and retries the operation, duplicated events may occur.

1. The producer produces an event.
2. The partition stores it and responds, but the response never reaches the producer.
3. The producer times out.
4. It retries, producing the event again (a duplicate).

To prevent this, retries are disabled, even in failure scenarios.

1. The producer produces an event.
2. The partition responds, but the response never reaches the producer.
3. The producer continues without retrying.
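The fire-and-forget behavior can be sketched in plain Python; the `Broker` class and the `request_lost` flag are illustrative stand-ins for a real client and an unreliable network, not a platform API:

```python
class Broker:
    """Illustrative in-memory broker holding an append-only log."""
    def __init__(self):
        self.log = []

def unreliable_send(broker, event, request_lost):
    """The request itself may be dropped before the broker ever sees it."""
    if request_lost:
        return  # with ACK=0 the producer never even learns it failed
    broker.log.append(event)

def produce_at_most_once(broker, event, request_lost=False):
    # Fire-and-forget: one send, no waiting for an ack, no retry.
    unreliable_send(broker, event, request_lost)

b = Broker()
produce_at_most_once(b, "cpu-metric", request_lost=False)
produce_at_most_once(b, "mem-metric", request_lost=True)  # silently lost
assert b.log == ["cpu-metric"]  # each event appears at most once
```

Since there is no retry path, the log can never contain a duplicate; the cost is that a lost request is lost for good.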

Consumer

The consumer commits the offset before handling the event. This guarantees the event is never processed more than once, even if the consumer crashes mid-processing.

For instance, if the consumer instead committed only after processing and crashed before the commit completed, the event would be reprocessed after recovery.

1. The consumer consumes an event.
2. It handles the event.
3. It crashes and fails to commit the offset.
4. After recovery, it consumes the same event again.

Therefore, events must be committed before they are processed. As a result, if the consumer fails to process an event, it never consumes that event again; the event is simply lost.

1. The consumer consumes an event.
2. It commits the offset immediately.
3. It then handles the event.
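The commit-first loop can be sketched in plain Python; the in-memory `Partition` class and `handle` callback are illustrative stand-ins for a real consumer API, not part of any actual platform:

```python
class Partition:
    """Illustrative in-memory partition: an event log plus a committed offset."""
    def __init__(self, events):
        self.events = events
        self.committed_offset = 0  # next offset to consume

def consume_at_most_once(partition, handle):
    while partition.committed_offset < len(partition.events):
        offset = partition.committed_offset
        event = partition.events[offset]
        partition.committed_offset = offset + 1  # commit BEFORE processing
        try:
            handle(event)
        except Exception:
            pass  # event is lost: the offset already advanced, no redelivery

seen = []
p = Partition(["a", "b", "c"])

def handle(event):
    if event == "b":
        raise RuntimeError("simulated crash while handling")
    seen.append(event)

consume_at_most_once(p, handle)
assert seen == ["a", "c"]  # "b" was lost and never reprocessed
```

Because the offset is advanced before `handle` runs, a failure during processing loses the event instead of redelivering it, which is exactly the at-most-once trade-off.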

This model guarantees delivery at most once and offers the lowest latency. It is suitable for scenarios where data loss is acceptable, such as metrics collection.

At-least-once Delivery

This model guarantees that every event is delivered at least once, possibly more.

Producer

The producer uses ACK=1 or ACK=ALL and enables retries on failure to ensure the event is persisted. However, enabling retries can lead to duplicate events.

1. The producer produces an event.
2. The partition stores it and responds, but the response never reaches the producer.
3. The producer times out.
4. It retries, producing the event again (duplicated).
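The duplicate-on-retry scenario can be simulated with an illustrative in-memory `Broker` (not a real client API); here the first acknowledgment is deliberately dropped:

```python
class Broker:
    """Illustrative broker: persists every request, but the first ack is lost."""
    def __init__(self):
        self.log = []
        self.ack_lost = True  # simulate the first ack being dropped

    def receive(self, event):
        self.log.append(event)  # the broker persists the event either way
        if self.ack_lost:
            self.ack_lost = False
            return None  # ack never reaches the producer
        return "ack"

def produce_at_least_once(broker, event, max_retries=3):
    for _ in range(max_retries + 1):
        if broker.receive(event) == "ack":
            return
    raise RuntimeError("delivery failed after retries")

b = Broker()
produce_at_least_once(b, "order-created")
assert b.log == ["order-created", "order-created"]  # duplicate from the retry
```

The event is durable (it is always in the log), but the lost acknowledgment makes the producer resend it, so the log ends up with two copies.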

Consumer

The consumer commits the offset after processing the event.

1. The consumer consumes an event.
2. It processes the event.
3. It commits the offset.

If it crashes before committing, the event may be reprocessed.

1. The consumer consumes an event.
2. It processes the event.
3. It crashes before committing the offset.
4. After recovery, it consumes the same event again.
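A sketch of the process-then-commit loop, again with a hypothetical in-memory `Partition`; the crash is simulated by raising just before the commit:

```python
class Partition:
    """Illustrative in-memory partition with a committed offset."""
    def __init__(self, events):
        self.events = events
        self.committed_offset = 0

def consume_at_least_once(partition, handle, crash_before_commit_at=None):
    while partition.committed_offset < len(partition.events):
        offset = partition.committed_offset
        handle(partition.events[offset])        # process FIRST
        if offset == crash_before_commit_at:
            raise RuntimeError("simulated crash before commit")
        partition.committed_offset = offset + 1  # commit AFTER processing

processed = []
p = Partition(["a", "b"])
try:
    consume_at_least_once(p, processed.append, crash_before_commit_at=1)
except RuntimeError:
    pass  # the consumer "recovers" and resumes from the last committed offset
consume_at_least_once(p, processed.append)
assert processed == ["a", "b", "b"]  # "b" was processed twice
```

No event is ever lost, but any event whose processing finished before the crash is redelivered, hence at-least-once.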

This method provides stronger durability guarantees but may lead to duplicates and higher latency compared to at-most-once delivery. It is well-suited for scenarios where duplicate data is acceptable or can be handled, such as in user activity tracking.

Exactly-once Delivery

This is the most reliable but also the most complex delivery model, ensuring each event is delivered exactly once. Event Streaming Platform solutions don’t fully support this out of the box, so additional techniques are required.

Exactly-once Producer

The producer functions similarly to the at-least-once model, allowing retries on failures and using ACK=ALL for durability.

To prevent duplication, it uses idempotency keys. Each producer is assigned a PID (producer ID) and a seq (sequence number), which it increments locally after receiving an acknowledgment.

Example:

1. Producer P1 registers with the partition and is assigned PID = P1.
2. P1 produces an event with seq = 0.
3. The partition records PID = P1, seq = 0 and responds.
4. P1 increments its local sequence number (seq = 0 -> 1), and the partition's expected sequence for P1 advances to 1 as well.

The partition ignores any events with outdated sequence numbers, effectively preventing duplicates.

For example, if P1 sends an event but fails to receive an acknowledgment, it will resend the event. Because the event’s sequence number is outdated, the partition ignores it, avoiding duplication.

1. P1 produces an event with seq = 1.
2. The partition stores it, advancing its expected sequence for P1 from 1 to 2, but the acknowledgment is lost.
3. P1 times out and retries the event with seq = 1.
4. The partition ignores the retry, because its expected sequence (2) is ahead of the event's sequence (1).
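The partition-side deduplication can be sketched as follows; the `Partition` class and its `next_seq` table are illustrative, modeled on the PID/sequence scheme described above:

```python
class Partition:
    """Illustrative partition that deduplicates by (producer ID, sequence)."""
    def __init__(self):
        self.log = []
        self.next_seq = {}  # expected next sequence number per producer ID

    def receive(self, pid, seq, event):
        expected = self.next_seq.get(pid, 0)
        if seq < expected:
            return "ack"  # duplicate: already stored, just re-acknowledge
        self.log.append(event)
        self.next_seq[pid] = seq + 1
        return "ack"

p = Partition()
p.receive("P1", 0, "e0")
p.receive("P1", 1, "e1")  # suppose this ack is lost in transit...
p.receive("P1", 1, "e1")  # ...so the producer retries with the same seq
assert p.log == ["e0", "e1"]  # the retry was recognized and ignored
```

Re-acknowledging (rather than erroring) on a stale sequence number lets the producer's retry complete normally while keeping the log duplicate-free.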

Exactly-once Consumer

Event Streaming Platform cannot tell whether a consumer has processed an event or not. To achieve exactly-once semantics, we must introduce one of the following approaches:

Consume–Process–Produce Pipeline

If the event is simply transformed into another event (with no external side effects), Event Streaming Platform can manage this flow.

1. The consumer consumes an event from the partition.
2. It transforms the event into a new event and produces it back to the platform.

Transactional Commit

To support this, the platform implements Transactional Commit, ensuring that consumers only see committed events. If a failure occurs during processing, the transaction is aborted and all uncommitted changes are discarded.

1. Begin a transaction.
2. Consume an event (uncommitted).
3. Handle the event.
4. Produce a new event (uncommitted).
5. Commit the transaction.

This guarantees the transformation process won’t produce duplicates. However, it only applies to stateless processing with no external side effects.
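A toy model of this flow, where produced events stay invisible until commit and an abort discards them; the `TransactionalLog` class is illustrative, and a real implementation also folds the consumed offset into the same transaction:

```python
class TransactionalLog:
    """Illustrative output log: pending events become visible only on commit."""
    def __init__(self):
        self.committed = []
        self._pending = []

    def begin(self):
        self._pending = []

    def produce(self, event):
        self._pending.append(event)  # invisible to consumers until commit

    def commit(self):
        self.committed.extend(self._pending)
        self._pending = []

    def abort(self):
        self._pending = []  # discard all uncommitted events

def transform(log, event):
    log.begin()
    try:
        log.produce(event.upper())  # the transformed "new event"
        if event == "bad":
            raise RuntimeError("simulated processing failure")
        log.commit()
    except RuntimeError:
        log.abort()

out = TransactionalLog()
transform(out, "a")
transform(out, "bad")  # aborted: "BAD" never becomes visible
transform(out, "b")
assert out.committed == ["A", "B"]
```

A downstream consumer reading only `committed` never observes output from the failed attempt, which is what makes the transformation duplicate-free.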

Two-phase Commit

To synchronize multiple systems, we can use Two-phase Commit:

1. The consumer sends PREPARE to both the partition and the external store.
2. It consumes the event and makes changes.
3. Both participants reply VOTE YES.
4. The consumer sends COMMIT to both.
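The protocol can be sketched with the consumer acting as coordinator over two illustrative participants (the `Participant` class and its vote strings are invented for this example):

```python
class Participant:
    """Illustrative 2PC participant (e.g. the partition or the external store)."""
    def __init__(self, name):
        self.name = name
        self.prepared = False
        self.committed = False

    def prepare(self):
        self.prepared = True
        return "VOTE_YES"  # a real participant may vote no, or time out

    def commit(self):
        self.committed = True

    def rollback(self):
        self.prepared = False

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare, collect votes.
    votes = [p.prepare() for p in participants]
    if all(v == "VOTE_YES" for v in votes):
        # Phase 2: unanimous yes -> commit everywhere.
        for p in participants:
            p.commit()
        return True
    # Any other outcome -> roll everyone back.
    for p in participants:
        p.rollback()
    return False

partition, store = Participant("partition"), Participant("external-store")
assert two_phase_commit([partition, store])
assert partition.committed and store.committed
```

The sketch omits the hard part: if the coordinator crashes between the two phases, prepared participants are left blocked, which is precisely the risk noted below.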

However, Two-phase Commit introduces the risk of infinite blocking and is not always supported. This is discussed further in this topic.

Event Idempotency

This approach requires each event to carry a unique identifier. Consumers then check for this ID before processing.

For example, a consumer verifies whether an event has already been processed before handling it.

1. The consumer consumes "event-1" from the event streaming platform.
2. It handles the event and saves "event-1" to the external store.
3. The consumer crashes before the offset is committed.
4. After recovery, it consumes "event-1" again.
5. It checks the store and finds "event-1" already recorded.
6. It commits the offset without reprocessing the event.
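A minimal sketch of the idempotency check, with a hypothetical `ExternalStore` that records processed event IDs; in practice the ID check and the write should happen in a single store transaction:

```python
class ExternalStore:
    """Illustrative store that remembers which event IDs it has applied."""
    def __init__(self):
        self.processed_ids = set()
        self.data = []

    def apply(self, event_id, payload):
        # In a real system this check-and-write must be one atomic transaction.
        if event_id in self.processed_ids:
            return False  # already handled: skip, just commit the offset
        self.data.append(payload)
        self.processed_ids.add(event_id)
        return True

store = ExternalStore()
store.apply("event-1", "credit $10")
# The consumer crashes before committing the offset, so the event is redelivered:
store.apply("event-1", "credit $10")
assert store.data == ["credit $10"]  # the change was applied exactly once
```

Delivery is still at-least-once on the wire; it is the store-side ID check that upgrades the end-to-end effect to exactly-once.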

This approach is usually preferred over Two-phase Commit due to its simplicity and better fault tolerance. It is commonly used in combination with the Saga pattern to manage long-running, distributed operations.

However, this method does not provide strong consistency across the system; there can be consistency drift between the streaming platform and the external data store. We’ll explore this limitation in more detail in the Distributed Transaction topic.

Ultimately, exactly-once delivery requires additional mechanisms and is best suited for mission-critical systems where both data loss and duplication are unacceptable (e.g., banking platforms).
