Compensation Protocols

This section delves into simpler protocols that operate abstractly, without relying on low-level concepts.

In essence, Compensation Protocols enable transactions to commit data in a reversible manner. If an issue arises later, these transactions can roll back the system by compensating for the previously committed changes.

Designing these protocols is challenging because they don’t inherently meet the Isolation requirement. The Commit and Compensate phases often behave as if they are parts of separate transactions. If the design is flawed, other operations might update data before the Compensate phase. This can lead to the use of skewed data and the generation of inconsistencies. Furthermore, this separation makes it impossible to provide Strong Consistency, a strict requirement in some critical systems.

Try-Confirm/Cancel (TCC)

Try-Confirm/Cancel (TCC) is an approach similar to Two-Phase Commit (2PC) but operates without locking mechanisms. It also involves two phases coordinated by a central entity:

Try: The coordinator instructs participants to perform tentative actions, such as reserving resources. It’s important to note that data is actually committed during this phase, not merely marked as dirty data.
Confirm or Cancel:
- Confirm: If all participants successfully complete their Try actions, the coordinator requests them to confirm, thereby finalizing the rest of the operations.
- Cancel: If any participant fails to prepare, other participants will revert their changes.

Consider a transaction transferring money between accounts located on different servers:

Try Phase

During the Try phase, the coordinator initiates the transaction by sending Try requests to all participating services (in this case, both banks).

Account A’s balance is decreased by the transaction amount.
The system verifies if Account B is valid and can receive funds.

Confirm Phase

If all participants successfully complete their Try operations. It sends Confirm requests to all participants to finalize the transaction.

In this example, Account B will now definitively increase its balance by the amount debited from Account A. The transaction is considered successful and complete.

Cancel Phase

For instance, if Account B is blocked by the bank, Server B would respond with No. The coordinator then instructs Server A to compensate for the change.

You might wonder why Account A’s balance is updated in the Try phase, but Account B’s is not. The Try phase must not introduce harmful effects to the system. It would be a poor design to increase Account B’s balance initially, as this temporary increment shouldn’t be usable until it’s validated.

Similar to 2PC, the primary advantage of TCC is its support for parallel processing, allowing steps to be performed simultaneously. However, TCC creates tight couplings between services because they need to expose low-level Try-Confirm-Cancel interfaces. This necessitates a deep understanding of the participating services.

Saga Pattern

Saga can be considered the most straightforward solution for distributed transactions. In a saga, transactions are broken down into a series of steps executed sequentially. If any step fails, the process moves backward to compensate for the preceding successful steps.

For instance, imagine an e-commerce system with three services. When a user places an order with the Order Service:

The Order Service creates the order record.
The Stock Service reduces the quantity of the ordered items.
The Payment Service completes the payment.

There are two primary ways to implement this request using Saga:

Orchestration Saga

In this approach, a central coordinator is established, much like in TCC or 2PC. The coordinator executes the actions in sequence. If any action fails, it orchestrates the backward compensation of previous actions.

When everything proceeds smoothly, the process might look like this:

If, for example, the payment step fails, other services will be instructed to compensate for the previous steps:

This seems straightforward, as it mirrors real-life sequential processes without partial operations or locking.

Again, why is the payment processed last? Payment features are often implemented via third-party solutions. Executing the payment earlier would make compensation more difficult. Therefore, internal tasks are typically completed first.

However, this sequential nature means the Saga pattern, does not inherently support parallelism. A transaction’s operations must be performed in order, even if some could be productively parallelized.

Choreography Saga

Choreography Saga implements the pattern asynchronously. A transaction is completed through the collaboration of relevant events exchanged along the way.

Continuing with the e-commerce example, services communicate implicitly through an event stream:

In case of a failure, compensation events are generated. For example, if the payment step fails, it triggers a chain of compensation events for the preceding steps:

Orchestration requires a central coordinator and direct communication, which can degrade availability and introduce a Single Point of Failure . However, it conveniently centralizes the transaction logic within the coordinator. In contrast, Choreography distributes the transaction logic across different services, potentially making the source code difficult to understand and maintain.

Although Choreography can enhance availability and decouple the system, its complexity can be overwhelming and may outweigh the benefits. Moreover, Saga is inherently slower due to its sequential nature, and the Choreography approach can exacerbate this with its indirect communication paradigm.

Saga Modelling

As stated earlier, Saga is a compensation protocol and does not offer the Isolation property. Before compensation occurs, dirty data can create vulnerabilities in the system. Therefore, the most critical aspect of Saga is defining a safe and reliable workflow.

First and foremost, it’s crucial to adopt the mindset that any step can fail occasionally. When a step fails, we must ensure it doesn’t jeopardize the system.
Second, not every step in a workflow can be compensated, especially when external factors are involved. Imagine a step that transfers money to a user’s bank account. Reverting this step would require retrieving the amount from the bank, which is nearly impossible as a bank typically won’t permit withdrawals without user consent.

Hence, in a saga, actions should be categorized into three groups:

Compensable Action: Can be rolled back using a corresponding compensation action if necessary. These are usually internal workloads that can readily invalidate previous actions.
Pivot Action: Represents a point of no return in a workflow. Once it succeeds, all subsequent actions must be completed, and no compensation can be applied. It’s typically the last Compensable action in a sequence.
Retryable Action: Can be safely retried multiple times without causing inconsistencies. This action is irreversible and is expected to be retried until successful.

Essentially, Saga workflows are designed as follows. After the pivot step is completed, compensation is no longer an option.

Back to the e-commerce example, we’ll sort the transactions as follows

The Order Service makes an order CreateOrder().
The Stock Service reserves the number of ordered items ReserveItem().
The Payment Service processes the payment ProcessPayment().
Finally, the Delivery Service creates a request CreateDeliveryRequest().

The first two steps are internal processes, they’re safe to compensate. Payment and delivery requirements usually depend on third-party solutions, it may be not feasible to revert them. After the payment step is finished, we have to guarantee the completion of the final step.

However, this workflow varies based on the system. If the delivery task is a part of the system, we may place it before the payment step.

Saga Serialization Anomalies

Normally, multiple sagas can be executed concurrently and modify the same data. That easily leads to serialization anomalies, like what we’ve discussed in the Concurrency Control topic.

We’ll briefly consider some common anomalies and how to resolve them.

Dirty Read

The first anomaly is Dirty Read. Similar to a rollback, the compensation phase can cause data to become dirty for transactions that read it before compensation.

Let’s consider an example: When a user places a large order, a discount voucher is issued to them.

Unfortunately, the final step (payment) fails. The system needs to revert the second step by deleting the voucher. However, before the deletion occurs, another saga might use that voucher.

The result is unexpected because an invalid voucher was used. There are several ways to resolve this:

Rearrange the flow: Move the voucher creation to the Retryable phase. Although it’s an internal and compensable workload, its potential for causing issues makes it safer in the Retryable phase.
This solution is known as the Pessimistic View, which assumes that actions prone to causing harm are likely to be compensated and should thus be deferred to a later point.
Lock data: Employ a flag field that acts as a lock. For example, when a new voucher is created, its state is set to Pending, marking it as unavailable for use. When the saga that created the voucher completes successfully, it changes the state to Approved, allowing usage.
This solution is called Semantic Locking, where application-level locking is implemented to prevent anomalies.

The second approach may be a bit overhead for this example. Let’s move to the next anomaly to see its real power.

Lost Update

The next anomaly is Lost Update. This is a specific case of Write Skew, where updates are unexpectedly overwritten by other concurrent operations.

For example, the OrderService creates an order, which is marked Approved at the end of its saga. However, in the interim, a CancelOrder Saga executes and sets the order’s state to Cancelled. Consequently, the CreateOrder Saga might overwrite the change from the CancelOrder saga, resulting in a cancelled order still being processed by the system.

Semantic Locking is an effective approach for preventing Lost Update. As before, orders are locked with a Pending state until they are successfully created.

Semantic Locking is nothing but a distributed lock, reducing parallelism and negatively affecting the performance. In the first place, we should design a reliable flow and limit locking instead.

Saga Transaction Recovery

Transaction State

Both the coordinator and participating services can crash at any time. Therefore, they must persist the state of transactions to enable retries if necessary.

Making transactions idempotent (where each transaction is marked with a unique identifier) is an effective way to prevent duplications and aid in transaction recovery upon failure.

Coordinator Recovery

The coordinator service may periodically scan its state store to complete any unfinished transactions. Continuing with the previous example, the Order Service keeps track of the current transaction step:

Order Service (coordinator):
  transaction-1234:
    Step: Payment Service
    State: Processing

For added safety, the transaction state is also duplicated in the participating services:

Order Service (coordinator):
  transaction-1234:
    Step: Payment Service
    State: Processing
Stock Service:
  transaction-1234: Complete
Payment Service:
  transaction-1234: Complete

For example, if the Order Service (coordinator) fails before receiving a response from the Payment Service, this can lead to mismatched states (Processing and Complete) across different services.

After restarting, the Order Service will observe that the transaction is pending and will attempt to call the Payment Service again to complete it. Fortunately, duplicating the state ensures that the Payment Service will not reprocess the transaction.

Choreographer Recovery

Consume-process-produce Pipeline

This pattern was discussed in the Event Streaming section. The Choreography Saga helps to avoid duplication without relying on idempotency in this specific pipeline.

The Consume-Process-Produce Pipeline is applicable when a service’s changes are confined to the event stream and do not affect other datasets.

Transactional Outbox Pattern

Another challenge with Choreography Saga is ensuring that database changes are effectively published as events. For example, when a transaction is executed in an internal data store, we want to publish an associated event indicating its completion.

The Transactional Outbox Pattern addresses this by storing external calls (event publications) as part of the internal database transaction.

For example, the transaction that creates an order also creates an associated record in the Outbox table, instead of immediately firing the event.

A separate process (called a Relay Process) periodically scans the Outbox table and publishes any unprocessed events.

If the relay process crashes before updating the outbox record in the third step, it might publish the event twice. Fortunately, consumers can use an idempotency key (e.g., OrderId) to check for and ignore duplicated events.

Last updated on June 26, 2025

Blocking Protocols Caching Patterns