Saga pattern: distributed transactions

Q: What is the saga pattern?

The saga pattern is a sequence of local transactions, each within a single service, that together achieve a business operation across multiple services. Unlike a distributed ACID transaction (which uses two-phase commit and holds locks across all services during the operation), a saga breaks the operation into steps where each step commits its local transaction and publishes an event. If a step fails, compensating transactions undo the work of previous steps. The saga pattern provides eventual consistency rather than atomicity — there is a window where the system is in an intermediate state. This is acceptable for most business operations (an e-commerce order can briefly show 'payment pending' before inventory is reserved) but not for operations requiring true atomicity.

Q: What is the difference between choreography and orchestration sagas?

Choreography saga: each service knows what to do when it receives an event. Service A completes its step and publishes an event; Service B listens for that event and starts its step; Service B publishes another event; Service C listens and continues. There is no central coordinator; the flow emerges from each service's reaction to events. Advantage: services stay decoupled and each one is simple. Disadvantage: the overall saga flow is not visible in any one place, so understanding it requires reading every service's event handlers, and progress is hard to monitor. Orchestration saga: a central orchestrator (a saga coordinator service) sends commands to each service in sequence and receives responses. The orchestrator knows the full state machine and handles failures centrally. Advantage: the full flow is visible and monitorable in one place. Disadvantage: the orchestrator is a central coordination point that all services depend on.

Q: What are compensating transactions?

A compensating transaction is the undo operation for a saga step that has already committed. Example: the steps of an order saga are (1) reserve inventory, (2) charge payment, (3) create shipment. If step 3 fails, the saga must run compensating transactions for steps 1 and 2: release the inventory reservation and refund the payment. Compensating transactions must be idempotent: if they are retried after a failure, running them twice must produce the same result as running them once. Not all operations are compensable: sending an email or notifying a third party cannot be undone. For non-compensable steps, design the saga to execute them last, after all compensable steps have committed. This minimizes the cases where a non-compensable action occurs but the overall saga fails.

Q: What is the outbox pattern and why is it needed for sagas?

The outbox pattern solves the 'dual write problem': a service needs to both update its database and publish an event to a message queue. If the database update succeeds but the event publish fails (or vice versa), the saga gets stuck. The outbox pattern: instead of publishing directly to the queue, the service writes the event to an 'outbox' table in the same database transaction as the data update. A separate relay process reads from the outbox table and publishes to the message queue, retrying until success. The outbox event is then deleted or marked as processed. This guarantees that the database update and the event are always eventually in sync — they're in the same atomic transaction. The relay process provides at-least-once delivery; consumers must be idempotent to handle duplicate events.

When an operation spans multiple microservices, you cannot use a single database transaction. Two-phase commit (the distributed transaction protocol) makes every participating service hold locks from the prepare phase until the coordinator announces commit or abort; at scale, this creates contention and cascading timeouts. The saga pattern is the common alternative: break the operation into local transactions per service, with compensating transactions for rollback. This guide covers the mechanics, the two implementation approaches, and the failure modes that interviewers probe.

Why distributed transactions are hard

In a monolith, a multi-step operation (reserve inventory + charge payment + create shipment) executes in a single database transaction. If any step fails, the transaction rolls back atomically. No partial state, no compensating logic needed.
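As a point of comparison, here is a minimal sketch of the monolith case using Python's built-in sqlite3 module with an in-memory database. The table names, the place_order function, and the sample data are hypothetical, chosen only to mirror the three steps above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (sku TEXT PRIMARY KEY, available INTEGER NOT NULL);
    CREATE TABLE payments  (order_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL);
    CREATE TABLE shipments (order_id TEXT PRIMARY KEY, address TEXT NOT NULL);
    INSERT INTO inventory VALUES ('widget', 5);
""")


def place_order(order_id, sku, amount_cents, address):
    """Run all three steps in one transaction (hypothetical schema)."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE inventory SET available = available - 1 WHERE sku = ?", (sku,)
            )
            conn.execute("INSERT INTO payments VALUES (?, ?)", (order_id, amount_cents))
            conn.execute("INSERT INTO shipments VALUES (?, ?)", (order_id, address))
        return True
    except sqlite3.IntegrityError:
        return False


ok = place_order("o-1", "widget", 1999, "1 Main St")
# A NULL address violates NOT NULL, so the third step fails and
# the inventory update and payment insert for o-2 are rolled back too.
failed = place_order("o-2", "widget", 1999, None)
stock = conn.execute(
    "SELECT available FROM inventory WHERE sku = 'widget'"
).fetchone()[0]
payment_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

After the failed second order, the stock count and the payments table reflect only the first order. That is the guarantee a saga has to rebuild by hand.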

In a microservices architecture with three services (inventory-service, payment-service, fulfillment-service), each service has its own database. A "transaction" that spans all three would require two-phase commit (2PC): a coordinator asks all services to "prepare" (lock their data), then tells them to "commit" or "abort." Problems with 2PC: once prepared, all three services hold locks on the relevant rows until the coordinator decides, so a slow or failed service keeps everyone's locks held until the coordinator detects the failure and aborts. At scale, this creates latency cascades. The coordinator is also a single point of failure: if it crashes after participants have prepared, they stay blocked with locks held until it recovers. Most microservices architectures avoid 2PC entirely and use sagas instead.
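The prepare/commit flow can be sketched in a few lines. This is a toy in-process model, not a real protocol implementation: the Participant class, its can_prepare flag, and the two_phase_commit function are all hypothetical, and there is no networking, timeout handling, or coordinator recovery.

```python
class Participant:
    """A hypothetical service taking part in two-phase commit."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.locked = False
        self.committed = False

    def prepare(self):
        # Voting "yes" means holding locks until the coordinator decides.
        if self.can_prepare:
            self.locked = True
        return self.can_prepare

    def commit(self):
        self.committed = True
        self.locked = False

    def abort(self):
        self.locked = False


def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: commit everywhere or abort everywhere."""
    prepared = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            # One "no" vote aborts the whole transaction.
            for q in prepared:
                q.abort()
            return False
    for p in participants:
        p.commit()
    return True
```

Note that every participant that voted "yes" stays locked for the whole interval between its prepare call and the coordinator's decision, which is where the contention described above comes from.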

Choreography saga

In a choreography saga, each service listens for events and reacts by performing its local transaction and publishing the next event. There is no central coordinator.

Example: e-commerce order saga.
(1) Order service creates an order in PENDING state, publishes OrderCreated event.
(2) Inventory service receives OrderCreated, reserves stock, publishes InventoryReserved event (or InventoryReservationFailed).
(3) Payment service receives InventoryReserved, charges the customer, publishes PaymentCompleted or PaymentFailed.
(4) Order service receives PaymentCompleted, marks order as CONFIRMED.
If payment fails:
(4a) Order service receives PaymentFailed, publishes InventoryRelease command.
(4b) Inventory service releases the reservation.
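The order saga above can be sketched with a synchronous in-memory event bus standing in for a real message broker. The EventBus class and run_order_saga function are hypothetical, and the three services are modelled as plain functions in one process. Marking the order CANCELLED on failure is an assumption; the steps above only say the reservation is released.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical synchronous stand-in for a message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # event types in publish order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


def run_order_saga(stock, card_ok):
    bus = EventBus()
    orders = {}
    inventory = {"available": stock, "reserved": 0}

    # Order service
    def on_payment_completed(evt):
        orders[evt["order_id"]] = "CONFIRMED"

    def on_payment_failed(evt):
        orders[evt["order_id"]] = "CANCELLED"  # assumed terminal state
        bus.publish("InventoryRelease", evt)

    def on_reservation_failed(evt):
        orders[evt["order_id"]] = "CANCELLED"  # assumed terminal state

    # Inventory service
    def on_order_created(evt):
        if inventory["available"] > 0:
            inventory["available"] -= 1
            inventory["reserved"] += 1
            bus.publish("InventoryReserved", evt)
        else:
            bus.publish("InventoryReservationFailed", evt)

    def on_inventory_release(evt):
        inventory["reserved"] -= 1
        inventory["available"] += 1

    # Payment service
    def on_inventory_reserved(evt):
        bus.publish("PaymentCompleted" if card_ok else "PaymentFailed", evt)

    bus.subscribe("OrderCreated", on_order_created)
    bus.subscribe("InventoryReserved", on_inventory_reserved)
    bus.subscribe("InventoryReservationFailed", on_reservation_failed)
    bus.subscribe("PaymentCompleted", on_payment_completed)
    bus.subscribe("PaymentFailed", on_payment_failed)
    bus.subscribe("InventoryRelease", on_inventory_release)

    # Step (1): create the order and publish the first event.
    orders["o-1"] = "PENDING"
    bus.publish("OrderCreated", {"order_id": "o-1"})
    return orders["o-1"], inventory, bus.log
```

Calling run_order_saga(stock=1, card_ok=False) walks the failure path: the returned log shows the PaymentFailed event followed by InventoryRelease, and the stock count returns to its starting value. No single function in the sketch describes the whole flow.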

Choreography works well for simple, linear sagas. Its weakness: the overall flow is distributed across all services — there's no single place to monitor saga progress or see the complete state machine. Debugging a stalled saga requires checking logs across all services. Cyclic event dependencies can emerge as sagas grow more complex.

Orchestration saga

In an orchestration saga, a central saga orchestrator (a dedicated service or a stateful workflow engine) commands each step and tracks the saga state.

Example: the same order saga, orchestrated. The saga orchestrator holds the state machine: PENDING → INVENTORY_RESERVED → PAYMENT_PROCESSING → COMPLETED on the success path, or, on failure, a compensation path such as PAYMENT_REFUNDED → INVENTORY_RELEASED → CANCELLED, where compensations run in reverse order of the steps that committed. The orchestrator sends commands to services and receives responses. It knows exactly which step the saga is in and handles failures by executing compensating commands.
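A minimal orchestrator sketch follows. It is an in-process illustration only: the SagaOrchestrator class, the step functions, and the carrier_down flag are hypothetical, state lives in memory rather than in a durable store, and it simplifies the state machine to PENDING, COMPLETED, and CANCELLED plus a history list. Compensations run in reverse order of the completed steps, which is the usual convention.

```python
class SagaOrchestrator:
    """Runs steps in order; on failure, compensates completed steps in reverse."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, action, compensation)
        self.state = "PENDING"
        self.history = []

    def run(self, ctx):
        completed = []
        for name, action, compensation in self.steps:
            try:
                action(ctx)
            except Exception:
                self.history.append(f"{name}:FAILED")
                self._compensate(completed, ctx)
                self.state = "CANCELLED"
                return False
            completed.append((name, compensation))
            self.history.append(f"{name}:DONE")
        self.state = "COMPLETED"
        return True

    def _compensate(self, completed, ctx):
        for name, compensation in reversed(completed):
            compensation(ctx)
            self.history.append(f"{name}:COMPENSATED")


# Hypothetical step implementations; a real saga would send commands to services.
def reserve_inventory(ctx): ctx["reserved"] = True
def release_inventory(ctx): ctx["reserved"] = False
def charge_payment(ctx): ctx["charged"] = True
def refund_payment(ctx): ctx["charged"] = False
def cancel_shipment(ctx): ctx["shipped"] = False

def create_shipment(ctx):
    if ctx.get("carrier_down"):
        raise RuntimeError("carrier unavailable")
    ctx["shipped"] = True


ORDER_STEPS = [
    ("reserve_inventory", reserve_inventory, release_inventory),
    ("charge_payment", charge_payment, refund_payment),
    ("create_shipment", create_shipment, cancel_shipment),
]
```

Running SagaOrchestrator(ORDER_STEPS).run({"carrier_down": True}) fails at the shipment step, refunds the payment, then releases the inventory. The history list gives the single-place view of progress that choreography lacks.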

Orchestration advantages: the saga state is visible in one place (queryable in the orchestrator's database). Failure handling is centralized — the orchestrator decides what compensating transactions to run. The flow is explicit in the orchestrator's code. Monitoring: track all in-flight sagas by querying the orchestrator's state store.

Orchestration disadvantages: the orchestrator is a dependency for all services in the saga — its failure stalls all in-flight sagas. It can become a bottleneck if not scaled appropriately. Technologies: Temporal, Conductor (Netflix), AWS Step Functions, or a custom saga state machine stored in a database.

Compensating transactions

A compensating transaction is the undo for a step that has already committed locally. Key properties:

Idempotent. Compensating transactions must be idempotent — executing them multiple times produces the same result as executing them once. If the compensating transaction for "release inventory reservation" is called twice (due to a retry), it must not release more inventory than was originally reserved. Implementation: check whether the reservation is already in a released state before attempting to release; use a unique compensation transaction ID to deduplicate.
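Both implementation ideas (checking the current state and deduplicating on a compensation ID) can be combined in a short sketch. The InventoryService class, its method names, and the RESERVED/RELEASED status values are hypothetical, and the state is held in memory rather than in a database.

```python
class InventoryService:
    """Hypothetical in-memory inventory with an idempotent release operation."""

    def __init__(self, available):
        self.available = available
        self.reservations = {}  # reservation_id -> {"qty": int, "status": str}
        self.seen_compensations = set()

    def reserve(self, reservation_id, qty):
        if qty > self.available:
            raise ValueError("insufficient stock")
        self.available -= qty
        self.reservations[reservation_id] = {"qty": qty, "status": "RESERVED"}

    def release(self, reservation_id, compensation_id):
        """Return True only the first time the release actually takes effect."""
        # Guard 1: the same compensation message delivered again is a no-op.
        if compensation_id in self.seen_compensations:
            return False
        self.seen_compensations.add(compensation_id)
        # Guard 2: a different message for an already released reservation
        # is also a no-op, so stock is never returned twice.
        reservation = self.reservations.get(reservation_id)
        if reservation is None or reservation["status"] != "RESERVED":
            return False
        reservation["status"] = "RELEASED"
        self.available += reservation["qty"]
        return True
```

Either guard alone would cover the simple retry case. Keeping both also covers a second compensation request that arrives under a different ID.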

Not all operations are compensable. Sending an email, notifying a third party, or publishing to an external API cannot be undone. For non-compensable steps: execute them last in the saga, after all compensable steps have committed. This minimizes the window where a non-compensable action has occurred but the overall saga fails. If a non-compensable action has occurred and the saga fails after it, the only option is a manual correction process or a "reverse" notification to the user.
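One way to enforce the ordering rule is to check a saga definition before it runs. The sketch below is only an illustration: the Step dataclass and validate_step_order function are hypothetical names, not part of any saga framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    compensable: bool


def validate_step_order(steps):
    """Raise ValueError if a compensable step follows a non-compensable one."""
    first_non_compensable = None
    for step in steps:
        if not step.compensable:
            if first_non_compensable is None:
                first_non_compensable = step.name
        elif first_non_compensable is not None:
            raise ValueError(
                f"{step.name} runs after non-compensable step {first_non_compensable}"
            )
    return True


good_order = [
    Step("reserve_inventory", compensable=True),
    Step("charge_payment", compensable=True),
    Step("send_confirmation_email", compensable=False),
]
bad_order = [
    Step("send_confirmation_email", compensable=False),
    Step("charge_payment", compensable=True),
]
```

Here validate_step_order(good_order) passes, while validate_step_order(bad_order) raises because a payment failure could no longer be hidden from a customer who has already received the email.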

Semantic rollback, not physical rollback. A compensating transaction doesn't undo the database row modification — it creates a new transaction that logically reverses the effect. A reserved inventory row is updated from RESERVED to AVAILABLE, not deleted. This preserves the audit trail — you can see that the reservation occurred and was subsequently released.

The outbox pattern: solving dual writes

Every saga step has a dual write problem: the service must both update its database and publish an event. If the database update succeeds but the message publish fails (network error, Kafka down), the saga stalls — the next service never receives the event to proceed.

The outbox pattern solves this atomically. Instead of publishing directly to Kafka, the service writes the event to an outbox table in the same database transaction as the data update. A separate relay process (often called the Transactional Outbox Processor) reads unprocessed outbox rows and publishes them to Kafka, retrying until success. The event is then marked as processed (or deleted). Since the outbox write is part of the same database transaction as the data update, they're always in sync — either both succeed or both fail.
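The mechanism can be sketched with Python's built-in sqlite3 module. This is a simplified illustration: the orders and outbox schemas, the create_order and relay_once functions, and the publish callback (standing in for a Kafka producer) are all hypothetical, and a real relay would run continuously rather than one pass at a time.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        processed INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id):
    """Write the order row and its event in one local transaction."""
    with db:  # both inserts commit together or roll back together
        db.execute("INSERT INTO orders VALUES (?, 'PENDING')", (order_id,))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(publish):
    """Publish unprocessed outbox rows in order; return how many were sent."""
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE processed = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, event_type, payload in rows:
        try:
            publish(event_type, json.loads(payload))
        except ConnectionError:
            break  # leave the row unprocessed; a later pass retries it
        with db:
            db.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent


create_order("o-1")
published = []
sent = relay_once(lambda event_type, payload: published.append((event_type, payload)))
```

If the process crashed between the publish call and the UPDATE, the row would still be unprocessed and would be published again on the next pass.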

The relay provides at-least-once delivery (it might publish the same event twice if it crashes between publishing and marking processed). Consumers must be idempotent — processing the same event twice must be safe. This is a foundational requirement for saga participants.
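A common way to make a consumer idempotent is to record the IDs of events it has already handled. The sketch below assumes each event carries a unique event_id field. That field, the PaymentConsumer class, and the in-memory set are hypothetical; a real service would store the processed ID in the same local transaction as the side effect.

```python
class PaymentConsumer:
    """Hypothetical consumer that ignores events it has already processed."""

    def __init__(self):
        self.processed_event_ids = set()
        self.charges = []

    def handle(self, event):
        """Return True if the event was processed, False if it was a duplicate."""
        if event["event_id"] in self.processed_event_ids:
            return False
        self.charges.append((event["order_id"], event["amount_cents"]))
        self.processed_event_ids.add(event["event_id"])
        return True


consumer = PaymentConsumer()
event = {"event_id": "evt-1", "order_id": "o-1", "amount_cents": 1999}
first = consumer.handle(event)
second = consumer.handle(event)  # redelivery of the same event
```

The second call returns False and the customer is charged once, even though the event was delivered twice.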

Frequently asked questions

What is the saga pattern?
A sequence of local transactions per service that together achieve a cross-service business operation. Failures trigger compensating transactions to undo previous steps. Provides eventual consistency, not atomicity. The alternative to two-phase commit in microservices.

What is the difference between choreography and orchestration sagas?
Choreography: services react to events, no central coordinator. Simple but hard to observe overall flow. Orchestration: central coordinator (Temporal, Step Functions) holds the state machine. Easier to monitor and debug. Use orchestration for complex sagas with many failure paths.

What are compensating transactions?
The undo operations for saga steps that have committed. Must be idempotent. Not all operations are compensable (email, external API). Execute non-compensable steps last to minimize failure after them. Compensating transactions are new transactions that logically reverse the effect (semantic rollback), not physical undos.

What is the outbox pattern and why is it needed for sagas?
Writes event to an outbox table in the same DB transaction as the data update. A relay process publishes to Kafka with retry. Solves dual-write atomicity without distributed coordination. Consumers must be idempotent for the at-least-once delivery guarantee.
