When an operation spans multiple microservices, you cannot use a single database transaction. Two-phase commit (the distributed transaction protocol) holds locks across all participating services during the prepare phase — at scale, this creates contention and cascading timeouts. The saga pattern is the industry answer: break the operation into local transactions per service, with compensating transactions for rollback. This guide covers the mechanics, the two implementation approaches, and the failure modes that interviewers probe.
In a monolith, a multi-step operation (reserve inventory + charge payment + create shipment) executes in a single database transaction. If any step fails, the transaction rolls back atomically. No partial state, no compensating logic needed.
In a microservices architecture with three services (inventory-service, payment-service, fulfillment-service), each service has its own database. A "transaction" that spans all three would require two-phase commit (2PC): a coordinator asks all services to "prepare" (lock their data), then tells them to "commit" or "abort." 2PC has two problems. First, during the prepare phase all three services hold locks on the relevant rows, and a slow or failed participant forces the others to keep holding those locks until the coordinator detects the failure and aborts; at scale, this creates latency cascades. Second, the coordinator is a single point of failure: if it crashes after participants have prepared, they stay blocked with their locks held until it recovers. Most microservices architectures avoid 2PC entirely and use sagas instead.
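To make the locking cost concrete, here is a minimal in-memory sketch of the prepare/commit flow described above. The `Participant` class and `two_phase_commit` function are hypothetical names for illustration; a real coordinator would also need timeouts and a durable decision log, which this sketch omits.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Participant:
    """Hypothetical stand-in for one service and its database."""
    name: str
    can_prepare: bool = True
    locked: bool = False
    log: List[str] = field(default_factory=list)

    def prepare(self) -> bool:
        # Phase 1: vote. A "yes" vote takes locks that stay held until phase 2.
        if not self.can_prepare:
            self.log.append("vote-no")
            return False
        self.locked = True
        self.log.append("prepared")
        return True

    def commit(self) -> None:
        self.locked = False
        self.log.append("committed")

    def abort(self) -> None:
        self.locked = False
        self.log.append("aborted")


def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase 1: collect votes. Every participant that voted yes is now locked.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the decision. Locks are released only here, so the
    # slowest participant determines how long everyone else stays locked.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

Between the two phases every prepared participant sits with `locked = True`, which is the window in which a slow service or a crashed coordinator stalls the others.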
In a choreography saga, each service listens for events and reacts by performing its local transaction and publishing the next event. There is no central coordinator.
Example: e-commerce order saga. (1) Order service creates an order in PENDING state and publishes an OrderCreated event. (2) Inventory service receives OrderCreated, reserves stock, and publishes InventoryReserved (or InventoryReservationFailed). (3) Payment service receives InventoryReserved, charges the customer, and publishes PaymentCompleted or PaymentFailed. (4) Order service receives PaymentCompleted and marks the order as CONFIRMED. If payment fails: (4a) Order service receives PaymentFailed, marks the order as CANCELLED, and publishes an InventoryRelease command. (4b) Inventory service releases the reservation. If the reservation itself fails, the order service receives InventoryReservationFailed and cancels the order; nothing has committed downstream, so no compensation is needed.
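The event flow above can be sketched with an in-memory bus standing in for the message broker. Everything here is an assumption for illustration: the `EventBus` class, the `build_saga` wiring, synchronous delivery, and the CANCELLED order state on failure. Real services would run in separate processes and receive events asynchronously.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for a broker; delivers events synchronously."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # event types in publish order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


def build_saga(bus, stock, payment_succeeds):
    """Wire three hypothetical services to the bus; no central coordinator."""
    orders = {}

    def create_order(order_id, sku):              # order service, step 1
        orders[order_id] = "PENDING"
        bus.publish("OrderCreated", {"order_id": order_id, "sku": sku})

    def on_order_created(evt):                    # inventory service, step 2
        if stock.get(evt["sku"], 0) > 0:
            stock[evt["sku"]] -= 1
            bus.publish("InventoryReserved", evt)
        else:
            bus.publish("InventoryReservationFailed", evt)

    def on_inventory_reserved(evt):               # payment service, step 3
        outcome = "PaymentCompleted" if payment_succeeds else "PaymentFailed"
        bus.publish(outcome, evt)

    def on_payment_completed(evt):                # order service, step 4
        orders[evt["order_id"]] = "CONFIRMED"

    def on_payment_failed(evt):                   # order service, step 4a
        orders[evt["order_id"]] = "CANCELLED"
        bus.publish("InventoryRelease", evt)

    def on_inventory_release(evt):                # inventory service, step 4b
        stock[evt["sku"]] += 1

    def on_reservation_failed(evt):               # order service, no compensation
        orders[evt["order_id"]] = "CANCELLED"

    bus.subscribe("OrderCreated", on_order_created)
    bus.subscribe("InventoryReserved", on_inventory_reserved)
    bus.subscribe("PaymentCompleted", on_payment_completed)
    bus.subscribe("PaymentFailed", on_payment_failed)
    bus.subscribe("InventoryRelease", on_inventory_release)
    bus.subscribe("InventoryReservationFailed", on_reservation_failed)
    return create_order, orders
```

Note that the complete flow exists only as the sum of the subscriptions: no single function describes the saga end to end.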
Choreography works well for simple, linear sagas. Its weakness: the overall flow is distributed across all services — there's no single place to monitor saga progress or see the complete state machine. Debugging a stalled saga requires checking logs across all services. Cyclic event dependencies can emerge as sagas grow more complex.
In an orchestration saga, a central saga orchestrator (a dedicated service or a stateful workflow engine) commands each step and tracks the saga state.
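Example: the same order saga, orchestrated. The saga orchestrator holds the state machine: PENDING → INVENTORY_RESERVED → PAYMENT_PROCESSING → COMPLETED on the happy path, or, when a later step fails, PAYMENT_REFUNDED → INVENTORY_RELEASED → CANCELLED. Compensations run in the reverse order of the steps they undo, so a charged payment is refunded before the inventory reservation is released, and a step that never committed is skipped. The orchestrator sends commands to services and receives responses. It knows exactly which step the saga is in and handles failures by executing compensating commands.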
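A minimal sketch of such an orchestrator follows, assuming each step is a local function call rather than a command sent over the network. `Step`, `SagaOrchestrator`, `build_order_saga`, and the PAYMENT_CHARGED and SHIPMENT_CREATED state names are hypothetical; engines such as Temporal or Step Functions have their own APIs, which this does not imitate. On failure the sketch compensates the completed steps in reverse order.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    done_state: str                                   # recorded when the action succeeds
    action: Callable[[dict], None]
    compensate: Optional[Callable[[dict], None]] = None
    undone_state: str = ""                            # recorded after compensation


class SagaOrchestrator:
    def __init__(self, steps: List[Step]):
        self.steps = steps
        self.trail = ["PENDING"]                      # every state the saga has visited

    @property
    def state(self) -> str:
        return self.trail[-1]

    def run(self, ctx: dict) -> bool:
        completed: List[Step] = []
        for step in self.steps:
            try:
                step.action(ctx)
            except Exception:
                # Undo only the steps that committed, most recent first.
                for prev in reversed(completed):
                    if prev.compensate is not None:
                        prev.compensate(ctx)
                        self.trail.append(prev.undone_state)
                self.trail.append("CANCELLED")
                return False
            completed.append(step)
            self.trail.append(step.done_state)
        self.trail.append("COMPLETED")
        return True


def build_order_saga(shipment_ok: bool) -> SagaOrchestrator:
    def reserve(ctx): ctx["reserved"] = True
    def release(ctx): ctx["reserved"] = False
    def charge(ctx): ctx["charged"] = True
    def refund(ctx): ctx["charged"] = False

    def ship(ctx):
        if not shipment_ok:
            raise RuntimeError("fulfillment-service rejected the shipment")
        ctx["shipped"] = True

    return SagaOrchestrator([
        Step("INVENTORY_RESERVED", reserve, release, "INVENTORY_RELEASED"),
        Step("PAYMENT_CHARGED", charge, refund, "PAYMENT_REFUNDED"),
        Step("SHIPMENT_CREATED", ship),
    ])
```

Because `trail` lives in one object, answering "where is this saga stuck?" is a single lookup. A production orchestrator would persist that state after every transition so it can resume after a crash.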
Orchestration advantages: the saga state is visible in one place (queryable in the orchestrator's database). Failure handling is centralized — the orchestrator decides what compensating transactions to run. The flow is explicit in the orchestrator's code. Monitoring: track all in-flight sagas by querying the orchestrator's state store.
Orchestration disadvantages: the orchestrator is a dependency for all services in the saga — its failure stalls all in-flight sagas. It can become a bottleneck if not scaled appropriately. Technologies: Temporal, Conductor (Netflix), AWS Step Functions, or a custom saga state machine stored in a database.
A compensating transaction is the undo for a step that has already committed locally. Key properties:
Idempotent. Compensating transactions must be idempotent — executing them multiple times produces the same result as executing them once. If the compensating transaction for "release inventory reservation" is called twice (due to a retry), it must not release more inventory than was originally reserved. Implementation: check whether the reservation is already in a released state before attempting to release; use a unique compensation transaction ID to deduplicate.
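Both safeguards mentioned above (the state check and the compensation ID) can be sketched in a few lines. The `InventoryService` class and its method names are hypothetical, and the in-memory dictionaries stand in for database tables.

```python
class InventoryService:
    """Hypothetical inventory service with an idempotent release operation."""

    def __init__(self, available: int):
        self.available = available
        self.reservations = {}            # reservation_id -> {"qty", "status"}
        self.seen_compensations = set()   # compensation IDs already handled

    def reserve(self, reservation_id: str, qty: int) -> None:
        if qty > self.available:
            raise ValueError("insufficient stock")
        self.available -= qty
        self.reservations[reservation_id] = {"qty": qty, "status": "RESERVED"}

    def release(self, reservation_id: str, compensation_id: str) -> bool:
        """Return True only if this call actually released stock."""
        # Guard 1: the same compensation request was already processed.
        if compensation_id in self.seen_compensations:
            return False
        self.seen_compensations.add(compensation_id)
        # Guard 2: the reservation is unknown or no longer in RESERVED state.
        reservation = self.reservations.get(reservation_id)
        if reservation is None or reservation["status"] != "RESERVED":
            return False
        reservation["status"] = "RELEASED"
        self.available += reservation["qty"]
        return True
```

A retried release, whether it reuses the same compensation ID or arrives under a new one, leaves `available` unchanged after the first successful call.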
Not all operations are compensable. Sending an email, notifying a third party, or calling an external API that has no undo operation cannot be reversed. For non-compensable steps: execute them last in the saga, after all compensable steps have committed. This minimizes the window where a non-compensable action has occurred but the overall saga fails. If a non-compensable action has occurred and the saga fails after it, the remaining options are a manual correction process or a follow-up "reverse" notification to the user (for example, a cancellation email).
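The ordering rule is easy to check mechanically when a saga definition is reviewed. The helper below is a hypothetical sketch: it takes steps as `(name, compensable)` pairs in execution order and reports the first compensable step that is scheduled after a non-compensable one.

```python
from typing import List, Optional, Tuple


def misplaced_step(steps: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the first compensable step that runs after a non-compensable
    one, or None if every non-compensable step comes last."""
    non_compensable_seen = False
    for name, compensable in steps:
        if not compensable:
            non_compensable_seen = True
        elif non_compensable_seen:
            return name
    return None
```

For example, a saga that sends a confirmation email before reserving inventory would be flagged, because a failed reservation could no longer recall the email.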
Semantic rollback, not physical rollback. A compensating transaction doesn't undo the database row modification — it creates a new transaction that logically reverses the effect. A reserved inventory row is updated from RESERVED to AVAILABLE, not deleted. This preserves the audit trail — you can see that the reservation occurred and was subsequently released.
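As a sketch of what this looks like in storage, the example below uses SQLite with a hypothetical `reservations` table plus a `reservation_history` table. The history table is an assumption added here to make the audit trail explicit; the text above only requires that the row is updated rather than deleted.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE reservations (
            id TEXT PRIMARY KEY, sku TEXT NOT NULL,
            qty INTEGER NOT NULL, status TEXT NOT NULL
        );
        CREATE TABLE reservation_history (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            reservation_id TEXT NOT NULL, status TEXT NOT NULL
        );
        """
    )
    return conn


def reserve(conn, reservation_id, sku, qty):
    with conn:  # one local transaction: row plus history entry
        conn.execute(
            "INSERT INTO reservations VALUES (?, ?, ?, 'RESERVED')",
            (reservation_id, sku, qty),
        )
        conn.execute(
            "INSERT INTO reservation_history (reservation_id, status) "
            "VALUES (?, 'RESERVED')",
            (reservation_id,),
        )


def release(conn, reservation_id) -> bool:
    """Compensate by moving RESERVED to AVAILABLE; never DELETE the row."""
    with conn:
        cur = conn.execute(
            "UPDATE reservations SET status = 'AVAILABLE' "
            "WHERE id = ? AND status = 'RESERVED'",
            (reservation_id,),
        )
        if cur.rowcount == 1:
            conn.execute(
                "INSERT INTO reservation_history (reservation_id, status) "
                "VALUES (?, 'AVAILABLE')",
                (reservation_id,),
            )
        return cur.rowcount == 1
```

After a reserve followed by a release, the reservation row still exists and the history shows both transitions. The `AND status = 'RESERVED'` condition also makes a repeated release a no-op.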
Every saga step has a dual write problem: the service must both update its database and publish an event. If the database update succeeds but the message publish fails (network error, Kafka down), the saga stalls — the next service never receives the event to proceed.
The outbox pattern solves this atomically. Instead of publishing directly to Kafka, the service writes the event to an outbox table in the same database transaction as the data update. A separate relay process (often called the Transactional Outbox Processor) reads unprocessed outbox rows and publishes them to Kafka, retrying until success. The event is then marked as processed (or deleted). Since the outbox write is part of the same database transaction as the data update, they're always in sync — either both succeed or both fail.
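A minimal sketch of the pattern, using SQLite in place of the service database and a plain callable in place of a Kafka producer. The table layout, `create_order`, and `relay_once` are hypothetical names; a real relay would run continuously and handle concurrent relays, which this sketch does not.

```python
import json
import sqlite3


def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            processed INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def create_order(conn, order_id):
    # Data change and event are written in ONE local transaction:
    # either both rows are committed or neither is.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PENDING')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(conn, publish) -> int:
    """Publish unprocessed outbox rows in order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE processed = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, event_type, payload in rows:
        try:
            publish(event_type, json.loads(payload))
        except Exception:
            break  # row stays unprocessed and is retried on the next poll
        # A crash between publish and this update causes a duplicate publish
        # on the next poll, which is why delivery is at-least-once.
        with conn:
            conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the broker is unavailable, `relay_once` simply returns 0 and the event waits in the outbox, so the order row and its event never drift apart.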
The relay provides at-least-once delivery (it might publish the same event twice if it crashes between publishing and marking processed). Consumers must be idempotent — processing the same event twice must be safe. This is a foundational requirement for saga participants.
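One common way to get that idempotency is to deduplicate on a unique event ID. The sketch below assumes every event carries an `event_id` field and uses a hypothetical `PaymentConsumer` with in-memory state; in a real service the processed-ID record and the side effect would be committed in the same local transaction.

```python
class PaymentConsumer:
    """Hypothetical consumer that charges at most once per event ID."""

    def __init__(self):
        self.processed_event_ids = set()
        self.charges = []  # (order_id, amount_cents) actually charged

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.processed_event_ids:
            return False  # redelivery from the relay: safe to ignore
        self.charges.append((event["order_id"], event["amount_cents"]))
        self.processed_event_ids.add(event_id)
        return True
```

Delivering the same InventoryReserved event twice results in a single charge, so the relay is free to retry as often as it needs to.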
What is the saga pattern?
A sequence of local transactions, one per service, that together achieve a cross-service business operation. Failures trigger compensating transactions to undo previous steps. Provides eventual consistency, not the atomic, isolated commit of a single database transaction: intermediate states are visible while the saga runs. The alternative to two-phase commit in microservices.
What is the difference between choreography and orchestration sagas?
Choreography: services react to events, no central coordinator. Simple but hard to observe overall flow. Orchestration: central coordinator (Temporal, Step Functions) holds the state machine. Easier to monitor and debug. Use orchestration for complex sagas with many failure paths.
What are compensating transactions?
The undo operations for saga steps that have already committed. Must be idempotent. Not all operations are compensable (email, external API). Execute non-compensable steps last so that as few steps as possible can fail after them. Compensating transactions are new transactions that logically reverse the effect (for example, RESERVED to AVAILABLE), not physical undos.
What is the outbox pattern and why is it needed for sagas?
Writes event to an outbox table in the same DB transaction as the data update. A relay process publishes to Kafka with retry. Solves dual-write atomicity without distributed coordination. Consumers must be idempotent for the at-least-once delivery guarantee.