Scalability is a common challenge when developing applications. To scale out and keep services manageable, we often split a system’s responsibilities across multiple microservices. In a microservices architecture, each service manages its own database, and the type of database can differ between services. This diversity makes a two-phase commit hard to implement, and many services do not need strong consistency in the first place.

Figure: Distributed transaction without a saga. HTTP call to update inventory (service unavailable).

Let’s explore this issue using an example of an e-commerce platform, where we might have an order service and an inventory service. When a user places an order, the order service creates an entry in its database and needs to update the product inventory once the payment is successfully processed.

Since these are two separate services, potentially managed by different teams, one might use a relational database like PostgreSQL, while the other could rely on a NoSQL database like MongoDB.

When placing an order, we know the operation must be handled as a transaction. The order cannot be placed without updating the inventory, and the inventory cannot be updated without placing the order.
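To make the problem concrete, here is a minimal sketch of the naive approach. It assumes in-memory dictionaries stand in for the two databases and that a hypothetical `update_inventory` function stands in for the HTTP call; none of these names come from a real system.

```python
class ServiceUnavailable(Exception):
    """Raised when the simulated inventory service cannot be reached."""


# In-memory stand-ins for the two separate databases (assumption).
orders_db = {}
inventory_db = {"sku-1": 10}


def update_inventory(sku, quantity, available=True):
    # Simulates the HTTP call to the inventory service.
    if not available:
        raise ServiceUnavailable("inventory service is down")
    inventory_db[sku] -= quantity


def place_order_naive(order_id, sku, quantity, inventory_available=True):
    # Step 1: the order service commits to its own database.
    orders_db[order_id] = {"sku": sku, "quantity": quantity, "status": "PLACED"}
    # Step 2: the remote call can fail after step 1 is already committed.
    update_inventory(sku, quantity, available=inventory_available)


try:
    place_order_naive("order-1", "sku-1", 2, inventory_available=False)
except ServiceUnavailable:
    pass

# The order exists, but the stock was never reduced: the two stores disagree.
print(orders_db["order-1"]["status"], inventory_db["sku-1"])
```

Nothing undoes the order record when the second step fails, which is exactly the gap a saga is meant to close.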

Before going further, we need to understand what a transaction is. A transaction is a sequence of operations performed as a single logical unit of work that guarantees atomicity, consistency, isolation, and durability (ACID): it either completes in full or is rolled back entirely. Relational databases such as MySQL and PostgreSQL provide these properties to maintain data consistency. When a transaction spans several databases, the traditional way to keep strong consistency is a two-phase commit (2PC). In distributed systems, however, this is complex and hard to achieve.
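As an illustration of the all-or-nothing behavior inside a single database, the sketch below uses Python's built-in `sqlite3` module. The table names, the `CHECK` constraint, and the `place_order` helper are assumptions made for the example, not part of any system described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    "sku TEXT PRIMARY KEY, stock INTEGER NOT NULL CHECK (stock >= 0))"
)
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, sku TEXT NOT NULL, quantity INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES ('sku-1', 1)")
conn.commit()


def place_order(conn, sku, quantity):
    """Insert the order and reduce the stock as one atomic unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO orders (sku, quantity) VALUES (?, ?)", (sku, quantity)
            )
            conn.execute(
                "UPDATE inventory SET stock = stock - ? WHERE sku = ?",
                (quantity, sku),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative stock; the insert is undone too.
        return False


first = place_order(conn, "sku-1", 1)   # enough stock
second = place_order(conn, "sku-1", 1)  # stock would go negative
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock_left = conn.execute(
    "SELECT stock FROM inventory WHERE sku = 'sku-1'"
).fetchone()[0]
```

The second call leaves no trace in the `orders` table because both statements share one local transaction. Once the two tables live in different services, that guarantee is no longer available.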

To manage transactions in distributed systems, we can utilize the Saga pattern. In our previous article, we explored how distributed services interact through Choreography and Orchestration, as well as how to ensure data integrity and consistency using the Outbox pattern.

The Saga pattern can be implemented in two ways: one approach involves a central orchestrator managing the transaction lifecycle, while the other relies on choreography. Let’s delve into both approaches of the Saga pattern using real-life examples.

Saga Orchestration

Saga orchestration is a pattern used to manage transactions that span multiple microservices. Instead of relying on traditional distributed transactions (which are difficult to implement in microservices due to their independence), a saga splits the transaction into smaller, local transactions. Each service performs its task and then informs a central Orchestrator, which coordinates the workflow.

If one service fails, the orchestrator triggers compensating actions to undo the work of the previous services, restoring consistency across the system. This rollback mechanism is essential: it keeps the platform from being left in an inconsistent state when something goes wrong.

Figure: Distributed transaction with saga orchestration. Handling distributed transactions in the event of a local transaction failure.

In an e-commerce system, the process to place an order spans multiple services. The Order Service handles the order placement, followed by the Payment Service for processing payment. Once the payment is successful, the Inventory Service updates stock, and finally, the Notification Service sends an email to inform the user about the order status.

These services need to interact in sequence to complete an order, and if something fails (for example, the Inventory Service), the system needs to roll back the transaction gracefully. Here’s how saga orchestration handles this.

Step-by-Step Workflow:

  1. Order Creation

When a customer places an order, the first step in the workflow is creating the order in the system. The Order Service receives the request, creates a new order record, and marks the order as “PENDING” until the payment is processed.

Once the order is created, the Orchestrator is notified and takes control of the workflow. It then instructs the next microservice, the Payment Service, to process the payment for the order.

  2. Payment Processing

The Payment Service is responsible for charging the customer’s payment method. This could involve processing a credit card, using a third-party payment gateway, or another form of transaction.

If the payment is successful, the Payment Service informs the Orchestrator, and the transaction continues to the next step. However, if the payment fails—perhaps due to insufficient funds or a payment gateway error—the orchestrator is immediately notified, and the saga begins its compensation process.

  3. Inventory Update

Once the payment is successful, the orchestrator instructs the Inventory Service to reduce the stock quantity of the ordered products.

  4. Sending Notifications

Assuming the previous steps succeeded, the next step is to notify the customer that their order has been successfully placed. The Orchestrator instructs the Notification Service to send an order confirmation email or SMS to the customer.

This step completes the transaction. Once the notification is sent, the orchestrator updates the status of the order from “PENDING” to “COMPLETED,” and the saga ends successfully. Note: because the notification is optional for this transaction, the status can instead be set to “COMPLETED” as soon as the inventory update succeeds.
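The workflow above, together with the compensation step mentioned under Payment Processing, can be sketched as a small orchestrator. This is an illustrative sketch only: the `Step` and `SagaOrchestrator` classes and the plain functions standing in for the services are hypothetical, and the order is passed around as a dictionary for brevity. The notification step is left out because it is optional for the transaction. The sketch compensates only the steps that completed, on the assumption that a failing step leaves no local changes behind.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]        # local transaction (T1, T2, ...)
    compensation: Callable[[dict], None]  # undo for that transaction (C1, C2, ...)


class SagaOrchestrator:
    def __init__(self, steps: List[Step]):
        self.steps = steps

    def run(self, order: dict) -> bool:
        completed: List[Step] = []
        for step in self.steps:
            try:
                step.action(order)
                completed.append(step)
            except Exception as exc:
                order["log"].append(f"{step.name} failed: {exc}")
                # Undo the completed steps in reverse order.
                for done in reversed(completed):
                    done.compensation(order)
                return False
        order["status"] = "COMPLETED"
        return True


# Hypothetical stand-ins for the Order, Payment, and Inventory services.
def create_order(order): order["status"] = "PENDING"
def cancel_order(order): order["status"] = "CANCELED"
def charge(order): order["paid"] = True
def refund(order): order["paid"] = False

def reduce_stock(order):
    if order["quantity"] > order["stock"]:
        raise RuntimeError("insufficient stock")
    order["stock"] -= order["quantity"]

def restore_stock(order): order["stock"] += order["quantity"]


saga = SagaOrchestrator([
    Step("order", create_order, cancel_order),
    Step("payment", charge, refund),
    Step("inventory", reduce_stock, restore_stock),
])

order = {"quantity": 5, "stock": 2, "log": []}
result = saga.run(order)  # inventory fails, so payment and order are undone
```

Keeping each action next to its compensation makes it harder to add a step without also deciding how to undo it.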

Handling Failures and Rollbacks

Failures in any distributed system are inevitable. With the saga orchestration pattern, handling these failures becomes much more manageable. Let’s explore what happens when things don’t go as planned.

Payment Failure: If the Payment Service fails to process the payment (due to a technical issue or insufficient funds), the orchestrator initiates the compensation process. The Order Service is asked to cancel the order and set its status to “CANCELED”; that is, the compensating transactions C2 -> C1 are triggered.

The customer is not charged, and no notification is sent since the order did not go through. The orchestrator logs the failure, ensuring that the platform is aware of the unsuccessful transaction.

Inventory Failure: In the event that the Inventory Service fails to reduce the stock, it triggers a compensation process (C3) that cascades through the transaction, leading to C2 (refund payment) and C1 (cancel order). However, one crucial point to keep in mind is that each service must implement a retry mechanism. This ensures that temporary issues, such as network glitches or momentary downtime, do not result in immediate failure. By retrying, services can attempt to complete their tasks before reporting a failure, minimizing unnecessary rollbacks and ensuring smoother transaction flow.

Notification Failure: If the Notification Service fails (e.g., due to an issue with the email provider), the orchestrator might not need to roll back the entire transaction. Instead, it can log the failure and notify the system administrator that the customer wasn’t informed of the order. This is a non-critical error that can be handled separately from the core transaction.
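The retry mechanism recommended for the Inventory Service could look roughly like the following. It is a sketch under assumptions: `TransientError` is a hypothetical exception type marking failures worth retrying, and the delays use simple exponential backoff. The `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


class TransientError(Exception):
    """A failure worth retrying, e.g. a network glitch (hypothetical type)."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up; the caller reports the failure to the orchestrator
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated inventory call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_update_stock():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("inventory service timed out")
    return "STOCK_UPDATED"


result = call_with_retry(flaky_update_stock, sleep=lambda seconds: None)
```

Only after the final attempt fails does the error surface, at which point the compensation process starts.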

Can we effectively manage these states and their transitions based on different inputs? Yes, by modeling the process as a Finite State Machine (FSM).

An FSM lets us define each state (order placement, payment processing, inventory update, and notification) and map the transitions between them. These transitions are triggered by inputs such as Success or Failure at each step. For example, if the payment is successful, the FSM moves to the inventory update step; if a failure occurs, it transitions to the appropriate compensation actions. This structured approach helps manage complex workflows efficiently.

| Current State | Input/Condition | Next State | Action |
| --- | --- | --- | --- |
| Order Created (T1) | Success | Payment Processing (T2) | Proceed to payment processing |
| Order Created (T1) | Failure | Compensation (C1) | Cancel the order (C1) |
| Payment Processing (T2) | Success | Inventory Update (T3) | Proceed to inventory update |
| Payment Processing (T2) | Failure | Compensation (C2, C1) | Refund payment (C2) and cancel the order (C1) |
| Inventory Update (T3) | Success | Completion | Complete the order |
| Inventory Update (T3) | Failure | Compensation (C3, C2, C1) | Restore inventory (C3), refund payment (C2), cancel the order (C1) |
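The transition table can be encoded almost directly as a lookup. The sketch below is one possible encoding; the `State` enum, the `TRANSITIONS` dictionary, and the `run_saga` function are names chosen for this example.

```python
from enum import Enum


class State(Enum):
    ORDER_CREATED = "Order Created (T1)"
    PAYMENT_PROCESSING = "Payment Processing (T2)"
    INVENTORY_UPDATE = "Inventory Update (T3)"
    COMPLETED = "Completion"
    COMPENSATED = "Compensation"


# (current state, input) -> (next state, compensations to run in order)
TRANSITIONS = {
    (State.ORDER_CREATED, "SUCCESS"): (State.PAYMENT_PROCESSING, []),
    (State.ORDER_CREATED, "FAILURE"): (State.COMPENSATED, ["C1"]),
    (State.PAYMENT_PROCESSING, "SUCCESS"): (State.INVENTORY_UPDATE, []),
    (State.PAYMENT_PROCESSING, "FAILURE"): (State.COMPENSATED, ["C2", "C1"]),
    (State.INVENTORY_UPDATE, "SUCCESS"): (State.COMPLETED, []),
    (State.INVENTORY_UPDATE, "FAILURE"): (State.COMPENSATED, ["C3", "C2", "C1"]),
}

TERMINAL = {State.COMPLETED, State.COMPENSATED}


def run_saga(inputs):
    """Feed step outcomes into the FSM; return the final state and compensations."""
    state = State.ORDER_CREATED
    compensations = []
    for outcome in inputs:
        if state in TERMINAL:
            break  # ignore inputs that arrive after the saga has finished
        state, compensations = TRANSITIONS[(state, outcome)]
    return state, compensations


happy = run_saga(["SUCCESS", "SUCCESS", "SUCCESS"])
inventory_failed = run_saga(["SUCCESS", "SUCCESS", "FAILURE"])
payment_failed = run_saga(["SUCCESS", "FAILURE"])
```

Because every valid transition is listed explicitly, an unexpected combination raises a `KeyError` instead of silently moving the saga into an undefined state.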

Saga Choreography

In the Saga choreography pattern, there is no central orchestrator or coordinator to control the flow of transactions. Instead, services communicate through a message queue or an event bus. Each service listens for specific events or topics and reacts accordingly. Once a service completes its task, it publishes an event or command to signal the next service to continue the process.

This decentralized approach allows each service to handle its own part of the transaction independently. For instance, after the Order Service creates an order, it publishes an event. The Payment Service listens for that event, processes the payment, and then publishes another event for the Inventory Service to update stock. The flow continues in this manner, with each service both reacting to and publishing events to move the transaction forward.

Figure: Distributed transaction with saga choreography. Handling distributed transactions through decentralized event-driven choreography.

Let’s break down the step-by-step flow of choreography using the Order, Payment, and Inventory services. Each service communicates through events without a central orchestrator, making this an event-driven transaction management system.

Step-by-Step Workflow:

  1. Order Creation

When a customer places an order, the Order Service processes it and creates an entry for it. Once the order is successfully created, the service publishes an event called ORDER_CREATED to notify other services.

  2. Payment Processing

The Payment Service listens for the ORDER_CREATED event. Upon receiving it, the Payment Service initiates the payment process (e.g., charging the customer’s credit card). If the payment is successful, the Payment Service publishes a PAYMENT_COMPLETED event. Otherwise, it publishes a PAYMENT_FAILED event, which can trigger a rollback (e.g., canceling the order).

  3. Inventory Update

The Inventory Service listens for the PAYMENT_COMPLETED event. When it receives this event, it reduces the stock of the items in the order. If the stock is successfully updated, the Inventory Service publishes a STOCK_UPDATED event to continue the transaction flow. If the stock update fails (e.g., insufficient stock), it publishes a STOCK_UPDATE_FAILED event. This event can trigger compensating actions such as issuing a refund and canceling the order.

  4. Sending Notifications

The Notification Service listens for the STOCK_UPDATED event. When it receives the event, it sends a confirmation email to the customer, notifying them that their order is complete and ready for shipment. If there were earlier failures (e.g., payment or inventory update failures), the Notification Service can also listen to failure events such as PAYMENT_FAILED and STOCK_UPDATE_FAILED, or to a single ORDER_FAILED event, and notify the customer about the failure and the status of their order.

Handling Failures and Rollbacks

If a failure occurs at any step, such as payment failure or inventory update failure, compensating transactions are triggered via published failure events:

If the Payment Service fails to process the payment, it publishes a PAYMENT_FAILED event. The Order Service listens for this event and cancels the order. If the Inventory Service cannot update the stock, it publishes a STOCK_UPDATE_FAILED event, which triggers a refund in the Payment Service and an order cancellation in the Order Service.
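The event flow and its failure handling can be simulated with a tiny in-memory event bus. This is only a sketch: the `EventBus` class is a synchronous stand-in for a real message queue, the handler functions stand in for the services, and the `card_ok` flag is a made-up way to force a payment outcome. Having the Order Service mark the order “COMPLETED” on STOCK_UPDATED is also an assumption, since the steps above do not say which service does that.

```python
from collections import defaultdict


class EventBus:
    """Synchronous in-memory stand-in for a message queue or event bus."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # every published event type, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, order):
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(order)


def wire_services(bus, stock):
    # Order Service
    def create_order(order):
        order["status"] = "PENDING"
        bus.publish("ORDER_CREATED", order)

    def complete_order(order):  # assumption: Order Service reacts to STOCK_UPDATED
        order["status"] = "COMPLETED"

    def cancel_order(order):    # compensation for order creation
        order["status"] = "CANCELED"

    # Payment Service
    def process_payment(order):
        if order["card_ok"]:
            order["paid"] = True
            bus.publish("PAYMENT_COMPLETED", order)
        else:
            bus.publish("PAYMENT_FAILED", order)

    def refund_payment(order):  # compensation for the payment
        order["paid"] = False

    # Inventory Service
    def update_stock(order):
        if stock[order["sku"]] >= order["quantity"]:
            stock[order["sku"]] -= order["quantity"]
            bus.publish("STOCK_UPDATED", order)
        else:
            bus.publish("STOCK_UPDATE_FAILED", order)

    bus.subscribe("ORDER_CREATED", process_payment)
    bus.subscribe("PAYMENT_COMPLETED", update_stock)
    bus.subscribe("STOCK_UPDATED", complete_order)
    bus.subscribe("PAYMENT_FAILED", cancel_order)
    bus.subscribe("STOCK_UPDATE_FAILED", refund_payment)
    bus.subscribe("STOCK_UPDATE_FAILED", cancel_order)
    return create_order


bus = EventBus()
stock = {"sku-1": 1}
place_order = wire_services(bus, stock)

order = {"sku": "sku-1", "quantity": 2, "card_ok": True}
place_order(order)  # payment succeeds, stock is insufficient, saga compensates
```

No component knows the whole workflow here; the sequence only emerges from the subscriptions. That is the source of both the loose coupling and the difficulty of tracing what happened.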

Trade-offs Between Saga Orchestration and Choreography

When designing distributed systems, choosing between saga orchestration and saga choreography depends on various factors such as complexity, performance, and the flexibility of your architecture.

Orchestration offers centralized control, making it easier to manage complex workflows, but this can lead to tighter coupling and potential bottlenecks.

Choreography promotes loose coupling and flexibility, which allows for better scalability and resilience but increases the complexity of managing distributed events and tracking the workflow. Your decision should be based on the specific requirements of your system, including how critical centralized control, flexibility, and scalability are to your application’s success.

For a deeper understanding of the interaction mechanisms in distributed systems, see my previous article on choreography and orchestration.