Distributed Transaction Processing in Microservices

Problem Description: In a microservices architecture, a business operation often requires coordination across multiple services, each with its own independent database. How can we ensure that the data operations across all services either all succeed or all roll back, i.e., achieve consistency in distributed transactions?

Knowledge Explanation:

  1. Root Causes and Challenges

    • Limitations of Local Transactions: In monolithic applications, we can easily ensure data consistency using the database's ACID transactions (Atomicity, Consistency, Isolation, Durability). However, in microservices, each service's data is independent, making it impossible to directly use a single database transaction to cover data operations across multiple services.
    • Core Challenge: Distributed systems face various unreliable factors such as network issues, service failures, and database problems, which may result in some service calls succeeding while others fail, leading to data inconsistency.
  2. Solution Evolution: From Strong Consistency to Eventual Consistency
    Solutions for distributed transactions mainly fall into two categories: those pursuing strong consistency and those accepting eventual consistency.

    • Solution One: Two-Phase Commit (2PC)

      • Goal: Achieve strong consistency, mimicking the commit process of database transactions.
      • Roles:
        • Coordinator (Transaction Manager): An independent component responsible for uniformly scheduling all participants.
        • Participant (Resource Manager): Each microservice, responsible for executing local database operations.
      • Process:
        • Phase One: Prepare Phase
          1. The coordinator sends a "prepare request" along with the transaction content to all related participants.
          2. Each participant executes the local transaction operations (e.g., update, insert) but does not perform the final commit. The resources it touched stay locked until phase two.
          3. The participant returns a response to the coordinator: "Prepare successful (Yes)" or "Prepare failed (No)".
        • Phase Two: Commit/Rollback Phase
          1. Scenario A: If the coordinator receives all responses as "Yes":
            • The coordinator sends a "commit request" to all participants.
            • Upon receiving the commit, participants formally commit their local transactions, release locks, and return an "ACK" indicating completion.
            • After receiving all ACKs, the coordinator completes the entire transaction.
          2. Scenario B: If the coordinator receives any "No" response, or times out waiting for a response:
            • The coordinator sends a "rollback request" to all participants.
            • Upon receiving the rollback, participants roll back the operations performed during the prepare phase, release locks, and return an "ACK" indicating rollback completion.
      • Advantages: Provides strong consistency; when the protocol runs to completion, every participant has either committed or rolled back, so all nodes agree on the outcome.
      • Disadvantages:
        • Synchronous Blocking: From the prepare phase until the second phase ends, the resources locked by each participant are inaccessible to other transactions. This limits throughput and increases latency.
        • Single Point of Failure: The coordinator is crucial. If it fails after participants have voted "Yes", they keep holding their locks until it recovers, which can block the system.
        • Risk of Data Inconsistency: In the second phase, if only some participants receive the commit instruction (for example, because the coordinator or the network fails midway), some participants commit while others do not, and the data becomes inconsistent.
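The two phases described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production protocol: the `Participant` class, its `can_prepare` flag, and the `two_phase_commit` function are hypothetical names introduced here, and the sketch leaves out timeouts, durable logging, and coordinator recovery.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Vote(Enum):
    YES = "yes"
    NO = "no"


@dataclass
class Participant:
    """Hypothetical in-memory participant; a real one would wrap a database."""
    name: str
    can_prepare: bool = True   # assumption: lets us simulate a failed prepare
    state: str = "idle"        # idle -> prepared -> committed / rolled_back

    def prepare(self) -> Vote:
        # Phase one: do the work and hold locks, but do not commit yet.
        if not self.can_prepare:
            self.state = "failed"
            return Vote.NO
        self.state = "prepared"
        return Vote.YES

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Coordinator logic: returns True on commit, False on rollback."""
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())
        except Exception:
            # An error during prepare is treated like a "No" vote.
            votes.append(Vote.NO)

    # Phase two, scenario A: every participant voted Yes.
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit()
        return True

    # Phase two, scenario B: at least one No, so everyone rolls back.
    for p in participants:
        p.rollback()
    return False
```

Calling `two_phase_commit([Participant("order"), Participant("payment", can_prepare=False)])` returns `False` and leaves both participants in the `rolled_back` state, which corresponds to Scenario B.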
    • Solution Two: Patterns Based on Eventual Consistency (More Suitable for Microservices)
      Modern microservices architectures tend to favor eventual consistency solutions, trading strong consistency for higher availability and performance. The most widely used pattern is the Saga pattern.

      • Core Idea of the Saga Pattern: Break a distributed transaction into a series of local transactions, each with a corresponding compensating transaction designed to undo the effects of that local transaction.
      • Execution Methods:
        • Orchestration-based Saga:
          1. Introduce a Saga Orchestrator responsible for sequentially calling the local transactions of each Saga participant.
          2. If a local transaction succeeds, the orchestrator calls the next one.
          3. Key Point: If a local transaction fails, the orchestrator calls the compensating operations of all previously successful transactions, in reverse order.
          4. Example (Create Order Saga):
            • Transaction T1: Order Service -> Creates an order in "pending confirmation" status.
            • Transaction T2: Inventory Service -> Deducts inventory.
            • Transaction T3: Payment Service -> Processes the payment deduction.
            • Compensating Operations:
              • C1: Order Service -> Updates the order status to "cancelled".
              • C2: Inventory Service -> Restores the deducted inventory.
              • C3: Payment Service -> Executes a refund.
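            • Flow: Execute T1 -> Success -> Execute T2 -> Success -> Execute T3 -> Failure -> Execute C2 (compensate T2) -> Execute C1 (compensate T1). C3 is not executed because T3 never succeeded. In the end, every service is back in a state that is semantically equivalent to the one before the transaction began (the order record still exists, but is marked "cancelled").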
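The Create Order flow above can be sketched as a small orchestrator. This is an illustrative sketch under simplifying assumptions: the `run_saga` function, the `state` dictionary, and the step functions are hypothetical stand-ins for real service calls, and the payment step is hard-coded to fail so that the compensation path is exercised.

```python
from typing import Callable, Dict, List, Tuple

# Each step is (name, forward action, compensating action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"T:{name}")
            completed.append((name, action, compensate))
        except Exception:
            log.append(f"FAIL:{name}")
            # The failed step itself is not compensated, only earlier ones.
            for done_name, _, undo in reversed(completed):
                undo()
                log.append(f"C:{done_name}")
            return False
    return True


# Hypothetical in-memory stand-ins for the three services.
state: Dict[str, object] = {"order": None, "stock": 10}


def create_order() -> None:
    state["order"] = "pending confirmation"


def cancel_order() -> None:
    state["order"] = "cancelled"


def deduct_inventory() -> None:
    state["stock"] -= 1


def restore_inventory() -> None:
    state["stock"] += 1


def charge_payment() -> None:
    raise RuntimeError("payment declined")  # simulate T3 failing


def refund_payment() -> None:
    pass  # never called here, because T3 did not succeed


log: List[str] = []
succeeded = run_saga(
    [
        ("order", create_order, cancel_order),
        ("inventory", deduct_inventory, restore_inventory),
        ("payment", charge_payment, refund_payment),
    ],
    log,
)
```

After this run, `log` records T1, T2, the T3 failure, then C2 and C1, and the stock is back to its original value. The sketch does not handle a compensating operation that itself fails; a real orchestrator would persist its progress and retry.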
        • Choreography-based Saga (Event-Driven):
          1. No central orchestrator. Each service independently listens to and produces domain events.
          2. After executing its local transaction, a service publishes an event.
          3. The next service listens for that event, executes its own local transaction, and then publishes a new event.
          4. If a service execution fails, it publishes a failure event. Services listening for that event trigger their own compensating operations.
          • Example: Order Service creates an order and publishes an OrderCreated event -> Inventory Service listens and deducts inventory, publishing an InventoryUpdated event -> Payment Service listens and attempts payment deduction. If it fails, it publishes a PaymentFailed event -> Inventory Service listens to the failure event and restores the inventory, and Order Service listens to the same event and marks the order as "cancelled".
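The event flow in this example can be sketched with a tiny in-process event bus. Everything here is an assumption made for illustration: `EventBus` is a hypothetical synchronous stand-in for a real message broker, the payload fields (`order_id`, `sku`, `qty`, `amount`, `balance`) are invented, and the `PaymentSucceeded` event and the order handlers are additions that the example above does not name.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Hypothetical synchronous bus standing in for a message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.history.append(event)
        for handler in self.handlers[event]:
            handler(payload)


bus = EventBus()
stock: Dict[str, int] = {"widget": 5}
orders: Dict[str, str] = {}


def inventory_on_order_created(e: Dict) -> None:
    stock[e["sku"]] -= e["qty"]           # local transaction
    bus.publish("InventoryUpdated", e)    # then announce it


def payment_on_inventory_updated(e: Dict) -> None:
    if e["amount"] > e["balance"]:
        bus.publish("PaymentFailed", e)
    else:
        bus.publish("PaymentSucceeded", e)


def inventory_on_payment_failed(e: Dict) -> None:
    stock[e["sku"]] += e["qty"]           # compensating operation


def order_on_payment_failed(e: Dict) -> None:
    orders[e["order_id"]] = "cancelled"   # compensating operation


def order_on_payment_succeeded(e: Dict) -> None:
    orders[e["order_id"]] = "confirmed"


bus.subscribe("OrderCreated", inventory_on_order_created)
bus.subscribe("InventoryUpdated", payment_on_inventory_updated)
bus.subscribe("PaymentFailed", inventory_on_payment_failed)
bus.subscribe("PaymentFailed", order_on_payment_failed)
bus.subscribe("PaymentSucceeded", order_on_payment_succeeded)


def create_order(order_id: str, sku: str, qty: int, amount: int, balance: int) -> None:
    orders[order_id] = "pending confirmation"
    bus.publish("OrderCreated", {"order_id": order_id, "sku": sku, "qty": qty,
                                 "amount": amount, "balance": balance})
```

Calling `create_order("o1", "widget", 2, amount=100, balance=50)` triggers the failure path: inventory is deducted, payment fails, and both compensations run. No service calls another directly; each only reacts to events, which is the decoupling this style aims for.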
      • Advantages:
        • High Availability: No global locks; services are decoupled.
        • High Performance: Avoids long-term resource locking.
      • Disadvantages:
        • Eventual Consistency: Data may be inconsistent at certain moments (e.g., order is in "pending confirmation" status, but inventory has already been deducted).
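        • Design Complexity: Requires designing a corresponding compensating operation for each forward operation. Because a compensating operation must eventually succeed, it is typically retried until it does, and therefore has to be idempotent (running it more than once must have the same effect as running it once).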
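One common way to make a compensating operation safe to retry is to record which sagas have already been compensated. The sketch below is a minimal illustration of that idea, assuming each saga carries a unique identifier; `InventoryService`, `saga_id`, and `retry` are hypothetical names, and a real service would persist the set of handled identifiers in its own database rather than in memory.

```python
from typing import Callable, Optional, Set


class InventoryService:
    """Hypothetical service whose compensation is idempotent per saga."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._compensated: Set[str] = set()  # assumption: in-memory only

    def restore(self, saga_id: str, qty: int) -> None:
        # A repeated call for the same saga is a no-op.
        if saga_id in self._compensated:
            return
        self.stock += qty
        self._compensated.add(saga_id)


def retry(operation: Callable[[], None], attempts: int = 3) -> None:
    """Call operation until it succeeds or the attempts run out."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            operation()
            return
        except Exception as exc:
            last_error = exc
    raise RuntimeError("compensation failed; manual intervention needed") from last_error
```

With this in place, `retry(lambda: service.restore("saga-1", 1))` can be invoked again after a timeout or a duplicate message without restoring the inventory twice.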
  3. Summary and Selection

    • 2PC: Suitable for scenarios with extremely high requirements for strong consistency, a small number of participating services, and low performance demands (e.g., core financial transactions).
    • Saga: The preferred solution for handling distributed transactions in microservices architectures. By accepting eventual consistency within business-acceptable limits, it achieves high availability and scalability for the system. Orchestration-based Saga offers clearer control flow, while event-driven Saga provides greater decoupling between services.

By understanding the evolution from 2PC to Saga, you can see the trade-off between consistency and availability that the CAP theorem describes for distributed systems, which is also one of the core ideas in microservices design.