Coordinating Distributed Transactions in Go: The Saga Pattern in Production

In a system where one operation requires writing to the main database, charging a payment processor, sending a notification, and updating a cache, you cannot wrap all of that in a single database transaction. The saga pattern is the standard solution. This article shows how to implement orchestrated sagas in Go with durable state in PostgreSQL, idempotent steps, and reliable compensation logic.

Why two-phase commit does not work for most SaaS architectures

Two-phase commit (2PC) is the textbook answer to distributed transactions: a coordinator asks all participants to prepare, then commits or aborts all of them together. This works in database clusters built for it. In a SaaS backend with heterogeneous systems, 2PC is not practical.

Stripe does not implement a two-phase commit protocol. Neither do Twilio, SendGrid, or most of the other external services that SaaS products depend on. You cannot atomically commit a Stripe charge and a PostgreSQL write inside the same 2PC transaction.

The saga pattern takes a different approach: break the distributed operation into a sequence of local transactions, each of which can be undone by a corresponding compensating transaction if something downstream fails.

Orchestration vs choreography

There are two ways to implement sagas:

Choreography-based sagas have each service react to events. Service A completes its local transaction and publishes an event. Service B picks up the event and runs its transaction, then publishes another event. If Service B fails, it publishes a failure event that Service A listens to for compensation.

Orchestration-based sagas have a central coordinator that calls each service in sequence and tracks the overall state. If a step fails, the orchestrator calls compensation handlers in reverse order.

For most Go SaaS backends in production, orchestration is easier to debug. When a saga fails, you query the orchestrator state table and see exactly which step failed and which compensations ran. With choreography, you have to trace the failure across the event logs of several services and correlate them, which is significantly harder. Many SaaS teams, including many in MENA, do not have dedicated SRE capacity to investigate choreography failures efficiently.

Schema for the saga state store

The orchestrator needs durable state so it can recover from crashes mid-execution. A PostgreSQL table is the simplest reliable state store:

CREATE TABLE saga_runs (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    saga_type      TEXT NOT NULL,
    tenant_id      UUID,
    state          JSONB NOT NULL DEFAULT '{}',
    status         TEXT NOT NULL DEFAULT 'running',
    current_step   INT NOT NULL DEFAULT 0,
    error_message  TEXT,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ON saga_runs (status) WHERE status IN ('running', 'compensating');
CREATE INDEX ON saga_runs (tenant_id, created_at DESC);

The state JSONB column stores accumulated data that steps pass to each other: IDs created by previous steps, amounts computed earlier, external references. The current_step and status columns let the orchestrator resume a saga after a process crash.

Defining a saga in Go

type SagaStep struct {
    Name       string
    Execute    func(ctx context.Context, state map[string]any) error
    Compensate func(ctx context.Context, state map[string]any) error
}

type SagaDef struct {
    Type  string
    Steps []SagaStep
}

A concrete saga for processing a restaurant order that deducts inventory, records the order, and notifies the kitchen:

var ProcessOrderSaga = SagaDef{
    Type: "process_order",
    Steps: []SagaStep{
        {
            Name: "deduct_inventory",
            Execute: func(ctx context.Context, state map[string]any) error {
                items := mustDecodeItems(state["items"])
                return inventoryService.Deduct(ctx, items)
            },
            Compensate: func(ctx context.Context, state map[string]any) error {
                items := mustDecodeItems(state["items"])
                return inventoryService.Restore(ctx, items)
            },
        },
        {
            Name: "record_order",
            Execute: func(ctx context.Context, state map[string]any) error {
                orderID, err := orderService.Create(ctx, mustDecodeOrderData(state))
                if err != nil {
                    return err
                }
                state["order_id"] = orderID.String()
                return nil
            },
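            Compensate: func(ctx context.Context, state map[string]any) error {
                // order_id is absent if Create failed before returning an ID:
                // nothing was recorded, so there is nothing to cancel.
                raw, ok := state["order_id"].(string)
                if !ok {
                    return nil
                }
                orderID, err := uuid.Parse(raw)
                if err != nil {
                    return err
                }
                return orderService.Cancel(ctx, orderID)
            },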
        },
        {
            Name: "notify_kitchen",
            Execute: func(ctx context.Context, state map[string]any) error {
                orderID := uuid.MustParse(state["order_id"].(string))
                return kitchenService.Notify(ctx, orderID)
            },
            Compensate: func(ctx context.Context, state map[string]any) error {
                orderID := uuid.MustParse(state["order_id"].(string))
                return kitchenService.CancelNotification(ctx, orderID)
            },
        },
    },
}

The execution loop with crash recovery

func (o *Orchestrator) Execute(ctx context.Context, saga SagaDef, initialState map[string]any) error {
    run, err := o.db.LoadOrCreateRun(ctx, saga.Type, initialState)
    if err != nil {
        return err
    }

    for i := run.CurrentStep; i < len(saga.Steps); i++ {
        step := saga.Steps[i]

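        if err := step.Execute(ctx, run.State); err != nil {
            // Step failed: compensate from the failed step backwards. The failed
            // step is included because it may have partially applied, so every
            // Compensate must tolerate state that its Execute never wrote.
            o.compensate(ctx, saga.Steps, run.State, i)
            return o.db.MarkFailed(ctx, run.ID, i, err)
        }

        // Persist state and advance current_step after each successful step.
        // A crash before this commits re-runs step i on recovery (hence the
        // idempotency requirement); a crash after it resumes from i+1.
        if err := o.db.PersistProgress(ctx, run.ID, i+1, run.State); err != nil {
            return err
        }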
    }

    return o.db.MarkComplete(ctx, run.ID)
}

func (o *Orchestrator) compensate(ctx context.Context, steps []SagaStep, state map[string]any, failedAt int) {
    for i := failedAt; i >= 0; i-- {
        if steps[i].Compensate == nil {
            continue
        }
        if err := steps[i].Compensate(ctx, state); err != nil {
            // Compensation failed: enqueue for retry, alert operations
            o.handleCompensationFailure(ctx, steps[i].Name, state, err)
        }
    }
}

The PersistProgress call is the crash-recovery mechanism. Each step persists its outcome and advances the current_step counter atomically. On process restart, a background job loads all sagas with status = 'running' and resumes execution from current_step.
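The atomicity described here can come from writing the counter and the state in one UPDATE statement. What follows is a sketch of what PersistProgress might send to PostgreSQL, not the actual implementation: the progressUpdate helper and the `status = 'running'` guard are assumptions, and only the table and column names come from the schema above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// progressUpdateSQL advances the step counter and replaces the state in a
// single statement, so the two can never be persisted out of sync.
const progressUpdateSQL = `
UPDATE saga_runs
SET current_step = $2, state = $3::jsonb, updated_at = now()
WHERE id = $1 AND status = 'running'`

// progressUpdate returns the statement and arguments for one progress write.
// nextStep is the index of the first step that has NOT completed yet.
func progressUpdate(runID string, nextStep int, state map[string]any) (string, []any, error) {
	if nextStep < 0 {
		return "", nil, fmt.Errorf("invalid next step %d", nextStep)
	}
	encoded, err := json.Marshal(state)
	if err != nil {
		return "", nil, fmt.Errorf("encode saga state: %w", err)
	}
	// The JSON is passed as a string so the driver sends it as text.
	return progressUpdateSQL, []any{runID, nextStep, string(encoded)}, nil
}
```

The query and arguments would be passed to ExecContext on a *sql.DB or *sql.Tx. Checking that exactly one row was affected catches a run that another worker has already moved out of the running status.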

Idempotency: the non-negotiable requirement

Every saga step must be idempotent. If the process crashes after a step applies its effect but before PersistProgress commits, the step will re-execute on recovery. A step that charges a payment card twice is worse than a saga that fails.

The standard pattern is a stable idempotency key derived from the saga ID and step index:

func stepKey(sagaID uuid.UUID, stepIndex int, stepName string) string {
    return fmt.Sprintf("%s:%d:%s", sagaID, stepIndex, stepName)
}

Pass this key to external service calls. With Stripe's Idempotency-Key header, a repeated charge request with the same key returns the stored result of the first request rather than charging again. Stripe retains keys for a limited window (at least 24 hours), so recovery has to happen within it. For internal services, store the idempotency key alongside the result with a unique constraint:

CREATE TABLE idempotency_records (
    idempotency_key TEXT PRIMARY KEY,
    result          JSONB,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

Handling compensation failures

The hardest production scenario: a compensation step itself fails. For example, the kitchen notification fails after the order has been recorded. The orchestrator cancels the order successfully, but inventory restoration then fails because the inventory service is temporarily unavailable.

The practical approach used by SaaS teams running restaurant and POS systems in MENA:

  1. Retry compensations with exponential backoff via the background job queue
  2. Alert operations if compensations exhaust retries (N = 5 is a reasonable default)
  3. Write failed compensations to a saga_compensation_failures table for manual resolution
  4. Expose a simple admin endpoint that allows operations to manually re-trigger a compensation for a specific saga run

Human-in-the-loop for compensation failures is not a design flaw. It is the correct fallback for cases that automation cannot resolve safely.

Recovery job for in-flight sagas

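func (o *Orchestrator) RecoverStuckSagas(ctx context.Context) error {
    // Find sagas that have been 'running' for more than 5 minutes
    // (suggests the process that started them crashed)
    stuck, err := o.db.FindStuckRuns(ctx, 5*time.Minute)
    if err != nil {
        return err
    }
    for _, run := range stuck {
        run := run // capture per iteration (needed before Go 1.22)
        sagaDef, ok := o.registry[run.SagaType]
        if !ok {
            // Unknown saga type (for example removed in a deploy): log it
            // instead of skipping silently, or the run stays stuck unnoticed.
            log.Printf("saga recovery: no definition for type %q (run %v)", run.SagaType, run.ID)
            continue
        }
        // Execute must resolve to the existing run here. LoadOrCreateRun
        // therefore needs a way to find it (a run ID or dedupe key carried
        // in the state); otherwise recovery starts a duplicate saga.
        go func() {
            if err := o.Execute(ctx, sagaDef, run.State); err != nil {
                log.Printf("saga recovery: run %v failed: %v", run.ID, err)
            }
        }()
    }
    return nil
}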

This job runs on a 1-minute tick. It finds sagas whose process died mid-execution and resumes them. Combined with idempotent steps, this gives the system at-least-once execution semantics across process restarts.

Key lessons from production

Orchestrated sagas are easier to debug than choreographed ones for teams without deep event-streaming experience. The state table gives you a single place to look when investigating a failure.

Every saga step must be idempotent before the saga goes to production. Design this into the step interface from the start.

Compensations will fail in production under real operational conditions. Have a retry queue and a manual override path before you deploy.

Use PostgreSQL as your saga state store. It survives process crashes and gives you ordinary SQL to inspect state during incidents.

Not sure where to start?

Voxire builds backend systems for SaaS products and operational software across Lebanon and MENA. If you are designing a system that involves multi-step operations across services, payments, or inventory that need distributed consistency guarantees, reach out at https://voxire.com/get-a-quote/
