In a system where one operation requires writing to the main database, charging a payment processor, sending a notification, and updating a cache, you cannot wrap all of that in a single database transaction. The saga pattern is the standard solution, and orchestration-based sagas are significantly easier to debug in production than choreography-based ones.
In a system where one operation requires writing to the main database, charging a payment processor, sending a notification, and updating a cache, you cannot wrap all of that in a single database transaction. The saga pattern is the standard solution. This is how to implement orchestrated sagas in Go with durable state in PostgreSQL, idempotent steps, and reliable compensation logic.
Why two-phase commit does not work for most SaaS architectures
Two-phase commit (2PC) is the textbook answer to distributed transactions: a coordinator asks all participants to prepare, then commits or aborts all of them together. This works in database clusters built for it. In a SaaS backend with heterogeneous systems, 2PC is not practical.
Stripe does not implement a two-phase commit protocol. Neither does Twilio, SendGrid, or most external infrastructure SaaS products depend on. You cannot atomically commit a Stripe charge and a PostgreSQL write inside the same 2PC transaction.
The saga pattern takes a different approach: break the distributed operation into a sequence of local transactions, each of which can be undone by a corresponding compensating transaction if something downstream fails.
Orchestration vs choreography
There are two ways to implement sagas:
Choreography-based sagas have each service react to events. Service A completes its local transaction and publishes an event. Service B picks up the event and runs its transaction, then publishes another event. If Service B fails, it publishes a failure event that Service A listens to for compensation.
Orchestration-based sagas have a central coordinator that calls each service in sequence and tracks the overall state. If a step fails, the orchestrator calls compensation handlers in reverse order.
For most Go SaaS backends in production, orchestration is easier to debug. When a saga fails, you query the orchestrator state table and see exactly which step failed and which compensations ran. With choreography, tracing a failure across multiple service event logs and correlating them is significantly harder. Most MENA-based SaaS teams do not have dedicated SRE capacity to investigate choreography failures efficiently.
Schema for the saga state store
The orchestrator needs durable state so it can recover from crashes mid-execution. A PostgreSQL table is the simplest reliable state store:
CREATE TABLE saga_runs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
saga_type TEXT NOT NULL,
tenant_id UUID,
state JSONB NOT NULL DEFAULT '{}',
status TEXT NOT NULL DEFAULT 'running',
current_step INT NOT NULL DEFAULT 0,
error_message TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON saga_runs (status) WHERE status IN ('running', 'compensating');
CREATE INDEX ON saga_runs (tenant_id, created_at DESC);
The state JSONB column stores accumulated data that steps pass to each other: IDs created by previous steps, amounts computed earlier, external references. The current_step and status columns let the orchestrator resume a saga after a process crash.
Defining a saga in Go
type SagaStep struct {
Name string
Execute func(ctx context.Context, state map[string]any) error
Compensate func(ctx context.Context, state map[string]any) error
}
type SagaDef struct {
Type string
Steps []SagaStep
}
A concrete saga for processing a restaurant order that deducts inventory, records the order, and notifies the kitchen:
var ProcessOrderSaga = SagaDef{
Type: "process_order",
Steps: []SagaStep{
{
Name: "deduct_inventory",
Execute: func(ctx context.Context, state map[string]any) error {
items := mustDecodeItems(state["items"])
return inventoryService.Deduct(ctx, items)
},
Compensate: func(ctx context.Context, state map[string]any) error {
items := mustDecodeItems(state["items"])
return inventoryService.Restore(ctx, items)
},
},
{
Name: "record_order",
Execute: func(ctx context.Context, state map[string]any) error {
orderID, err := orderService.Create(ctx, mustDecodeOrderData(state))
if err != nil {
return err
}
state["order_id"] = orderID.String()
return nil
},
Compensate: func(ctx context.Context, state map[string]any) error {
orderID := uuid.MustParse(state["order_id"].(string))
return orderService.Cancel(ctx, orderID)
},
},
{
Name: "notify_kitchen",
Execute: func(ctx context.Context, state map[string]any) error {
orderID := uuid.MustParse(state["order_id"].(string))
return kitchenService.Notify(ctx, orderID)
},
Compensate: func(ctx context.Context, state map[string]any) error {
orderID := uuid.MustParse(state["order_id"].(string))
return kitchenService.CancelNotification(ctx, orderID)
},
},
},
}
The execution loop with crash recovery
func (o *Orchestrator) Execute(ctx context.Context, saga SagaDef, initialState map[string]any) error {
run, err := o.db.LoadOrCreateRun(ctx, saga.Type, initialState)
if err != nil {
return err
}
for i := run.CurrentStep; i < len(saga.Steps); i++ {
step := saga.Steps[i]
if err := step.Execute(ctx, run.State); err != nil {
// Step failed: run compensations from the step that just failed backwards
o.compensate(ctx, saga.Steps, run.State, i)
return o.db.MarkFailed(ctx, run.ID, i, err)
}
// Persist state after each successful step
// If the process crashes here, the next restart resumes from i+1
if err := o.db.PersistProgress(ctx, run.ID, i+1, run.State); err != nil {
return err
}
}
return o.db.MarkComplete(ctx, run.ID)
}
func (o *Orchestrator) compensate(ctx context.Context, steps []SagaStep, state map[string]any, failedAt int) {
for i := failedAt; i >= 0; i-- {
if steps[i].Compensate == nil {
continue
}
if err := steps[i].Compensate(ctx, state); err != nil {
// Compensation failed: enqueue for retry, alert operations
o.handleCompensationFailure(ctx, steps[i].Name, state, err)
}
}
}
The PersistProgress call is the crash-recovery mechanism. Each step persists its outcome and advances the current_step counter atomically. On process restart, a background job loads all sagas with status = 'running' and resumes execution from current_step.
Idempotency: the non-negotiable requirement
Every saga step must be idempotent. If the process crashes after a step applies its effect but before PersistProgress commits, the step will re-execute on recovery. A step that charges a payment card twice is worse than a saga that fails.
The standard pattern is a stable idempotency key derived from the saga ID and step index:
func stepKey(sagaID uuid.UUID, stepIndex int, stepName string) string {
return fmt.Sprintf("%s:%d:%s", sagaID, stepIndex, stepName)
}
Pass this key to external service calls. Stripe's Idempotency-Key header guarantees a repeated charge request with the same key returns the stored result rather than charging again. For internal services, store the idempotency key alongside the result with a unique constraint:
CREATE TABLE idempotency_records (
idempotency_key TEXT PRIMARY KEY,
result JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Handling compensation failures
The hardest production scenario: a compensation step itself fails. The order has been recorded and the kitchen notified, but inventory restoration fails because the inventory service is temporarily unavailable.
The practical approach used by SaaS teams running restaurant and POS systems in MENA:
- Retry compensations with exponential backoff via the background job queue
- Alert operations if compensations exhaust retries (N = 5 is a reasonable default)
- Write failed compensations to a
saga_compensation_failurestable for manual resolution - Expose a simple admin endpoint that allows operations to manually re-trigger a compensation for a specific saga run
Human-in-the-loop for compensation failures is not a design flaw. It is the correct fallback for cases that automation cannot resolve safely.
Recovery job for in-flight sagas
func (o *Orchestrator) RecoverStuckSagas(ctx context.Context) error {
// Find sagas that have been 'running' for more than 5 minutes
// (suggests the process that started them crashed)
stuck, err := o.db.FindStuckRuns(ctx, 5*time.Minute)
if err != nil {
return err
}
for _, run := range stuck {
sagaDef, ok := o.registry[run.SagaType]
if !ok {
continue
}
go o.Execute(ctx, sagaDef, run.State)
}
return nil
}
This job runs on a 1-minute tick. It finds sagas whose process died mid-execution and resumes them. Combined with idempotent steps, this gives the system at-least-once execution semantics across process restarts.
Key lessons from production
Orchestrated sagas are easier to debug than choreographed ones for teams without deep event-streaming experience. The state table gives you a single place to look when investigating a failure.
Every saga step must be idempotent before the saga goes to production. Design this into the step interface from the start.
Compensations will fail in production under real operational conditions. Have a retry queue and a manual override path before you deploy.
Use PostgreSQL as your saga state store. It survives process crashes and gives you ordinary SQL to inspect state during incidents.
Enjoying this article?
Enter your email and get a clean, formatted PDF of this article - free, no spam.
Not sure where to start?
Voxire builds backend systems for SaaS products and operational software across Lebanon and MENA. If you are designing a system that involves multi-step operations across services, payments, or inventory that need distributed consistency guarantees, reach out at https://voxire.com/get-a-quote/


