Distributed Tracing in Production Go Services on AWS ECS

Once you have more than two Go services talking to each other on AWS ECS, a slow request becomes a needle in a haystack. You see a 900ms response in your load balancer logs, but the load balancer does not tell you whether the latency came from your orders service, your inventory service, or a slow PostgreSQL query inside one of them. Without distributed tracing, you are left with log timestamps and guesswork. This post covers how we wire OpenTelemetry distributed tracing across Go services on ECS Fargate, propagate trace context through HTTP and gRPC calls, and run a collector as a sidecar so traces ship without touching business logic.

Why does distributed tracing matter in a Go microservices setup?

A single user request typically crosses three to six internal service boundaries in a production SaaS backend. The orders service calls inventory, inventory checks a pricing rule via the pricing service, the pricing service reads a Redis cache, and the whole thing writes to PostgreSQL. Each hop adds latency. Logs tell you what happened. Traces tell you where time went.

OpenTelemetry (OTel) is the industry standard for this. It gives you a vendor-neutral instrumentation API, a Go SDK, automatic HTTP and gRPC propagation, and a collector daemon that batches and exports spans to any backend. We use it across all our Go projects in Lebanon and the wider MENA region because it works the same whether you are exporting to Grafana Tempo on a $30/month VPS or to AWS X-Ray on a multi-region ECS cluster.

How do you set up the OpenTelemetry SDK in a Go service?

The setup breaks into two parts: configuring the tracer provider at startup, and then using the tracer inside your handlers and service layers.

First, initialize the provider in main.go or a dedicated telemetry package:

package telemetry

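import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// InitTracer configures the global tracer provider and propagator.
// The caller is responsible for calling Shutdown on the returned provider.
func InitTracer(ctx context.Context, serviceName, endpoint string) (*sdktrace.TracerProvider, error) {
    // Export spans over OTLP/HTTP to the collector.
    exp, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint(endpoint),
        otlptracehttp.WithInsecure(), // collector runs on localhost sidecar
    )
    if err != nil {
        return nil, fmt.Errorf("create exporter: %w", err)
    }
    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName(serviceName),
        semconv.DeploymentEnvironment("production"),
    )
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(res),
        // Head-based: sample 20% of new traces, follow the parent's decision otherwise.
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.20))),
    )
    otel.SetTracerProvider(tp)
    // Register W3C Trace Context (and Baggage) propagation. Without this the
    // global propagator is a no-op and no traceparent header is sent.
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))
    return tp, nil
}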

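The sampler here uses 20 percent head-based sampling: TraceIDRatioBased(0.20) decides when the root span starts, and ParentBased makes downstream services follow the caller's decision, so a trace is kept or dropped as a whole. For a SaaS under 50,000 requests per day on ECS Fargate, this keeps your trace storage costs under $5 per month. On its own it will miss some slow outliers, because the decision is made before the response time is known. An override for requests slower than 500ms therefore cannot live in this sampler. It has to run where finished spans are visible, for example as a latency policy in the collector's tail-sampling processor, and it can only keep spans the SDK actually exported to the sidecar. That combination, head-based sampling with a slow-request override, covers 90 percent of real debugging cases.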
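To make the sampling decision concrete, here is a simplified, stdlib-only model of how a ratio-based sampler can decide from the trace ID alone. It is a sketch under assumptions, not the SDK's code: shouldSample is a hypothetical name and the threshold arithmetic is only an approximation of the idea. What it illustrates is that the decision needs nothing but the trace ID, so it is made before any latency is known.

```go
package main

import "encoding/binary"

// shouldSample is a hypothetical, simplified model of a ratio-based sampler.
// It turns the low 8 bytes of the trace ID into a 63-bit number and keeps the
// trace when that number falls below ratio * 2^63.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	upperBound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}
```

Because the result depends only on the trace ID, every service that applies the same ratio to the same trace reaches the same answer, and none of them can take the eventual response time into account.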

In main.go, call InitTracer before starting the HTTP server, and defer a shutdown:

tp, err := telemetry.InitTracer(ctx, "orders-svc", "localhost:4318")
if err != nil {
    log.Fatal(err)
}
defer tp.Shutdown(context.Background())

How do you propagate trace context across HTTP service calls?

Trace context propagation is where most Go developers make a mistake. You cannot just pass a trace ID in a custom header. The W3C Trace Context standard defines a traceparent header that carries the trace ID, parent span ID, and sampling flags. OpenTelemetry handles this for you if you use the right middleware.
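For reference, a version 00 traceparent value has four dash-separated fields: version, trace ID (32 hex characters), parent span ID (16 hex characters), and trace flags (2 hex characters, lowest bit meaning "sampled"). The sketch below parses that shape with the standard library only. It is illustrative, not the OpenTelemetry propagator: traceParent and parseTraceParent are hypothetical names, and the parser checks field lengths but not every rule in the specification.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// traceParent holds the four fields of a version 00 traceparent header.
type traceParent struct {
	Version  string
	TraceID  string // 32 hex characters
	ParentID string // 16 hex characters
	Sampled  bool   // lowest bit of the trace-flags byte
}

// parseTraceParent splits the header and checks field lengths only.
// It does not reject all-zero IDs or non-hex characters in the ID fields.
func parseTraceParent(h string) (traceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return traceParent{}, errors.New("malformed traceparent")
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return traceParent{}, errors.New("malformed trace-flags")
	}
	return traceParent{
		Version:  parts[0],
		TraceID:  parts[1],
		ParentID: parts[2],
		Sampled:  flags[0]&0x01 == 1,
	}, nil
}
```

Feeding it 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 yields that trace ID, that parent ID, and Sampled set to true. In a real service you would not parse this yourself; the point is only to show what the middleware carries between services.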

For an outgoing HTTP client, wrap your http.Client with the OTel transport:

import (
    "net/http"
    "time"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func NewHTTPClient() *http.Client {
    return &http.Client{
        Transport: otelhttp.NewTransport(http.DefaultTransport),
        Timeout:   10 * time.Second,
    }
}

For the receiving service, wrap your HTTP router with otelhttp.NewHandler:

handler := otelhttp.NewHandler(mux, "orders-svc",
    otelhttp.WithTracerProvider(otel.GetTracerProvider()),
)

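With these two pieces in place, and a W3C Trace Context propagator registered through otel.SetTextMapPropagator, every HTTP call between your services automatically injects and extracts the traceparent header. Without a registered propagator the global default is a no-op, no header is sent, and each service starts its own trace. Once propagation works, the trace ID stays the same across all service boundaries, and in Grafana Tempo or AWS X-Ray you can see the full waterfall of a single user request across all services.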

How do you add custom spans for database calls and business logic?

Automatic HTTP instrumentation gives you service-level spans. For anything more specific, like a slow PostgreSQL query or a critical business rule evaluation, you add manual spans:

func (s *OrderService) CreateOrder(ctx context.Context, req CreateOrderRequest) (*Order, error) {
    ctx, span := otel.Tracer("orders-svc").Start(ctx, "CreateOrder")
    defer span.End()

    span.SetAttributes(
        attribute.String("tenant_id", req.TenantID.String()),
        attribute.Int("item_count", len(req.Items)),
    )

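    // Child span for the insert. A separate context keeps later calls in this
    // function from being parented to the database span.
    dbCtx, dbSpan := otel.Tracer("orders-svc").Start(ctx, "db.insert_order")
    order, err := s.repo.Insert(dbCtx, req)
    dbSpan.End()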
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }
    return order, nil
}

The tenant_id attribute is particularly useful in multi-tenant SaaS. When you see a slow trace, you can immediately filter by tenant and identify whether the slowness is tenant-specific (usually a data volume issue) or systemic (usually an index or a hot partition).

How do you run the OTel collector as an ECS sidecar?

The OpenTelemetry Collector runs as a second container in your ECS task definition. It receives spans from your Go service on localhost:4318, batches them, and exports to your tracing backend. Here is the key section of the task definition:

{
  "name": "otel-collector",
  "image": "otel/opentelemetry-collector-contrib:0.100.0",
  "essential": false,
  "portMappings": [],
  "environment": [
    { "name": "OTEL_LOG_LEVEL", "value": "warn" }
  ],
  "mountPoints": [
    {
      "sourceVolume": "otel-config",
      "containerPath": "/etc/otelcol-contrib"
    }
  ],
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/otel-collector",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "ecs"
    }
  }
}

The collector config (config.yaml) is either mounted from an ECS volume or baked into the image:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  memory_limiter:
    check_interval: 1s
    limit_mib: 100

exporters:
  otlphttp:
    endpoint: https://tempo.your-backend.com
    headers:
      Authorization: "Bearer ${TEMPO_TOKEN}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]

We set essential: false on the collector because if the tracing pipeline degrades, we want the business logic container to keep running. Tracing is observability infrastructure, not product functionality.

What does this look like in practice for MENA SaaS teams?

For one of our clients in Lebanon operating a restaurant management platform across 12 locations, distributed tracing was the first thing that showed us a real problem: the tenant-scoped inventory query was slow specifically for two tenants with more than 50,000 stock movements in the database. The trace showed that the inventory-svc span was 280ms while the same span for other tenants was 18ms. That led us straight to a missing composite index on (tenant_id, created_at) in the stock_movements table.

Without the trace, we would have been looking at aggregate database metrics showing overall query time was fine. The per-tenant visibility that tracing provides is what made the problem visible.

For startups running on a tighter budget, you can use Grafana Cloud's free tier (50GB traces per month) and route the OTel collector export there. That covers most early-stage SaaS backends in the region without any infrastructure cost.

What are the common mistakes teams make with OTel in production?

Sampling at 100 percent. This works in development but at 10,000 requests per day it generates roughly 1GB of trace data per week. Start at 10 to 20 percent with a slow-request override.

Not propagating context through goroutines. When you start a goroutine inside a traced function, you must pass the context explicitly. If you use context.Background() inside the goroutine, the span relationship breaks and you get orphaned spans.

Setting the collector as essential. If the collector crashes due to a bad export endpoint, your entire ECS task restarts. Mark it as non-essential.

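Forgetting to call span.End(). Use defer span.End() immediately after creating a span. A span that is never ended is never exported, so the operation silently disappears from the trace, and the span and its attributes stay in memory for as long as something still references them.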

Key lessons from production

Distributed tracing pays back its setup cost within the first production incident where you need it. The OpenTelemetry Go SDK is stable and well-maintained. The ECS sidecar pattern keeps tracing infrastructure out of your container images and makes it easy to swap backends. Start with HTTP auto-instrumentation, add manual spans around your database calls and critical business logic, and use tenant ID as a span attribute from day one.

Not sure where to start?

Voxire builds and operates Go backend systems for SaaS platforms across Lebanon and the MENA region. If you are running services on ECS and want to add observability without spending a week on configuration, reach out and we can walk you through the setup.

https://voxire.com/get-a-quote/