Blue-Green Deployment on AWS ECS for Zero-Downtime Go Service Releases

Every deployment is a moment of risk. The simplest way to eliminate deployment downtime on AWS ECS is blue-green deployment, but the operational details are where most teams get stuck. Here is how we implement it for Go services in production, with notes on what we have learned running it for SaaS products across Lebanon and the MENA region.

Why rolling deploys cause problems in Go services

ECS rolling updates are the default. They gradually replace old task instances with new ones. For stateless HTTP services with short request lifetimes, rolling updates usually work fine. But there are failure modes that appear only when you look closely.

During a rolling update, the old version and the new version coexist. If the new version has an incompatible database schema change, old instances hitting the new schema fail. If the new version changes the format of data written to a shared cache or message queue, old instances reading that data get corrupted state.

Rolling updates also have a narrow window where your ALB health checks have not yet failed the new task but users are already hitting it. If the new container starts up and passes a shallow /health check but then fails under load, you have a partial rollout in production before ECS detects the failure and pulls back.

Blue-green deployment eliminates the coexistence period for live traffic. The old version stays 100% live until the new version is fully validated, and the switch is atomic.

The ECS blue-green architecture

The setup uses two ECS services and two ALB target groups behind a single listener:

Service Blue: current production, receiving 100% of traffic.
Service Green: the new version is deployed here and validated before traffic shifts.
ALB Listener rule: forward to Blue by default, switchable to Green.
Target group health checks configured tightly enough to catch real failures.

When a release happens:

1. Deploy the new version to the Green service.
2. Wait for all Green tasks to pass health checks.
3. Run smoke tests or integration checks against Green through an internal test URL.
4. Shift traffic: update the ALB listener rule to forward to Green.
5. Monitor error rates on Green for five to ten minutes.
6. If stable: deregister Blue, then update Blue to the new version for the next deployment.
7. If not stable: revert the ALB rule to Blue in under 30 seconds.

Step 6 keeps both services on the same version after a successful release, ready for the next cycle.
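The release sequence above can be sketched as a small Go orchestration function. This is a minimal sketch, not our actual tooling: the `Deployer` interface, its method names, and `fakeDeployer` are hypothetical placeholders for whatever performs each step, and nothing here calls AWS.

```go
package main

import (
	"errors"
	"fmt"
)

// Deployer is a hypothetical abstraction over the release steps. Each method
// would wrap real tooling (ECS deploys, smoke tests, ALB weights, metrics).
type Deployer interface {
	DeployGreen(version string) error // deploy and wait for healthy Green tasks
	SmokeTest() error                 // checks against the internal test URL
	ShiftTo(color string) error       // point the ALB rule at "blue" or "green"
	Stable() (bool, error)            // verdict after the observation window
	SyncBlue(version string) error    // bring Blue to the new version
}

// Release runs the sequence. The returned bool is true when traffic was sent
// back to Blue after an unstable cutover.
func Release(d Deployer, version string) (bool, error) {
	if err := d.DeployGreen(version); err != nil {
		return false, fmt.Errorf("deploy green: %w", err)
	}
	if err := d.SmokeTest(); err != nil {
		// Traffic never left Blue, so there is nothing to revert.
		return false, fmt.Errorf("smoke test: %w", err)
	}
	if err := d.ShiftTo("green"); err != nil {
		return false, fmt.Errorf("shift to green: %w", err)
	}
	ok, err := d.Stable()
	if err != nil || !ok {
		if rerr := d.ShiftTo("blue"); rerr != nil {
			return false, fmt.Errorf("revert failed: %w", rerr)
		}
		return true, errors.New("green unstable: reverted to blue")
	}
	return false, d.SyncBlue(version)
}

// fakeDeployer records calls so the flow can be exercised without AWS.
type fakeDeployer struct {
	smokeErr error
	stable   bool
	calls    []string
}

func (f *fakeDeployer) DeployGreen(v string) error {
	f.calls = append(f.calls, "deploy:"+v)
	return nil
}

func (f *fakeDeployer) SmokeTest() error {
	f.calls = append(f.calls, "smoke")
	return f.smokeErr
}

func (f *fakeDeployer) ShiftTo(c string) error {
	f.calls = append(f.calls, "shift:"+c)
	return nil
}

func (f *fakeDeployer) Stable() (bool, error) {
	f.calls = append(f.calls, "observe")
	return f.stable, nil
}

func (f *fakeDeployer) SyncBlue(v string) error {
	f.calls = append(f.calls, "sync:"+v)
	return nil
}

func main() {
	bad := &fakeDeployer{stable: false}
	reverted, err := Release(bad, "v2")
	fmt.Println(reverted, err, bad.calls)
}
```

Keeping the revert inside the same function as the shift means an unstable cutover can never exit without an attempt to restore Blue.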

Terraform for the ALB configuration

The core infrastructure is two target groups and a weighted forward rule:

resource "aws_lb_target_group" "blue" {
  name        = "${var.service_name}-blue"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 10
    timeout             = 5
  }
}

resource "aws_lb_target_group" "green" {
  name        = "${var.service_name}-green"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 10
    timeout             = 5
  }
}

resource "aws_lb_listener_rule" "main" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 100

  action {
    type = "forward"
    forward {
      target_group {
        arn    = aws_lb_target_group.blue.arn
        weight = 100
      }
      target_group {
        arn    = aws_lb_target_group.green.arn
        weight = 0
      }
    }
  }

  condition {
    host_header {
      values = ["api.yourdomain.com"]
    }
  }
}

The initial state sends 100% to Blue and 0% to Green. The traffic shift is a Terraform change to those weight values, or a direct AWS API call for speed.

Health check endpoint in Go

The health check must be meaningful. A check that always returns 200 does not protect you from a broken deployment.

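// Health reports 200 only when the database answers a ping within 3 seconds.
// Requires the context, net/http, and time packages.
func (h *Handler) Health(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
	defer cancel()

	// Set the content type once for both outcomes. http.Error would
	// override it with text/plain even though the body is JSON.
	w.Header().Set("Content-Type", "application/json")

	if err := h.db.PingContext(ctx); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		w.Write([]byte(`{"status":"unhealthy","db":"unreachable"}`))
		return
	}

	w.WriteHeader(http.StatusOK)
	w.Write([]byte(`{"status":"ok"}`))
}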

This checks database connectivity. For services that depend on Redis or a message queue, add checks for those too. The 3-second timeout prevents a hung database connection from blocking the health check indefinitely.

The health check endpoint should be excluded from authentication middleware. The ALB health checker does not send auth headers.
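For services with more than one dependency, the same idea can be generalized to a set of named checks. The following is a minimal sketch under stated assumptions: the `Check` type, `HealthHandler`, and the `probe` helper are illustrative names, and the stub checks stand in for real clients such as a Redis or queue ping.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Check probes one dependency. It should return promptly once ctx is done.
type Check func(ctx context.Context) error

// HealthHandler runs every named check under one shared timeout and
// responds 503 if any of them fails.
func HealthHandler(checks map[string]Check, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()

		status := http.StatusOK
		result := map[string]string{"status": "ok"}
		for name, check := range checks {
			if err := check(ctx); err != nil {
				status = http.StatusServiceUnavailable
				result["status"] = "unhealthy"
				result[name] = "unreachable"
			}
		}

		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		_ = json.NewEncoder(w).Encode(result)
	}
}

// probe serves one request against stub checks and returns the status code.
// It exists only to exercise the handler without real dependencies.
func probe(dbUp, cacheUp bool) int {
	stub := func(up bool) Check {
		return func(context.Context) error {
			if up {
				return nil
			}
			return errors.New("down")
		}
	}
	h := HealthHandler(map[string]Check{
		"db":    stub(dbUp),
		"cache": stub(cacheUp),
	}, 3*time.Second)

	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(true, true))
	fmt.Println(probe(true, false))
}
```

A method value such as `db.PingContext` on a `*sql.DB` already has the `Check` signature, so it can be registered directly. The checks here run sequentially and share one deadline, which keeps the total response time under the ALB health check timeout as long as the shared timeout is shorter than it.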

Graceful shutdown integration

Blue-green is only zero-downtime if the containers shut down cleanly. When ECS drains a task, it sends SIGTERM. The Go service must finish in-flight requests before exiting.

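package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	router := http.NewServeMux() // placeholder: register application routes here

	srv := &http.Server{
		Addr:    ":8080",
		Handler: router,
	}

	// Serve in the background so main can wait for a signal.
	go func() {
		if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("server error: %v", err)
		}
	}()

	// ECS sends SIGTERM when it stops a task; SIGINT covers local Ctrl+C.
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
	<-quit

	log.Println("Shutting down...")
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Shutdown stops accepting new connections and waits for in-flight requests.
	if err := srv.Shutdown(ctx); err != nil {
		log.Fatalf("forced shutdown: %v", err)
	}
	log.Println("Server stopped cleanly")
}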

The shutdown window must be shorter than the ECS stopTimeout setting on the task definition, not equal to it. If ECS sends SIGKILL after 30 seconds and your Shutdown context is also 30 seconds, as in the example above, you have a race. Set the ECS stopTimeout to 60 seconds and the Go shutdown context to 45 seconds to give the server time to drain.

Also set the ALB target group deregistration delay to 30 seconds (the default is 300). Once a task starts deregistering, the ALB stops sending it new requests and allows up to 30 seconds for in-flight requests to complete.
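The relationship between the two timeouts is easy to get wrong when they live in different repositories, so it can help to assert it in CI. Below is a minimal sketch of such a check; the `DrainBudget` type and the 10-second margin are assumptions for illustration, not ECS settings.

```go
package main

import "fmt"

// DrainBudget holds the two timeouts, in seconds.
type DrainBudget struct {
	ShutdownSeconds int // timeout passed to srv.Shutdown in the Go service
	StopSeconds     int // stopTimeout configured for the ECS task
}

// Validate returns an error when the Go shutdown deadline does not finish at
// least marginSeconds before ECS would send SIGKILL.
func (b DrainBudget) Validate(marginSeconds int) error {
	if b.ShutdownSeconds+marginSeconds > b.StopSeconds {
		return fmt.Errorf(
			"shutdown %ds plus margin %ds exceeds stopTimeout %ds",
			b.ShutdownSeconds, marginSeconds, b.StopSeconds,
		)
	}
	return nil
}

func main() {
	// Equal timeouts race with SIGKILL and are rejected.
	fmt.Println(DrainBudget{ShutdownSeconds: 30, StopSeconds: 30}.Validate(10))
	// The 45s / 60s pairing recommended above passes.
	fmt.Println(DrainBudget{ShutdownSeconds: 45, StopSeconds: 60}.Validate(10))
}
```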

The traffic shift procedure

We run the traffic shift as a script rather than through Terraform to avoid a full plan-apply cycle during a time-sensitive deployment:

#!/bin/bash
set -euo pipefail

# Arguments: RULE_ARN BLUE_TG_ARN GREEN_TG_ARN [shift|revert]
RULE_ARN=$1
BLUE_ARN=$2
GREEN_ARN=$3
ACTION=${4:-shift}  # shift or revert

case "$ACTION" in
  shift)  BLUE_WEIGHT=0;   GREEN_WEIGHT=100 ;;
  revert) BLUE_WEIGHT=100; GREEN_WEIGHT=0 ;;
  *) echo "unknown action: $ACTION (expected shift or revert)" >&2; exit 1 ;;
esac

# The weights live on the listener rule defined in Terraform, so modify that
# rule rather than the listener's default action.
aws elbv2 modify-rule \
  --rule-arn "$RULE_ARN" \
  --actions '[{
    "Type": "forward",
    "ForwardConfig": {
      "TargetGroups": [
        {"TargetGroupArn": "'"$BLUE_ARN"'", "Weight": '"$BLUE_WEIGHT"'},
        {"TargetGroupArn": "'"$GREEN_ARN"'", "Weight": '"$GREEN_WEIGHT"'}
      ]
    }
  }]'

echo "Traffic shifted: Blue=$BLUE_WEIGHT% Green=$GREEN_WEIGHT%"

This script runs in under three seconds. A revert is the same script with revert as the fourth argument.
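If the shift is driven from Go tooling instead of bash, the forward action payload can be built with encoding/json rather than shell quoting. This is a sketch only: the type and function names are illustrative, the JSON field names mirror the payload used in the script, and the call to AWS itself is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror the JSON shape of a weighted forward action.
type targetGroup struct {
	TargetGroupArn string `json:"TargetGroupArn"`
	Weight         int    `json:"Weight"`
}

type forwardConfig struct {
	TargetGroups []targetGroup `json:"TargetGroups"`
}

type action struct {
	Type          string        `json:"Type"`
	ForwardConfig forwardConfig `json:"ForwardConfig"`
}

// ForwardActions returns the JSON actions payload for a shift or a revert.
// Any other mode is rejected instead of silently defaulting.
func ForwardActions(blueARN, greenARN, mode string) (string, error) {
	var blue, green int
	switch mode {
	case "shift":
		blue, green = 0, 100
	case "revert":
		blue, green = 100, 0
	default:
		return "", fmt.Errorf("unknown action %q (want shift or revert)", mode)
	}

	payload := []action{{
		Type: "forward",
		ForwardConfig: forwardConfig{TargetGroups: []targetGroup{
			{TargetGroupArn: blueARN, Weight: blue},
			{TargetGroupArn: greenARN, Weight: green},
		}},
	}}

	b, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Placeholder ARNs; real ones come from Terraform outputs.
	out, err := ForwardActions("arn:blue", "arn:green", "shift")
	fmt.Println(out, err)
}
```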

Database migration strategy for blue-green

Blue-green deployment moves traffic atomically, but database migrations do not. A migration that adds a NOT NULL column without a default will break the Blue service that is still running while Green runs the migration.

The rule we follow is expand-then-migrate, a form of the expand-contract pattern:

Deploy a version that can work with both old and new schema (additive changes only: add columns with defaults, add tables, add nullable columns).
Run the migration.
Deploy the version that uses the new schema exclusively.

This means some releases require two deployments. For dropping a column or renaming one, the sequence is longer: add new column, deploy to write both, deploy to read only new, drop old column. It is more work but it is the only approach that is safe.
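The "write both" phase of a column rename can be sketched in Go as follows. The column names (full_name as the old column, display_name as the new one) and the helper names are hypothetical, and the fallback read is an assumption about how rows that have not been backfilled yet would be handled.

```go
package main

import (
	"database/sql"
	"fmt"
)

// userRow mirrors a row in the middle of a rename.
type userRow struct {
	FullName    string         // old column, still read by the previous release
	DisplayName sql.NullString // new column, NULL until written or backfilled
}

// writeName dual-writes so old and new releases both see the value.
func writeName(r *userRow, name string) {
	r.FullName = name
	r.DisplayName = sql.NullString{String: name, Valid: true}
}

// readName prefers the new column and falls back to the old one for rows
// that have not been backfilled yet.
func readName(r userRow) string {
	if r.DisplayName.Valid {
		return r.DisplayName.String
	}
	return r.FullName
}

func main() {
	legacy := userRow{FullName: "Ada"} // written before the new column existed
	fmt.Println(readName(legacy))

	writeName(&legacy, "Ada L.")
	fmt.Println(readName(legacy), legacy.FullName)
}
```

Once every row has a non-NULL value in the new column, the fallback branch and the write to the old column can be removed in the release that precedes the drop.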

Monitoring the cutover period

Immediately after shifting traffic to Green, watch:

Error rate on Green target group (ALB HTTPCode_Target_5XX_Count metric).
Request latency P99 on Green (should be within 20% of Blue's baseline).
Application error logs in CloudWatch.
Database connection count (a buggy new version may leak connections).

We run a five-minute observation window before considering a release stable. If any metric deviates significantly, the revert is one command and takes three seconds. The Blue service is still running and fully warmed.
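The go/no-go decision at the end of the observation window can be written down as a small function so it is applied the same way every release. This is a sketch under stated assumptions: the `Cutover` type is illustrative, the 20% latency rule comes from the checklist above, and the maximum 5xx rate is a threshold you would choose yourself.

```go
package main

import "fmt"

// Cutover holds metrics sampled during the observation window.
type Cutover struct {
	BaselineP99Ms float64 // Blue's P99 latency before the shift
	GreenP99Ms    float64 // Green's P99 latency after the shift
	Requests      int     // requests served by Green in the window
	Target5xx     int     // HTTPCode_Target_5XX_Count for Green in the window
}

// ShouldRevert applies the 20% latency rule plus a maximum 5xx rate.
// maxErrorRate is an assumed threshold, for example 0.01 for 1%.
func ShouldRevert(c Cutover, maxErrorRate float64) (bool, string) {
	if c.GreenP99Ms > c.BaselineP99Ms*1.20 {
		return true, "p99 latency more than 20% above baseline"
	}
	if c.Requests > 0 && float64(c.Target5xx)/float64(c.Requests) > maxErrorRate {
		return true, "5xx rate above threshold"
	}
	return false, "stable"
}

func main() {
	healthy := Cutover{BaselineP99Ms: 100, GreenP99Ms: 115, Requests: 10000, Target5xx: 5}
	fmt.Println(ShouldRevert(healthy, 0.01))

	slow := Cutover{BaselineP99Ms: 100, GreenP99Ms: 130, Requests: 10000, Target5xx: 5}
	fmt.Println(ShouldRevert(slow, 0.01))
}
```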

Cost implications for Lebanon and MENA startups

Running two ECS services doubles the task count during the deployment window. For Fargate, this is a real cost. But the window is short: typically five to fifteen minutes per release. A service running two 0.25 vCPU / 512 MB tasks costs roughly $0.03 per hour on Fargate. A fifteen-minute deployment window adds $0.0075 per release.
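The arithmetic behind that figure is simple enough to keep next to your own numbers. The sketch below uses the rough hourly rate quoted above; the release count of 20 per month is a made-up example, and actual Fargate prices vary by region.

```go
package main

import "fmt"

// WindowCost returns the extra spend for running a duplicate service for
// the given number of minutes at the given hourly rate.
func WindowCost(hourlyUSD, minutes float64) float64 {
	return hourlyUSD * minutes / 60
}

func main() {
	perRelease := WindowCost(0.03, 15) // rate and window from the text above
	fmt.Printf("per release: $%.4f\n", perRelease)
	fmt.Printf("20 releases a month: $%.2f\n", perRelease*20) // hypothetical cadence
}
```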

For startups in Lebanon and the MENA region where deployment confidence directly affects team velocity, the cost is negligible against the benefit of reliable zero-downtime releases.

Key lessons from production

Blue-green on ECS is reliable but requires attention to a few details:

Health checks must be meaningful. A 200 response on a broken service defeats the purpose.
Graceful shutdown timeout must account for both Go's shutdown window and ECS drain time.
Database migrations must follow expand-contract. Never run a breaking migration alongside a blue-green swap.
Keep the revert script ready. The whole value of blue-green is that reverting is fast. Test the revert in staging before the first production release.
Use the ALB deregistration delay. Without it, in-flight requests to a draining task will fail.

The operational confidence that comes from knowing any deployment can be reverted in three seconds changes how teams approach releases in practice.
