Running Scheduled and Batch Tasks on AWS ECS with Go: What Works in Production

Every SaaS backend has work that runs on a schedule: nightly billing runs, daily report aggregation, periodic data cleanup, weekly export generation. How you schedule that work determines how reliably it runs and how easy it is to debug when it fails. This is how we run scheduled and batch tasks for Go backends on ECS.

The three approaches and when each breaks

There are three common patterns for running scheduled work in a SaaS backend:

In-process cron using a Go library. A package like robfig/cron runs a goroutine on a schedule within the main application process. Simple to implement, zero infrastructure overhead. It breaks as soon as you need to run more than one replica of your service, because all replicas will run the job simultaneously. It also breaks when the application process restarts, because jobs that were in progress are lost without a recovery mechanism.

AWS Lambda with EventBridge. EventBridge triggers a Lambda function on a cron expression. Low operational overhead, scales to zero when not running. It breaks on tasks that run longer than the Lambda 15-minute limit, on tasks that need more than 10 GB of memory, and on tasks where cold start latency matters for correctness (some financial batch jobs have timing requirements that cold Lambda starts can violate).

ECS Scheduled Tasks. EventBridge triggers an ECS task definition to run as a standalone ECS task. The task runs to completion and terminates. No Lambda limits, no cold start constraints, access to the full memory and CPU of the task definition. The main downside is that ECS task launch adds about 20 to 40 seconds of startup overhead before your code runs.

For Go SaaS backends already running on ECS, ECS Scheduled Tasks is the right default for most scheduled work. It can reuse the container image of your main service, and with it the environment variables and IAM roles you have already configured. You do not need to maintain a separate Lambda deployment.

Setting up an ECS Scheduled Task for a Go backend

The setup involves three components: an ECS task definition, an EventBridge rule, and an IAM role that allows EventBridge to launch ECS tasks.

The task definition for a scheduled task can reuse the same Docker image as your main service. The difference is the command override: instead of running the main HTTP server, you run a specific subcommand in the same binary.

In Go, the pattern is to implement all scheduled jobs as subcommands in the same binary using a CLI library like cobra or a simple switch on os.Args[1]:

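import (
    "log"
    "os"
)

// main dispatches on the first argument. With no argument it starts the HTTP
// server; with a known subcommand it runs that job to completion and returns.
func main() {
    if len(os.Args) < 2 {
        runHTTPServer()
        return
    }
    switch os.Args[1] {
    case "billing-run":
        runBillingJob()
    case "report-aggregation":
        runReportAggregation()
    case "cleanup":
        runDataCleanup()
    default:
        // log.Fatalf exits with status 1, so ECS records the task as failed.
        log.Fatalf("unknown command: %s", os.Args[1])
    }
}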

The ECS task definition for the scheduled job uses the same container image with a command override to ["billing-run"]. The EventBridge rule specifies the cron expression (e.g., cron(0 2 * * ? *) for 2 AM UTC daily) and the target as the ECS cluster and task definition.
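If the EventBridge target points at the service's existing task definition, the command override can be supplied as the target's input JSON, which follows the shape of the ECS RunTask overrides structure. The sketch below only builds that JSON; the container name "app" is an assumption and must match the container name in your task definition.

```go
package main

import "encoding/json"

// containerOverride mirrors one entry of "containerOverrides" in the
// ECS RunTask overrides structure.
type containerOverride struct {
    Name    string   `json:"name"`
    Command []string `json:"command"`
}

type taskOverrides struct {
    ContainerOverrides []containerOverride `json:"containerOverrides"`
}

// targetInput returns the JSON to set as the EventBridge target input so the
// named container runs the given subcommand instead of the HTTP server.
func targetInput(containerName string, command ...string) (string, error) {
    b, err := json.Marshal(taskOverrides{
        ContainerOverrides: []containerOverride{
            {Name: containerName, Command: command},
        },
    })
    return string(b), err
}
```

Calling targetInput("app", "billing-run") yields {"containerOverrides":[{"name":"app","command":["billing-run"]}]}, which can be pasted into the rule's target configuration or passed through your infrastructure-as-code tool.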

Handling failures and retries

ECS Scheduled Tasks do not retry by default when a task exits with a non-zero status code. You need to handle failure detection and notification explicitly.

The pattern that works: wrap the main job logic in a function whose deferred handler recovers panics and sends an alert if the job panicked or returned an error, then exit with a non-zero status so ECS records the failure. Log the start time, end time, and exit status to a structured log. A CloudWatch alarm on a custom metric that the job emits on completion catches failures without requiring you to parse log output.
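A minimal sketch of such a wrapper, using only the standard library (log/slog requires Go 1.21 or later). The names runJob and alert are hypothetical; alert stands in for whatever notification channel you use.

```go
package main

import (
    "fmt"
    "log/slog"
    "os"
    "time"
)

// runJob runs fn, converts a panic into an error, and writes one structured
// log record with the job name, start time, end time, and status.
// alert is a hypothetical hook (SNS, Slack, pager) called only on failure.
func runJob(name string, fn func() error, alert func(name string, err error)) (err error) {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
    start := time.Now().UTC()

    defer func() {
        // A panic in fn is recovered here and reported like any other error.
        if r := recover(); r != nil {
            err = fmt.Errorf("panic in job %s: %v", name, r)
        }
        status := "success"
        attrs := []any{"job", name, "start", start, "end", time.Now().UTC()}
        if err != nil {
            status = "failed"
            attrs = append(attrs, "error", err.Error())
            alert(name, err)
        }
        logger.Info("job finished", append(attrs, "status", status)...)
    }()

    return fn()
}
```

In main, the caller would exit with a non-zero status when runJob returns an error, so the failure is also visible in the stopped task's exit code.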

For jobs that must complete (billing runs, subscription renewals), implement idempotency in the job logic itself. The job should be safe to re-run if EventBridge triggers it twice on the same day (which can happen due to EventBridge's at-least-once delivery guarantee). Store a job execution record in PostgreSQL before processing. If a record for today already exists and is in a completed state, the job exits early.
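The execution-record check can be sketched as follows. ExecutionStore is a hypothetical interface over a job_executions table, and memStore is an in-memory stand-in so the example runs without a database. The sketch covers the duplicate-trigger case described above; it does not by itself stop two overlapping runs, which needs a lock or a stricter claim query.

```go
package main

// ExecutionStore is a hypothetical interface over a job_executions table.
// In PostgreSQL, Claim could be an
//   INSERT ... ON CONFLICT (job, run_date) DO NOTHING
// followed by reading the status column of the row.
type ExecutionStore interface {
    // Claim records a run for (job, runDate) and reports whether a run for
    // that key has already completed.
    Claim(job, runDate string) (alreadyCompleted bool, err error)
    MarkCompleted(job, runDate string) error
}

// memStore is an in-memory ExecutionStore used only to make the sketch runnable.
type memStore map[string]string

func (m memStore) Claim(job, runDate string) (bool, error) {
    key := job + "/" + runDate
    if m[key] == "completed" {
        return true, nil
    }
    m[key] = "running"
    return false, nil
}

func (m memStore) MarkCompleted(job, runDate string) error {
    m[job+"/"+runDate] = "completed"
    return nil
}

// runOnce executes work unless a run for (job, runDate) already completed.
// If work fails, the record stays in the running state and a re-run retries.
func runOnce(store ExecutionStore, job, runDate string, work func() error) (ran bool, err error) {
    done, err := store.Claim(job, runDate)
    if err != nil {
        return false, err
    }
    if done {
        return false, nil // duplicate trigger: exit early
    }
    if err := work(); err != nil {
        return true, err
    }
    return true, store.MarkCompleted(job, runDate)
}
```

Keying the record on the job name and the business date (rather than the wall-clock start time) is what makes a second trigger on the same day a no-op.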

For jobs that process a large number of records, process in batches with a cursor. Store the cursor position in PostgreSQL. If the job is interrupted (ECS task is stopped, container crashes), the next run resumes from where it left off.
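A sketch of the cursor loop, assuming records have monotonically increasing integer IDs. CursorStore, fetchFunc, and memCursor are hypothetical names; in production the store would be a row in PostgreSQL and fetch a keyset query such as WHERE id > $1 ORDER BY id LIMIT $2.

```go
package main

// CursorStore is a hypothetical persistence layer for the job cursor, for
// example a job_cursors(job, last_id) row updated after each batch.
type CursorStore interface {
    Load(job string) (lastID int64, err error)
    Save(job string, lastID int64) error
}

// fetchFunc returns up to limit record IDs greater than afterID, ascending.
type fetchFunc func(afterID int64, limit int) ([]int64, error)

// processInBatches resumes from the stored cursor and saves it after every
// batch, so an interrupted run repeats at most one batch on the next start.
func processInBatches(store CursorStore, job string, batchSize int, fetch fetchFunc, handle func(id int64) error) error {
    cursor, err := store.Load(job)
    if err != nil {
        return err
    }
    for {
        ids, err := fetch(cursor, batchSize)
        if err != nil {
            return err
        }
        if len(ids) == 0 {
            return nil // nothing left to process
        }
        for _, id := range ids {
            if err := handle(id); err != nil {
                return err
            }
        }
        cursor = ids[len(ids)-1]
        if err := store.Save(job, cursor); err != nil {
            return err
        }
    }
}

// memCursor is an in-memory CursorStore used only to make the sketch runnable.
type memCursor map[string]int64

func (m memCursor) Load(job string) (int64, error)  { return m[job], nil }
func (m memCursor) Save(job string, id int64) error { m[job] = id; return nil }
```

Because the cursor is saved after the batch rather than after each record, handle should itself be idempotent: the records of an interrupted batch are processed again on resume.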

Cost considerations for ECS Scheduled Tasks on Fargate

Fargate pricing is per second of CPU and memory usage, with a one-minute minimum. A nightly billing job that runs for 3 minutes on 0.5 vCPU and 1GB memory costs roughly $0.001 to $0.003 per run, depending on the region and on how long the image pull takes. For daily jobs, that is well under $1 per month.

For less frequent jobs (weekly, monthly), ECS Scheduled Tasks on Fargate is extremely cost-effective because you pay only for the compute used during the run, not for idle capacity. A monthly billing run of 8 minutes costs less than $0.02.
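The arithmetic behind these estimates is simple enough to keep in a helper. The prices below are an assumption: on-demand Linux/x86 rates for us-east-1 as commonly published at the time of writing. Check the current price list for your region before relying on the numbers.

```go
package main

// Assumed on-demand Fargate prices (Linux/x86, us-east-1). These are
// illustrative and vary by region and over time.
const (
    pricePerVCPUHour = 0.04048
    pricePerGBHour   = 0.004445
)

// fargateRunCost estimates the cost in USD of one task run.
// Fargate bills per second with a one-minute minimum.
func fargateRunCost(vcpu, memoryGB, seconds float64) float64 {
    if seconds < 60 {
        seconds = 60
    }
    hours := seconds / 3600
    return (vcpu*pricePerVCPUHour + memoryGB*pricePerGBHour) * hours
}
```

With these assumed rates, fargateRunCost(0.5, 1, 180) comes to about $0.0012 and an 8-minute run at the same size to about $0.0033. Billed time includes the image pull, so real runs cost slightly more than the job duration alone suggests.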

For jobs that run very frequently (every minute or every 5 minutes), the 20 to 40 second ECS task startup overhead becomes a significant fraction of the job duration. For high-frequency work, keep the processing in the main service process with distributed locking to ensure only one replica runs each execution.

Distributed locking for in-process jobs

When a job must run in-process (high frequency, low latency requirements) but the service runs multiple replicas, distributed locking prevents multiple replicas from running the same job concurrently.

The PostgreSQL advisory lock pattern works well in Go: before running a scheduled job, attempt to acquire a PostgreSQL advisory lock. If the lock is already held by another process, skip this execution. If the lock is acquired, run the job and release the lock on completion.

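// conn must be a dedicated connection (from pool.Acquire) that is held until
// the job finishes: advisory locks are session-scoped, so the later
// pg_advisory_unlock call has to run on this same connection.
func tryAcquireAdvisoryLock(ctx context.Context, conn *pgxpool.Conn, lockID int64) (bool, error) {
    var acquired bool
    err := conn.QueryRow(ctx,
        "SELECT pg_try_advisory_lock($1)", lockID).Scan(&acquired)
    return acquired, err
}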

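The lock ID is a stable integer associated with the specific job. Advisory locks taken with pg_try_advisory_lock belong to the database session, so with a connection pool the lock and the unlock must run on the same dedicated connection, held for the duration of the job. Release the lock with pg_advisory_unlock in a defer so it is released when the job exits, including on panic. PostgreSQL also releases the lock automatically when the session ends, for example if the database connection drops.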
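The full acquire, run, release cycle can be sketched as below. To keep the example limited to the standard library it uses database/sql in place of pgxpool; the same structure applies with a connection obtained from pool.Acquire. The helper names lockIDFor and withAdvisoryLock are hypothetical, and deriving the lock ID by hashing the job name is one possible choice, not a requirement.

```go
package main

import (
    "context"
    "database/sql"
    "hash/fnv"
)

// lockIDFor derives a stable 64-bit advisory lock key from a job name, so
// every replica computes the same key without a shared registry.
func lockIDFor(jobName string) int64 {
    h := fnv.New64a()
    h.Write([]byte(jobName))
    return int64(h.Sum64())
}

// withAdvisoryLock runs job only if this process wins the lock. It pins one
// connection because advisory locks belong to the session that took them.
func withAdvisoryLock(ctx context.Context, db *sql.DB, jobName string, job func(context.Context) error) (ran bool, err error) {
    conn, err := db.Conn(ctx)
    if err != nil {
        return false, err
    }
    defer conn.Close() // returns the pinned connection to the pool

    id := lockIDFor(jobName)
    var acquired bool
    err = conn.QueryRowContext(ctx, "SELECT pg_try_advisory_lock($1)", id).Scan(&acquired)
    if err != nil {
        return false, err
    }
    if !acquired {
        return false, nil // another replica holds the lock: skip this execution
    }
    // Unlock on the same session even if job panics or ctx was cancelled.
    // Deferred calls run in reverse order, so this runs before conn.Close.
    defer conn.ExecContext(context.Background(), "SELECT pg_advisory_unlock($1)", id)

    return true, job(ctx)
}
```

A scheduler tick would call withAdvisoryLock(ctx, db, "report-aggregation", runFn) on every replica and treat ran == false as a normal, silent skip.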

Observability for scheduled tasks

The most common issue with scheduled tasks in production is silent failure. The task ran, the container exited with code 0, but the job did nothing because a query returned unexpected results or an API call timed out silently.

The pattern that prevents silent failures: emit a custom CloudWatch metric at the end of each job execution, and only after the job has confirmed that it did the work it was expected to do (for example, that it processed a non-zero number of records). A CloudWatch alarm that fires when this metric has not arrived within the expected time window (e.g., no completion metric for the nightly billing job by 3 AM UTC) then catches both the job that never ran and the job that ran but did nothing.
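One way to emit such a metric without an SDK call is the CloudWatch Embedded Metric Format: a JSON log line that CloudWatch Logs turns into metrics. This is a sketch under the assumption that the task's stdout is shipped to CloudWatch Logs (for example through the awslogs log driver); if it is not, a PutMetricData call serves the same purpose. The namespace and metric names are illustrative.

```go
package main

import "encoding/json"

// completionMetric builds one Embedded Metric Format record for a finished
// job. unixMillis is the completion time, e.g. time.Now().UnixMilli().
// The namespace "ScheduledJobs" and the metric names are assumptions.
func completionMetric(job string, recordsProcessed int, unixMillis int64) (string, error) {
    b, err := json.Marshal(map[string]any{
        "_aws": map[string]any{
            "Timestamp": unixMillis,
            "CloudWatchMetrics": []map[string]any{{
                "Namespace":  "ScheduledJobs",
                "Dimensions": [][]string{{"JobName"}},
                "Metrics": []map[string]string{
                    {"Name": "JobCompleted", "Unit": "Count"},
                    {"Name": "RecordsProcessed", "Unit": "Count"},
                },
            }},
        },
        // Top-level members carry the dimension value and the metric values.
        "JobName":          job,
        "JobCompleted":     1,
        "RecordsProcessed": recordsProcessed,
    })
    return string(b), err
}
```

The job prints the returned line to stdout as its last step. The alarm on JobCompleted should be configured to treat missing data as breaching, and a second alarm on RecordsProcessed can flag runs that completed with zero records.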

For MENA SaaS operators monitoring across multiple time zones, set CloudWatch alarms to notify in UTC and include the local time equivalent in the alarm description. A billing alarm firing at 3 AM UTC is 5 AM or 6 AM in Beirut (depending on daylight saving time), 6 AM in Riyadh, and 7 AM in Dubai. Including local times in alarm descriptions reduces confusion during incident response.
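Computing the local equivalents from the IANA zone database avoids hand-conversion mistakes, especially for zones that observe daylight saving time. A small sketch; the function name localTimes and the output format are illustrative choices.

```go
package main

import (
    "fmt"
    "time"
    _ "time/tzdata" // embed the zone database so minimal container images work
)

// localTimes renders a UTC instant (RFC 3339) in each IANA zone, producing a
// line suitable for an alarm description.
func localTimes(utc string, zones ...string) (string, error) {
    t, err := time.Parse(time.RFC3339, utc)
    if err != nil {
        return "", fmt.Errorf("parse %q: %w", utc, err)
    }
    out := t.UTC().Format("15:04") + " UTC"
    for _, z := range zones {
        loc, err := time.LoadLocation(z)
        if err != nil {
            return "", fmt.Errorf("load zone %s: %w", z, err)
        }
        out += fmt.Sprintf(", %s %s", t.In(loc).Format("15:04"), z)
    }
    return out, nil
}
```

For example, localTimes("2024-01-15T03:00:00Z", "Asia/Beirut", "Asia/Riyadh", "Asia/Dubai") returns "03:00 UTC, 05:00 Asia/Beirut, 06:00 Asia/Riyadh, 07:00 Asia/Dubai"; for a July date the Beirut entry becomes 06:00.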

Key lessons from production

Use ECS Scheduled Tasks for most Go backend jobs. The startup overhead is acceptable for jobs that run hourly or less frequently. The operational overhead is low if you already run on ECS.

Implement idempotency in every scheduled job. EventBridge is at-least-once. Your job will run twice occasionally. Design for it.

Store execution state in PostgreSQL, not in memory. Container restarts happen. State that survives a restart is state that lets the job resume rather than start over.

Emit a heartbeat metric from long-running jobs. A job that processes 100,000 records silently and exits is hard to debug when something goes wrong at record 50,000. Emit progress metrics every few thousand records.

Test scheduled jobs locally by running them as CLI subcommands. The ability to run ./myapp billing-run --dry-run locally is worth the small amount of setup required to make the binary support subcommand mode.
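Supporting a --dry-run flag per subcommand takes only the standard flag package. A minimal sketch, where parseJobArgs is a hypothetical helper that receives the arguments after the binary name:

```go
package main

import (
    "flag"
    "fmt"
)

// parseJobArgs parses the arguments after the binary name, for example
// []string{"billing-run", "--dry-run"}, into a job name and a dry-run flag.
func parseJobArgs(args []string) (job string, dryRun bool, err error) {
    if len(args) == 0 {
        return "", false, fmt.Errorf("no subcommand given")
    }
    // One FlagSet per subcommand keeps job flags separate from server flags.
    fs := flag.NewFlagSet(args[0], flag.ContinueOnError)
    fs.BoolVar(&dryRun, "dry-run", false, "log intended changes without writing them")
    if err := fs.Parse(args[1:]); err != nil {
        return "", false, err
    }
    return args[0], dryRun, nil
}
```

The dispatch in main would call parseJobArgs(os.Args[1:]) and pass dryRun to the job, which then logs what it would have written instead of writing it.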