Database Schema Migrations in Go Production Services: What Actually Works

Schema migrations are where production confidence goes to die. A clean migration script in development can lock a 50-million-row table for four minutes in production. A rollback that looks straightforward in staging is irreversible after production data has been written into the new schema. A migration that works perfectly in a single-tenant deployment fails silently in a multi-tenant system because it affects the wrong schema namespace.

These are not hypothetical problems. They are the specific failure modes that teams working on production Go backends encounter when they grow beyond their first few hundred tenants.

Why manual migration scripts fail at scale

The earliest version of schema management in most projects is a folder of numbered SQL files with a convention like 001_create_users.sql, 002_add_email_index.sql. A developer runs the new files manually when deploying. This works until:

Two developers create migration 007 independently. One gets merged first. The other has a conflict in the file name and gets renumbered to 008, but the content assumed 007 would already be applied. Databases in different environments now have different schemas applied in different orders.

Deployment automation requires knowing which migrations have already been applied. Without a tracking table, the automation has to either re-run everything (dangerous if migrations are not idempotent) or rely on a convention that breaks as the team grows.

The migration file has a syntax error that is only caught at deployment time on a production database. The application deployment fails, the migration is partially applied, and rolling back requires manual intervention.

golang-migrate: the standard and its limitations

golang-migrate is the standard migration tool for Go projects. It maintains a schema_migrations table that tracks which migrations have been applied. It supports PostgreSQL, MySQL, SQLite, and others. It can be embedded in your application binary and run migrations at startup, or used as a CLI tool in your deployment pipeline.

The basic setup:

import "github.com/golang-migrate/migrate/v4"
import _ "github.com/golang-migrate/migrate/v4/database/postgres"
import _ "github.com/golang-migrate/migrate/v4/source/file"

func RunMigrations(databaseURL string) error {
    m, err := migrate.New(
        "file://migrations",
        databaseURL,
    )
    if err != nil {
        return fmt.Errorf("creating migrator: %w", err)
    }
    defer m.Close()

    if err := m.Up(); err != nil && err != migrate.ErrNoChange {
        return fmt.Errorf("running migrations: %w", err)
    }
    return nil
}

Migration files follow the naming convention {version}_{title}.up.sql and {version}_{title}.down.sql. The version must be unique and increasing. Most teams use timestamps as version numbers to avoid conflicts when multiple developers are working simultaneously:

20260526140000_add_tenant_settings_table.up.sql
20260526140000_add_tenant_settings_table.down.sql

The limitations of golang-migrate that matter in production:

It acquires an advisory lock on the database to prevent concurrent migrations. This is correct behavior. But if a migration is running and your application starts on a second instance before the migration completes, the second instance will block waiting for the lock. Design your deployment to account for this.

It has no built-in support for migration timeouts. A migration that runs for 20 minutes will hold the lock for 20 minutes. Set a statement timeout at the connection level if you need to bound migration duration.

Down migrations are theoretically supported but practically unreliable. The down migration for "add a NOT NULL column to a 10-million-row table" is not just ALTER TABLE DROP COLUMN. It may include data that was written to that column that is now orphaned, application code that depends on the column, and foreign keys that reference rows created during the up migration window. In practice, rolling back a production database migration is an emergency procedure, not a routine deployment step.

The ALTER TABLE problem on large tables

ALTER TABLE ... ADD COLUMN on a table with tens of millions of rows takes an exclusive lock for the duration. In PostgreSQL 11 and later, adding a column with a non-volatile default value is instant because the default is stored at the table level rather than rewriting each row. But many teams still run into this.

The cases that are not instant and still require locks:

Adding a NOT NULL column with no default: PostgreSQL must verify or set the value for every existing row. The solution is a three-step migration.

-- Step 1: Add nullable column (instant)
ALTER TABLE orders ADD COLUMN processing_notes TEXT;

-- Step 2: Backfill in batches (slow, but no lock held continuously)
UPDATE orders SET processing_notes = '' WHERE id IN (
    SELECT id FROM orders WHERE processing_notes IS NULL LIMIT 10000
);
-- Repeat until complete

-- Step 3: Add NOT NULL constraint using NOT VALID + VALIDATE
ALTER TABLE orders ADD CONSTRAINT orders_processing_notes_not_null
    CHECK (processing_notes IS NOT NULL) NOT VALID;
-- Later, in a separate migration:
ALTER TABLE orders VALIDATE CONSTRAINT orders_processing_notes_not_null;

The NOT VALID trick is important. It adds the constraint without validating existing rows (instant), then VALIDATE CONSTRAINT validates existing rows while only holding a SHARE UPDATE EXCLUSIVE lock, which does not block normal read/write operations.

Creating an index on a large table is similar. Use CREATE INDEX CONCURRENTLY which builds the index without locking the table. It takes longer than a regular index build but can be done in production without downtime.

The expand/contract pattern for zero-downtime schema changes

For multi-tenant SaaS with multiple application instances running simultaneously, schema changes must be backward and forward compatible. An application instance running the old code must be able to read and write the database while another instance running the new code is also operating. This rules out any migration that removes a column, renames a column, or changes a column's type in a way that breaks the old code.

The expand/contract pattern (also called the additive migration pattern) handles this:

Phase 1 (Expand): Add the new column or table alongside the old one. The new column is optional. Old application code ignores it. New application code writes to both old and new.

Phase 2 (Migrate): Run a background job to backfill data from old column to new column for existing rows.

Phase 3 (Contract): Once all application instances are running code that uses the new column exclusively, and the backfill is complete, remove the old column. This removal can happen in a separate deployment after sufficient time has passed.

-- Phase 1: Add new column (this migration deploys with old code still running)
ALTER TABLE tenants ADD COLUMN settings_v2 JSONB;

-- Application code: write to both settings (old) and settings_v2 (new)
-- when both exist, read from settings_v2 preferentially

-- Phase 2: Background backfill
UPDATE tenants SET settings_v2 = settings::jsonb
WHERE settings_v2 IS NULL;

-- Phase 3: Drop old column (this migration deploys after all instances use settings_v2)
ALTER TABLE tenants DROP COLUMN settings;
ALTER TABLE tenants RENAME COLUMN settings_v2 TO settings;

This is more work than a single migration but it is the only safe approach when multiple application instances share a database and you deploy continuously.

Testing migrations in CI

Every migration file should be tested in CI before it reaches production. The test is simple: apply all migrations from scratch on an empty database and verify the final schema matches expectations. Some teams also run a subset of their test suite after migrations to catch the case where a migration breaks existing queries.

For multi-tenant systems where schema changes must not affect other tenants, the CI test should apply migrations on a multi-tenant test database and verify that only the intended tenants and schemas are affected.

Multi-tenant schema considerations

Multi-tenant PostgreSQL systems typically use one of three isolation models: shared tables with a tenant_id column, separate schemas per tenant, or separate databases per tenant.

golang-migrate works naturally with the shared-table model. For the separate-schema model, you need to run migrations against each tenant schema individually, which requires a migration runner that iterates over all tenant schemas. This is manageable for small tenant counts but requires careful design at hundreds of tenants.

Teams migrating from legacy systems in Lebanon and the MENA region, particularly from MySQL or MSSQL, frequently encounter the separate-database isolation model. Running golang-migrate across hundreds of individual databases per tenant requires a migration orchestration layer on top of golang-migrate, not just the library itself.

Key lessons from production

Never rely on the down migration for production rollbacks. Treat the down migration as documentation of what to undo, not as an automated rollback tool.

Use CREATE INDEX CONCURRENTLY for all index additions on tables that receive production traffic. The extra time is worth avoiding the lock.

Use the expand/contract pattern for any schema change that would break running application code. Build the habit, not the exception.

Run migrations as a separate step from application deployment, not at application startup in a multi-instance environment. The advisory lock approach means migration-at-startup serializes instances, and in a container orchestration environment like ECS or Kubernetes this creates unpredictable behavior.

For teams in Lebanon and the MENA region running production databases on AWS RDS PostgreSQL, the same migration patterns apply. RDS does not change the locking behavior of DDL operations.