Per-Tenant API Rate Limiting in Go: Production Patterns for SaaS Backends

An API without rate limiting is an API with an outage waiting to happen. In a multi-tenant SaaS backend, a single misbehaving tenant can spike the request rate high enough to degrade the experience for every other tenant on the system. This is the per-tenant rate limiting architecture we run in production Go SaaS backends serving clients across Lebanon and the MENA region.

Why per-tenant matters more than global rate limiting

Global rate limiting rejects all requests above a total system threshold. This protects the server from overload but creates a fairness problem: one tenant spiking requests can exhaust the global quota and trigger 429 errors for every other tenant on the platform.

Per-tenant rate limiting gives each tenant their own independent quota bucket. A tenant sending 500 requests per second against a 100/minute limit gets throttled at their own limit. Every other tenant continues operating normally, unaffected by their neighbor's behavior.

In MENA SaaS with a diverse client mix, a single global limit is impossible to calibrate correctly. Enterprise tenants may legitimately send hundreds of API calls per minute as part of normal operations (batch jobs, automated reporting, webhook integrations). Small business tenants send requests in the single digits per minute. A limit that works for one breaks the other.

Per-tenant limits solve this by making the quota a function of the tenant's pricing tier: Standard tier gets 60 requests per minute, Pro gets 300, Enterprise gets a custom configured limit negotiated at contract time.
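As a minimal sketch of the tier-to-quota mapping just described: the Tier type, the constant names, and the RequestsPerMinute function are hypothetical names chosen for illustration, and the fallback choices (unknown tiers get the Standard quota, Enterprise without a configured value gets the Pro quota) are assumptions rather than part of the original design.

```go
package main

// Tier is a hypothetical pricing-tier identifier used only for illustration.
type Tier string

const (
	TierStandard   Tier = "standard"
	TierPro        Tier = "pro"
	TierEnterprise Tier = "enterprise"
)

// RequestsPerMinute resolves a tenant's quota from its pricing tier.
// customLimit is the contract-negotiated value and is only consulted for
// Enterprise tenants.
func RequestsPerMinute(tier Tier, customLimit int) int {
	switch tier {
	case TierPro:
		return 300
	case TierEnterprise:
		if customLimit > 0 {
			return customLimit
		}
		// Assumption: an Enterprise tenant with no configured limit
		// falls back to the Pro quota.
		return 300
	default:
		// Assumption: Standard and any unknown tier get the Standard quota.
		return 60
	}
}
```

Keeping this resolution in one function means the limiter code only ever sees a number and never needs to know about pricing tiers.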

Token bucket in Go

The token bucket algorithm is the standard mechanism for API rate limiting. Each tenant has a bucket with a maximum token capacity and a refill rate. Each request consumes one token. When the bucket is empty, the next request is rejected with 429 Too Many Requests. Tokens refill at a steady rate over time.

The golang.org/x/time/rate package, maintained by the Go team outside the standard library, implements a production-quality token bucket:

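import (
    "net/http"
    "time"

    "golang.org/x/time/rate"
)

// Create the limiter once and share it across requests. A limiter built
// inside the handler would start full on every request and never reject.
// One token every 600ms (100 per minute), burst capacity of 100.
var limiter = rate.NewLimiter(rate.Every(time.Minute/100), 100)

// Inside the HTTP handler:
if !limiter.Allow() {
    http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
    return
}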

rate.Every takes the interval between tokens, so rate.Every(time.Minute/100) refills one token every 600 ms, which is 100 tokens per minute. The burst capacity (second argument) is the bucket size: up to 100 tokens can be consumed instantly, after which requests are admitted only as fast as tokens refill. This permits short legitimate bursts without penalizing normal usage patterns.

In-process limiter store with expiry

For a single-instance Go server, store per-tenant limiters in an in-memory map protected by a sync.RWMutex. The map grows with the number of active tenants and needs a cleanup mechanism to evict entries for tenants that have not sent requests recently.

type tenantLimiter struct {
    limiter  *rate.Limiter
    lastSeen time.Time
}

type LimiterStore struct {
    mu       sync.RWMutex
    limiters map[string]*tenantLimiter
}

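// Get returns the limiter for tenantID, creating it on first use.
// It takes the write lock because every call updates lastSeen.
func (s *LimiterStore) Get(tenantID string, rps rate.Limit, burst int) *rate.Limiter {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.limiters == nil {
        s.limiters = make(map[string]*tenantLimiter)
    }
    l, ok := s.limiters[tenantID]
    if !ok {
        l = &tenantLimiter{
            limiter: rate.NewLimiter(rps, burst),
        }
        s.limiters[tenantID] = l
    } else if l.limiter.Limit() != rps || l.limiter.Burst() != burst {
        // Apply a changed plan limit to an already-cached limiter.
        l.limiter.SetLimit(rps)
        l.limiter.SetBurst(burst)
    }
    l.lastSeen = time.Now()
    return l.limiter
}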

func (s *LimiterStore) cleanup() {
    for range time.Tick(5 * time.Minute) {
        s.mu.Lock()
        for id, l := range s.limiters {
            if time.Since(l.lastSeen) > 15*time.Minute {
                delete(s.limiters, id)
            }
        }
        s.mu.Unlock()
    }
}

Start the cleanup goroutine once during server initialization. On a system with 500 active tenants and 15-minute eviction, the map stays small enough that memory pressure is negligible.

Redis-backed rate limiting for multi-instance ECS deployments

The in-process limiter store fails in multi-instance deployments. If three ECS task instances each maintain their own per-tenant limiter map, a tenant with a 100/minute limit can send 100 requests per minute to each instance for a total of 300 requests per minute before any single instance triggers a 429.

For ECS deployments with more than one task instance, move the rate limit counter to Redis. Use an atomic increment with expiry on a per-tenant-per-minute key:

func (r *RedisLimiter) Allow(ctx context.Context, tenantID string, limit int64) (bool, error) {
    minuteKey := fmt.Sprintf("rl:%s:%d", tenantID, time.Now().Unix()/60)
    count, err := r.rdb.Incr(ctx, minuteKey).Result()
    if err != nil {
        // Fail open on Redis unavailability: do not block legitimate traffic
        // because the rate limiter is down
        return true, fmt.Errorf("redis rate limit check failed: %w", err)
    }
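    if count == 1 {
        // First increment in this window: set a TTL so the key is cleaned up.
        // 120s outlives the 60s window with margin. EXPIRE is a separate
        // command from INCR, so report a failure instead of ignoring it.
        if err := r.rdb.Expire(ctx, minuteKey, 120*time.Second).Err(); err != nil {
            return count <= limit, fmt.Errorf("redis rate limit expire failed: %w", err)
        }
    }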
    return count <= limit, nil
}

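The key includes the current Unix minute bucket, so each tenant gets a fresh counter every minute. INCR is atomic, so concurrent task instances never lose an increment. A key is only written during its own 60-second window; the 120-second expiry adds margin for clock skew between instances before Redis reclaims it. Because this is a fixed-window counter, a tenant can send up to twice the limit in a short span that straddles a minute boundary. The fail-open behavior on Redis errors is intentional: refusing all API traffic because the rate limiter service is unavailable would cause a worse outage than the behavior the rate limiter was designed to prevent.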

Standard rate limit response headers

Return rate limit status headers on every API response, not just on 429 responses. Clients use these headers to monitor their own usage and implement backoff logic.

The X-RateLimit-* headers are a widely used convention rather than a formal standard:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1748358060

On 429 responses, add:

Retry-After: 23

X-RateLimit-Reset is a Unix timestamp indicating when the current window resets and the client's quota refills. Retry-After is the number of seconds the client should wait before retrying. Both headers allow clients to implement polite backoff rather than hammering the API with retries.

Storing limits in PostgreSQL, caching in Redis

Rate limits are a tenant configuration property, not a hardcoded constant. Store them in PostgreSQL alongside the tenant's plan settings and cache the resolved limit in Redis with a 5-minute TTL.

CREATE TABLE tenant_rate_limits (
  tenant_id           UUID PRIMARY KEY REFERENCES tenants(id),
  requests_per_minute INTEGER NOT NULL DEFAULT 60,
  burst_size          INTEGER NOT NULL DEFAULT 120,
  updated_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

When a tenant upgrades from Standard to Pro, update the requests_per_minute value in this table. The new limit takes effect once the cached entry expires, at most 5 minutes later, without any server restart or deployment.

The rate limiting middleware reads the tenant identity from the request context (using the tenant context propagation pattern described in our post on tenant context in Go SaaS), looks up the tenant's rate limit configuration from Redis cache, and applies the per-tenant limiter.
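A minimal sketch of such a middleware is shown below. Everything named here is an assumption for illustration: tenantKey stands in for whatever context key the tenant propagation layer actually uses, allowFunc abstracts over the in-process store or the Redis limiter, and responding 401 when no tenant is present is one possible policy, not necessarily the production one. The demoStatuses helper exists only to exercise the middleware with a toy limiter.

```go
package main

import (
	"context"
	"net/http"
	"net/http/httptest"
)

// tenantKey is a hypothetical context key for the tenant ID.
type tenantKey struct{}

// allowFunc reports whether the tenant may proceed. It stands in for
// either the in-process limiter store or the Redis-backed limiter.
type allowFunc func(ctx context.Context, tenantID string) (bool, error)

// rateLimit wraps next with a per-tenant rate limit check.
func rateLimit(allow allowFunc, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tenantID, ok := r.Context().Value(tenantKey{}).(string)
		if !ok || tenantID == "" {
			// Assumption: requests without a tenant are rejected outright.
			http.Error(w, "unknown tenant", http.StatusUnauthorized)
			return
		}
		allowed, err := allow(r.Context(), tenantID)
		if err != nil {
			// Fail open: a limiter error must not block the request.
			allowed = true
		}
		if !allowed {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// demoStatuses sends n requests for one tenant through the middleware,
// using a toy limiter that admits the first limit calls, and returns the
// resulting status codes.
func demoStatuses(n, limit int) []int {
	seen := 0
	allow := func(context.Context, string) (bool, error) {
		seen++
		return seen <= limit, nil
	}
	h := rateLimit(allow, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	codes := make([]int, 0, n)
	for i := 0; i < n; i++ {
		req := httptest.NewRequest(http.MethodGet, "/", nil)
		req = req.WithContext(context.WithValue(req.Context(), tenantKey{}, "tenant-a"))
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, req)
		codes = append(codes, rec.Code)
	}
	return codes
}
```

Calling demoStatuses(3, 2) sends three requests against a limit of two: the first two reach the wrapped handler and the third is answered with 429 by the middleware.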

Key lessons from production

Per-tenant rate limiting is a fairness mechanism before it is a capacity protection mechanism. In-process limiters silently fail in multi-instance ECS deployments: move counter state to Redis. Always fail open on Redis unavailability. Return rate limit headers on every response, not just 429 responses. Store limit configuration in PostgreSQL and cache it in Redis so changes take effect without a deployment.