Most Go backend performance problems in production look identical at first: high CPU under load, memory that climbs and never drops, or latency spikes that disappear before you can diagnose them. Go ships a profiling toolkit that works against live production processes without a restart, and it is the fastest path to the root cause.
What Go's built-in profiler gives you
Go's net/http/pprof package exposes a set of HTTP endpoints from inside a running Go process. Importing it registers the handlers on http.DefaultServeMux, so setup takes two lines:
import _ "net/http/pprof"
// In your server setup, on an internal-only port:
go http.ListenAndServe("localhost:6060", nil)
That exposes:
/debug/pprof/heap - memory allocation profile
/debug/pprof/goroutine - all active goroutines and their stacks
/debug/pprof/profile?seconds=30 - 30-second CPU sample
/debug/pprof/trace?seconds=5 - execution trace
/debug/pprof/block - goroutine blocking profile
/debug/pprof/mutex - mutex contention profile
You capture a profile from a live production process with go tool pprof pointing at the remote endpoint:
go tool pprof -http=:8888 http://prod-host:6060/debug/pprof/heap
This opens an interactive web UI showing the heap allocation graph, a flame graph of allocation call stacks, and a source-level breakdown of which functions allocate the most memory. The process keeps running. No restart required.
Critical security note: the pprof endpoint must never be publicly accessible. Bind it to localhost or an internal VPC address. On ECS, use ECS Exec to port-forward to the task before running pprof.
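One way to keep the profiling endpoints off the public router is to mount them on a dedicated mux that only the internal listener serves. The following is a minimal sketch, not the article's own setup: the helper name newPprofMux is hypothetical and the localhost:6060 address is carried over from the example above. Note that importing net/http/pprof still registers its handlers on http.DefaultServeMux, so this only helps if the public server uses its own mux rather than the default one.

```go
package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

// newPprofMux (hypothetical helper) returns a mux that serves only the
// pprof handlers, so they can be bound to an internal address.
func newPprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	// pprof.Index also serves the named profiles under this prefix:
	// heap, goroutine, block, mutex, and others.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

func main() {
	// Loopback only: reachable through a port-forward, not from the network.
	srv := &http.Server{Addr: "localhost:6060", Handler: newPprofMux()}
	log.Fatal(srv.ListenAndServe())
}
```

In a real service the internal listener would run in its own goroutine next to the public server, as in the two-line example above.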
Diagnosing a memory leak in a Go SaaS backend
A common production pattern in multi-tenant Go SaaS systems: memory climbs steadily over 12 hours, then the ECS task gets OOM-killed. Restarting fixes it temporarily but the pattern repeats every day.
The heap profile from a process near its memory peak shows the allocation source. A real example from a SaaS product operating in Lebanon: the heap profile showed 2.3 GB allocated inside a function called buildReportContext in the reporting service. The call stack revealed a slice being returned by reference from a cache:
// The bug: returning a direct reference to a cached slice
func (c *Cache) GetReportRows(tenantID string) []ReportRow {
	return c.data[tenantID] // caller gets the same underlying array
}
// The caller appends to the slice, writing into the cached backing array when it has spare capacity:
rows := cache.GetReportRows(tenantID)
rows = append(rows, extraRow) // overwrites shared memory if capacity allows, otherwise reallocates
Over thousands of requests, the cached slice grew without bound. The fix was returning a copy:
func (c *Cache) GetReportRows(tenantID string) []ReportRow {
	src := c.data[tenantID]
	result := make([]ReportRow, len(src))
	copy(result, src)
	return result
}
Without the heap profile, this class of bug takes days to find by code review alone. The profile found the exact function and line in under 10 minutes.
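The aliasing behind this bug is easy to reproduce in isolation. The sketch below is illustrative only: it uses a plain []int instead of ReportRow, and the helper names sharedView, copiedView, and appendTwice are hypothetical. It shows that two callers appending to the same returned slice overwrite each other when the cached slice has spare capacity, while callers working on copies do not.

```go
package main

import "fmt"

// sharedView returns the cached slice itself: caller and cache share one
// backing array.
func sharedView(cached []int) []int { return cached }

// copiedView returns an independent copy with its own backing array.
func copiedView(cached []int) []int {
	out := make([]int, len(cached))
	copy(out, cached)
	return out
}

// appendTwice simulates two requests that each fetch the cached rows and
// append one element.
func appendTwice(view func([]int) []int) (a, b []int) {
	cached := make([]int, 2, 4) // len 2, cap 4: room for appends in place
	cached[0], cached[1] = 1, 2
	a = append(view(cached), 10)
	b = append(view(cached), 20)
	return a, b
}

func main() {
	a, b := appendTwice(sharedView)
	fmt.Println(a, b) // [1 2 20] [1 2 20]: the second append overwrote the first

	a, b = appendTwice(copiedView)
	fmt.Println(a, b) // [1 2 10] [1 2 20]: each caller has its own array
}
```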
Understanding goroutine leaks
A goroutine leak happens when a process keeps creating goroutines that never exit. They accumulate over time, each holding at least its own stack (which starts at 2 KB in current Go releases and grows on demand) plus anything it references, and the process eventually runs out of memory or becomes unresponsive.
The goroutine profile shows every live goroutine and its current call stack. A leak looks like thousands of goroutines all blocked at the same site:
# go tool pprof http://localhost:6060/debug/pprof/goroutine
# Result: 8,400 goroutines blocked at:
runtime.gopark
database/sql.(*DB).conn
8,400 goroutines waiting for a database connection means the connection pool is exhausted. Every incoming request spawns a goroutine that blocks waiting for a pool slot that never frees because previous goroutines are also blocking. The system deadlocks under sustained load.
The root cause is usually a missing context cancellation or a transaction that was not rolled back on error. The goroutine profile identifies the symptom; the execution trace helps find the cause. The trace endpoint does not return a pprof profile, so download it and open it with go tool trace:
curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=10'
go tool trace trace.out
The execution trace shows goroutine scheduling, GC pauses, and blocking events over time. It is heavier than a CPU profile but gives a precise timeline of what the runtime was doing during the problem window.
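The leak mechanism described in this section can be reproduced without a database. The sketch below swaps the blocked database/sql call for a blocked channel send, which is an assumption made to keep the example self-contained; all function names are hypothetical. It compares a goroutine that blocks forever when its result is abandoned with one that exits on context cancellation, using runtime.NumGoroutine as a rough stand-in for the goroutine profile.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// leakySend starts a goroutine that blocks forever if nobody receives.
func leakySend() <-chan int {
	ch := make(chan int) // unbuffered: the send waits for a receiver
	go func() { ch <- 42 }()
	return ch
}

// cancellableSend does the same work but gives up when ctx is cancelled.
func cancellableSend(ctx context.Context) <-chan int {
	ch := make(chan int)
	go func() {
		select {
		case ch <- 42:
		case <-ctx.Done():
		}
	}()
	return ch
}

// abandonLeaky requests a result and never reads it.
func abandonLeaky() { _ = leakySend() }

// abandonCancellable requests a result, never reads it, then cancels.
func abandonCancellable() {
	ctx, cancel := context.WithCancel(context.Background())
	_ = cancellableSend(ctx)
	cancel()
}

// leakedBy calls abandon n times, waits up to 500ms for goroutines to
// exit, and returns how many extra goroutines are still alive.
func leakedBy(n int, abandon func()) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		abandon()
	}
	deadline := time.Now().Add(500 * time.Millisecond)
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(10 * time.Millisecond)
	}
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked without cancellation:", leakedBy(50, abandonLeaky))
	fmt.Println("leaked with cancellation:", leakedBy(50, abandonCancellable))
}
```

In a live service the same growth shows up as a steadily rising goroutine count between two captures of /debug/pprof/goroutine.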
CPU profiling under real production load
CPU profiles are most meaningful when captured during actual production load. A 30-second CPU profile sampled at 100 Hz shows an aggregate picture of where CPU time goes.
A real pattern from a Go API in the MENA region: a high-traffic endpoint was doing per-request JSON marshaling of large tenant-specific objects that were essentially identical for all users of the same tenant. The CPU profile showed 60% of CPU time spent in encoding/json.Marshal for that single endpoint. The fix was a short TTL cache keyed on tenant ID that reused the serialized bytes across requests:
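type ResponseCache struct {
	mu      sync.RWMutex
	entries map[string]cachedEntry
}

type cachedEntry struct {
	data      []byte
	expiresAt time.Time
}

// NewResponseCache initializes the map; Set would panic on a nil map.
func NewResponseCache() *ResponseCache {
	return &ResponseCache{entries: make(map[string]cachedEntry)}
}

// Get returns the cached bytes for key if present and not expired.
// Callers must treat the returned slice as read-only.
func (c *ResponseCache) Get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	entry, ok := c.entries[key]
	if !ok || time.Now().After(entry.expiresAt) {
		return nil, false
	}
	return entry.data, true
}

// Set stores data under key until ttl elapses.
func (c *ResponseCache) Set(key string, data []byte, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = cachedEntry{data: data, expiresAt: time.Now().Add(ttl)}
}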
CPU usage on that endpoint dropped by 55% after the change. The profile made the opportunity obvious in a way that no amount of code reading would have.
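How a handler might use such a cache is sketched below: serialize on a miss, reuse the bytes on a hit. This is an illustration under stated assumptions, not the original service's code. The ttlCache type is a compact stand-in for the ResponseCache above, and tenantSettingsJSON, its payload, and the one-minute TTL are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
	"time"
)

type entry struct {
	data      []byte
	expiresAt time.Time
}

// ttlCache is a compact stand-in for the ResponseCache shown above.
type ttlCache struct {
	mu      sync.RWMutex
	entries map[string]entry
}

func newTTLCache() *ttlCache { return &ttlCache{entries: make(map[string]entry)} }

func (c *ttlCache) get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expiresAt) {
		return nil, false
	}
	return e.data, true
}

func (c *ttlCache) set(key string, data []byte, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = entry{data: data, expiresAt: time.Now().Add(ttl)}
}

// tenantSettingsJSON (hypothetical) returns the serialized payload for a
// tenant and reports whether it came from the cache. json.Marshal runs
// only on a miss.
func tenantSettingsJSON(c *ttlCache, tenantID string) (data []byte, hit bool, err error) {
	if cached, ok := c.get(tenantID); ok {
		return cached, true, nil
	}
	// Placeholder payload; a real service would load tenant data here.
	payload := map[string]string{"tenant": tenantID}
	data, err = json.Marshal(payload)
	if err != nil {
		return nil, false, err
	}
	c.set(tenantID, data, time.Minute) // assumed TTL
	return data, false, nil
}

func main() {
	c := newTTLCache()
	for i := 0; i < 2; i++ {
		data, hit, err := tenantSettingsJSON(c, "acme")
		fmt.Println(string(data), hit, err)
	}
}
```

Two concurrent misses for the same tenant will both marshal in this sketch. That is usually acceptable with a short TTL, since the results are identical and the last write wins.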
Continuous profiling in production
Capturing profiles manually requires knowing when to look. A better pattern is continuous profiling: capture a heap profile every 15 minutes and push it to S3. This creates a time-series of profiles you can compare before and after a deployment or during and after an incident.
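// StartContinuousProfiling blocks until ctx is cancelled; run it in its own goroutine.
func StartContinuousProfiling(ctx context.Context, bucket string, s3Client S3Client) {
	ticker := time.NewTicker(15 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case t := <-ticker.C:
			// Upload in the background so a slow upload never delays the next tick.
			go captureHeapProfile(ctx, t, bucket, s3Client)
		}
	}
}

// captureHeapProfile uses runtime/pprof (not net/http/pprof). The output is a
// gzip-compressed protobuf that go tool pprof reads directly.
func captureHeapProfile(ctx context.Context, t time.Time, bucket string, s3 S3Client) {
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return // in production, log the error instead of dropping it
	}
	key := fmt.Sprintf("profiles/heap/%s.pb.gz", t.UTC().Format("2006-01-02T15-04-05"))
	s3.PutObject(ctx, bucket, key, &buf)
}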
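The snippet above depends on an S3Client type that the article does not define. One possible shape, offered purely as an assumption, is a one-method interface, which also makes the capture path testable without AWS. In the sketch below the interface signature, the memoryStore fake, uploadHeapProfile, captureOnce, and the bucket and key names are all hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"runtime/pprof"
	"sync"
)

// S3Client is an assumed interface; the real client's signature may differ.
type S3Client interface {
	PutObject(ctx context.Context, bucket, key string, body io.Reader) error
}

// memoryStore is an in-memory fake used in place of S3.
type memoryStore struct {
	mu      sync.Mutex
	objects map[string][]byte
}

func (m *memoryStore) PutObject(_ context.Context, bucket, key string, body io.Reader) error {
	data, err := io.ReadAll(body)
	if err != nil {
		return err
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.objects[bucket+"/"+key] = data
	return nil
}

// uploadHeapProfile writes a heap profile and uploads it, returning
// errors to the caller instead of dropping them.
func uploadHeapProfile(ctx context.Context, client S3Client, bucket, key string) error {
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return fmt.Errorf("write heap profile: %w", err)
	}
	if err := client.PutObject(ctx, bucket, key, &buf); err != nil {
		return fmt.Errorf("upload %s: %w", key, err)
	}
	return nil
}

// captureOnce runs one capture against the fake and returns the stored bytes.
func captureOnce() ([]byte, error) {
	store := &memoryStore{objects: make(map[string][]byte)}
	const bucket, key = "example-bucket", "profiles/heap/example.pb.gz"
	if err := uploadHeapProfile(context.Background(), store, bucket, key); err != nil {
		return nil, err
	}
	return store.objects[bucket+"/"+key], nil
}

func main() {
	data, err := captureOnce()
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	// The profile is gzip-compressed, so it starts with the bytes 0x1f 0x8b.
	isGzip := len(data) > 1 && data[0] == 0x1f && data[1] == 0x8b
	fmt.Printf("stored %d bytes, gzip header: %v\n", len(data), isGzip)
}
```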
Teams running SaaS products across MENA that have implemented continuous profiling report finding and fixing memory regressions within one deploy cycle rather than waiting for an OOM crash.
Blocking and mutex profiles for concurrency problems
Two profiles that are underused but extremely valuable when dealing with concurrency bottlenecks:
The blocking profile (/debug/pprof/block) shows goroutines that spent time waiting on channels, select statements, or sync primitives. If a Go service has spare CPU but throughput is lower than expected, the blocking profile often reveals goroutines queuing at a shared channel that is too narrow.
The mutex profile (/debug/pprof/mutex) shows which mutexes in the code are most contended. In a multi-tenant SaaS, a single global mutex protecting a shared cache is a common culprit that the mutex profile identifies immediately.
Both profiles are disabled by default to avoid overhead. Enable them when needed:
runtime.SetBlockProfileRate(1) // sample every blocking event
runtime.SetMutexProfileFraction(5) // sample 1 in 5 mutex events
Disable them again once profiling is complete by passing 0 to both functions.
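To make the enable and disable steps hard to forget, they can be paired in one helper that returns its own cleanup function. This is a small sketch with hypothetical names (enableContentionProfiling, mutexFraction) that uses the same rates as above. It relies on two documented behaviors of the runtime package: SetMutexProfileFraction returns the previous setting, and a negative argument reads the current value without changing it.

```go
package main

import (
	"fmt"
	"runtime"
)

// enableContentionProfiling turns on block and mutex profiling and returns
// a function that turns them off again.
func enableContentionProfiling() (disable func()) {
	runtime.SetBlockProfileRate(1)                  // sample every blocking event
	prevMutex := runtime.SetMutexProfileFraction(5) // sample 1 in 5 on average
	return func() {
		runtime.SetBlockProfileRate(0) // 0 disables block profiling
		runtime.SetMutexProfileFraction(prevMutex)
	}
}

// mutexFraction reads the current mutex sampling fraction without changing it.
func mutexFraction() int { return runtime.SetMutexProfileFraction(-1) }

func main() {
	disable := enableContentionProfiling()
	fmt.Println("mutex fraction while profiling:", mutexFraction())
	// ... capture /debug/pprof/block and /debug/pprof/mutex here ...
	disable()
	fmt.Println("mutex fraction after disable:", mutexFraction())
}
```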
Key lessons from production
Register the pprof endpoint on every Go service from day one. The handlers do no work until they are requested, so the overhead when not actively profiling is negligible. The value when you need it is hours of debugging saved.
Memory leaks in Go most often come from three sources: slice mutation through cached references, maps that grow without eviction, or goroutines leaked by uncancelled contexts. The heap and goroutine profiles between them find all three in minutes.
CPU profiles captured under real production load reveal bottlenecks that synthetic benchmarks miss completely. Schedule a profiling session during peak traffic, not only during local testing.
Not sure where to start?
Voxire builds and maintains production Go backend systems for SaaS companies in Lebanon and across the MENA region. If your backend has performance problems you cannot diagnose, or you want profiling infrastructure built in from the start, reach out at https://voxire.com/get-a-quote/



