
DOMAIN:INCIDENT_RESPONSE:PRODUCTION_DEBUGGING

OWNER: sandro (backend), tobias (frontend)
UPDATED: 2026-03-24
SCOPE: diagnosing production issues during incidents
AGENTS: sandro, tobias, mira


DEBUG:BACKEND_PATTERNS

LOG_ANALYSIS

TOOL: get recent error logs from k8s pod
RUN: kubectl logs deployment/<name> -n <namespace> --since=1h | grep -i error

TOOL: get logs from crashed pod
RUN: kubectl logs deployment/<name> -n <namespace> --previous

TOOL: stream live logs
RUN: kubectl logs -f deployment/<name> -n <namespace>

TOOL: search across all pods in namespace
RUN: kubectl logs -l app=<label> -n <namespace> --since=30m --all-containers

PATTERN: structured log search
IF: logs are JSON THEN: kubectl logs deployment/<name> -n <namespace> | jq -R 'fromjson? | select(.level == "error")' (the -R/fromjson? form skips non-JSON lines such as startup banners instead of aborting)
IF: looking for specific request THEN: filter by request_id or trace_id
IF: looking for timing THEN: filter by timestamp range
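The jq filter above can be replicated in-process when jq isn't installed in the debug environment. A minimal Node sketch, assuming newline-delimited JSON logs with a `level` field (field name is an assumption — match it to your log schema):

```javascript
// Minimal sketch: filter error-level entries from newline-delimited JSON logs.
// Assumes each log line is a JSON object with a "level" field; non-JSON lines
// (startup banners, stack-trace continuations) are skipped, not fatal.
function filterErrors(logText) {
  const errors = [];
  for (const line of logText.split("\n")) {
    if (!line.trim()) continue;
    try {
      const entry = JSON.parse(line);
      if (entry.level === "error") errors.push(entry);
    } catch {
      // plain-text line — skip it rather than crash the filter
    }
  }
  return errors;
}
```

Pipe `kubectl logs ... | node filter.js` into it, or paste a log excerpt during an incident.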

REQUEST_TRACING

CHECK: does the request have a trace/correlation ID?
IF: yes THEN: search all service logs for that ID
IF: no THEN: correlate by timestamp + endpoint + client IP

PATTERN: trace a failed request
1. Find the error in application logs (timestamp, endpoint, error message)
2. Check upstream: did the request reach the service? (ingress logs)
3. Check downstream: did the service call external dependencies? (DB, Redis, APIs)
4. Check timing: was the request slow before it failed? (timeout vs immediate error)
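The steps above amount to stitching one request's timeline from several services' logs. A minimal sketch, assuming entries carry `trace_id`, `ts` (ISO timestamp), and `service` fields — these names are assumptions, adjust to your schema:

```javascript
// Minimal sketch: collect every log entry for one trace ID across services
// and order them by timestamp, giving the request's end-to-end timeline.
// Field names ("trace_id", "ts", "service") are assumed — match your schema.
function traceTimeline(entries, traceId) {
  return entries
    .filter((e) => e.trace_id === traceId)
    .sort((a, b) => new Date(a.ts) - new Date(b.ts));
}
```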

TOOL: check ingress logs
RUN: kubectl logs -n kube-system -l app.kubernetes.io/name=traefik --since=1h | grep "<path>"

DB_QUERY_ANALYSIS

TOOL: check active queries (PostgreSQL)
RUN: psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;"

TOOL: check long-running queries
RUN: psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds';"

TOOL: cancel or kill a runaway query
RUN: psql -c "SELECT pg_cancel_backend(<pid>);"
RUN: psql -c "SELECT pg_terminate_backend(<pid>);"
CAUTION: try pg_cancel_backend first (cancels the query but keeps the connection); only terminate after confirming the query is the problem and won't leave data in an inconsistent state

TOOL: check table locks
RUN: psql -c "SELECT blocked_locks.pid AS blocked_pid, blocked_activity.query AS blocked_query, blocking_locks.pid AS blocking_pid, blocking_activity.query AS blocking_query FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.relation = blocked_locks.relation AND blocking_locks.pid != blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.granted;"

TOOL: check connection count
RUN: psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

REDIS_DEBUGGING

TOOL: check Redis connectivity
RUN: redis-cli -p 6381 -a <password> ping

TOOL: monitor Redis commands in real-time (CAUTION: performance impact, use briefly)
RUN: redis-cli -p 6381 -a <password> monitor
RULE: never run MONITOR in production for more than 30 seconds — it logs ALL commands

TOOL: check Redis memory
RUN: redis-cli -p 6381 -a <password> info memory

TOOL: check Redis stream length (detect backlog)
RUN: redis-cli -p 6381 -a <password> xlen triggers.<agent>
RUN: redis-cli -p 6381 -a <password> xlen ge:work:incoming

TOOL: check consumer group lag
RUN: redis-cli -p 6381 -a <password> xinfo groups triggers.<agent>
RUN: redis-cli -p 6381 -a <password> xpending triggers.<agent> <group>

TOOL: check Redis slow log
RUN: redis-cli -p 6381 -a <password> slowlog get 10


DEBUG:FRONTEND_PATTERNS

BROWSER_CONSOLE

CHECK: JavaScript errors in console (red messages)
CHECK: failed network requests (4xx, 5xx, CORS errors)
CHECK: React error boundaries triggered (error overlay in dev, blank components in prod)
CHECK: console warnings about deprecated APIs, missing keys, prop type mismatches

NETWORK_TAB

PATTERN: diagnose slow page load
1. Open Network tab, reload page
2. CHECK: waterfall — is one request blocking others?
3. CHECK: large payloads — any response > 1MB?
4. CHECK: many requests — more than 50 on initial load?
5. CHECK: slow TTFB — server processing time > 500ms?

PATTERN: diagnose failed API call
1. Find the failed request in Network tab
2. CHECK: request headers — auth token present? correct content-type?
3. CHECK: request payload — correct shape? required fields present?
4. CHECK: response status — 401 (auth), 403 (perms), 404 (endpoint), 422 (validation), 500 (server)
5. CHECK: response body — error message from server
6. CHECK: timing — did it timeout? (net::ERR_TIMED_OUT)
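The status check in step 4 is mechanical enough to encode. A sketch of that mapping, useful when triaging many failed calls from a HAR export or log dump:

```javascript
// Maps an HTTP status to the failure class used in step 4 above.
function classifyApiFailure(status) {
  if (status === 401) return "auth";
  if (status === 403) return "perms";
  if (status === 404) return "endpoint";
  if (status === 422) return "validation";
  if (status >= 500) return "server";
  if (status >= 400) return "client"; // other 4xx: malformed request, rate limit, etc.
  return "ok";
}
```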

REACT_DEVTOOLS

CHECK: component re-render frequency (Profiler tab)
CHECK: state/props values on broken component (Components tab)
CHECK: context values reaching the component
CHECK: missing error boundary catching errors silently

ERROR_BOUNDARIES

IF: white screen in production THEN: likely uncaught error in React render
CHECK: error boundary logs — should capture component stack
IF: no error boundary THEN: error goes to window.onerror only
RULE: every route-level component must have an error boundary
RULE: error boundaries must report errors to monitoring, not just show fallback UI
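The "report to monitoring, not just show fallback UI" rule can be sketched framework-agnostically. A minimal sketch — `reportToMonitoring` is a hypothetical stand-in for your real monitoring client, and in React this logic lives in a class component's componentDidCatch:

```javascript
// Sketch of the reporting rule: wrap a render function so failures are both
// reported to monitoring AND replaced by a fallback — never silently swallowed.
// reportToMonitoring is a hypothetical stand-in for your monitoring client.
function withErrorReporting(render, fallback, reportToMonitoring) {
  return (props) => {
    try {
      return render(props);
    } catch (err) {
      reportToMonitoring({ error: String(err), stack: err.stack });
      return fallback; // user sees fallback UI, on-call sees the error
    }
  };
}
```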


DEBUG:COMMON_FAILURE_MODES

MEMORY_LEAK_NODEJS

SYMPTOMS: RSS memory grows over time, OOMKilled pods, increasing GC pauses
CAUSES:
- Event listener accumulation (addEventListener without removeEventListener)
- Unclosed streams/connections
- Global variable accumulation (caching without eviction)
- Closure references holding large objects
- Next.js SSR: per-request state leaking to module scope
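The last cause above in miniature — per-request state written to module scope survives every request and grows forever. All names here are hypothetical, for illustration only:

```javascript
// LEAK: module-scope array grows on every request and is never evicted.
const seenUsers = []; // module scope — survives across requests
function handleRequestLeaky(userId) {
  seenUsers.push({ userId, big: new Array(1000).fill(0) });
  return seenUsers.length;
}

// FIX: keep per-request state inside the handler; it becomes garbage
// as soon as the handler returns.
function handleRequestFixed(userId) {
  const requestState = { userId, big: new Array(1000).fill(0) };
  return requestState.userId;
}
```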

TOOL: check pod memory usage
RUN: kubectl top pods -n <namespace>

TOOL: check OOMKill events
RUN: kubectl get events -n <namespace> --field-selector reason=OOMKilling

TOOL: get Node.js heap snapshot of the running process
RUN: kubectl exec -n <namespace> <pod> -- kill -USR2 1
NOTE: requires the app to be started with --heapsnapshot-signal=SIGUSR2; running node -e "require('v8').writeHeapSnapshot()" inside the pod would snapshot a fresh node process, not your app

FIX: set memory limits in k8s deployment (requests AND limits)
FIX: use WeakMap/WeakRef for caches
FIX: ensure SSR handlers don't write to module-level variables

ANTI_PATTERN: increasing memory limit to "fix" a leak
FIX: find and fix the leak — higher limit just delays the OOMKill

CONNECTION_POOL_EXHAUSTION_POSTGRESQL

SYMPTOMS: "too many connections" errors, new connections hang, queries timeout
CAUSES:
- Pool size too small for concurrent load
- Connections not returned to pool (missing .release() or uncaught errors)
- Long-running transactions holding connections
- Connection leak in error paths (try without finally)

TOOL: check current connections
RUN: psql -c "SELECT count(*) FROM pg_stat_activity;"
RUN: psql -c "SHOW max_connections;"

TOOL: check connections by application
RUN: psql -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;"

TOOL: check idle connections
RUN: psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

FIX: pool size = (2 * CPU cores) + effective_spindle_count (for disk-based systems)
FIX: for SSR apps with many pods: pool_size_per_pod = max_connections / num_pods - 5 (buffer)
FIX: set idle timeout on pool (30s default, reduce to 10s under pressure)
FIX: add connection pool monitoring (log pool.totalCount, pool.idleCount, pool.waitingCount)

RULE: every database operation must use connection pooling — never raw pg.Client
RULE: connection acquisition must have a timeout (5s max)
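The "connections not returned in error paths" cause has one canonical fix: release in finally. A minimal in-memory sketch — this is not the pg pool API, just the shape of the pattern:

```javascript
// Minimal in-memory pool to illustrate the release-in-finally pattern.
// Not the pg API — just the shape: acquire, use inside try, release in
// finally so error paths cannot leak connections.
class TinyPool {
  constructor(size) {
    this.free = size;
  }
  acquire() {
    if (this.free === 0) throw new Error("pool exhausted");
    this.free -= 1;
    return { release: () => { this.free += 1; } };
  }
}

function withConnection(pool, work) {
  const conn = pool.acquire();
  try {
    return work(conn);
  } finally {
    conn.release(); // runs even when work() throws
  }
}
```

Without the finally, one throwing query permanently removes a connection from a pool of this size — the "try without finally" leak listed above.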

N_PLUS_1_QUERIES

SYMPTOMS: page load scales linearly with data count, many similar queries in logs
CAUSES:
- ORM lazy loading (fetch parent, then N queries for children)
- Loop with DB call inside
- GraphQL resolvers fetching per-field

TOOL: detect N+1 in PostgreSQL
RUN: psql -c "SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY calls DESC LIMIT 20;"
LOOK_FOR: same query pattern with high call count

FIX: use eager loading / JOIN / include in ORM query
FIX: use DataLoader pattern for GraphQL
FIX: batch queries: WHERE id IN (...) instead of N individual queries
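The batching fix in miniature: collect the ids and issue one WHERE id IN (...) style fetch instead of N round trips. `batchFetch` is a hypothetical stand-in for your real query function:

```javascript
// Sketch of the batching fix: one round trip for N lookups.
// batchFetch is a stand-in for a real "SELECT ... WHERE id IN (...)" query
// that returns rows with an "id" field.
function loadMany(ids, batchFetch) {
  const unique = [...new Set(ids)]; // dedupe before hitting the DB
  const rows = batchFetch(unique);  // single batched fetch
  const byId = new Map(rows.map((r) => [r.id, r]));
  return ids.map((id) => byId.get(id)); // preserve caller's order
}
```

DataLoader applies the same idea, additionally coalescing lookups that happen within one event-loop tick — which is what makes it fit per-field GraphQL resolvers.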

RACE_CONDITIONS

SYMPTOMS: intermittent failures, data inconsistency, "works on retry"
CAUSES:
- Read-then-write without lock (check-then-act)
- Concurrent updates to same resource
- Event ordering assumptions violated
- Double-submit on form/button

TOOL: detect concurrent access in PostgreSQL
RUN: psql -c "SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';"

FIX: use database transactions with appropriate isolation level
FIX: use SELECT FOR UPDATE for read-then-write patterns
FIX: use optimistic locking (version column) for infrequent conflicts
FIX: use pessimistic locking (FOR UPDATE) for frequent conflicts
FIX: frontend: disable button on submit, use request dedup
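The optimistic-locking fix in miniature. An in-memory sketch whose version check mirrors `UPDATE t SET ..., version = version + 1 WHERE id = $1 AND version = $2` — a conflict shows up as zero rows updated, and the caller re-reads and retries:

```javascript
// In-memory sketch of optimistic locking with a version column: the update
// applies only if the caller's version still matches the stored one.
function optimisticUpdate(store, id, expectedVersion, patch) {
  const row = store.get(id);
  if (!row || row.version !== expectedVersion) {
    return { ok: false }; // conflict: someone updated first — re-read and retry
  }
  store.set(id, { ...row, ...patch, version: row.version + 1 });
  return { ok: true };
}
```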

CERTIFICATE_EXPIRY

SYMPTOMS: TLS handshake failures, "certificate expired" errors, HTTPS failures
CAUSES:
- cert-manager misconfiguration
- DNS challenge failing (for Let's Encrypt)
- Certificate not renewed before expiry

TOOL: check certificate expiry
RUN: kubectl get certificates -A
RUN: openssl s_client -connect <host>:443 -servername <host> 2>/dev/null | openssl x509 -noout -dates

TOOL: check cert-manager logs
RUN: kubectl logs -n cert-manager deployment/cert-manager --since=1h

FIX: set up monitoring alert for certificates expiring within 14 days
FIX: verify cert-manager RBAC and DNS challenge permissions

DNS_RESOLUTION_FAILURES

SYMPTOMS: "ENOTFOUND" errors, intermittent connection failures, service discovery failures
CAUSES:
- CoreDNS pod crashed or overloaded
- ndots:5 causing excessive DNS lookups in k8s
- DNS cache TTL too short under load
- External DNS provider issues

TOOL: check CoreDNS health
RUN: kubectl get pods -n kube-system -l k8s-app=kube-dns
RUN: kubectl logs -n kube-system -l k8s-app=kube-dns --since=30m

TOOL: test DNS from inside pod
RUN: kubectl exec -n <namespace> <pod> -- nslookup <service>.<namespace>.svc.cluster.local

TOOL: check DNS resolution time
RUN: kubectl exec -n <namespace> <pod> -- sh -c "time nslookup <hostname>" (time is a shell builtin, so wrap it in sh -c)

FIX: use FQDN with trailing dot for external domains in k8s (avoids ndots search)
FIX: add dnsConfig.options ndots:2 to pod spec (reduces search domain queries)
FIX: check CoreDNS cache and forward configs
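The ndots fix above as a pod spec fragment — a sketch using the standard Kubernetes dnsConfig fields:

```yaml
# Pod spec fragment: lower ndots so external single-label lookups don't walk
# the full cluster search-domain list before trying the absolute name.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```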


DEBUG:K8S_SPECIFIC

TOOL: check pod status and events
RUN: kubectl describe pod <pod> -n <namespace>

TOOL: check recent events in namespace
RUN: kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -20

TOOL: check resource usage
RUN: kubectl top pods -n <namespace>
RUN: kubectl top nodes

TOOL: check pod restart count
RUN: kubectl get pods -n <namespace> -o wide
LOOK_FOR: RESTARTS column > 0

TOOL: exec into pod for debugging
RUN: kubectl exec -it -n <namespace> <pod> -- /bin/sh
RULE: never modify production data from exec shell — read only

TOOL: port-forward for local debugging
RUN: kubectl port-forward -n <namespace> svc/<service> <local-port>:<service-port>