
Pitfall: Reactive Debugging Without Architecture Context

The Anti-Pattern

Diagnosing CI/CD, deployment, or infrastructure failures by chasing error messages (fix error → next error → next error) instead of first consulting the architecture documentation to understand:

  • What infrastructure actually exists
  • What the deployment flow is designed to be
  • What the machine capabilities are
  • What services should be running

Incident: 2026-04-09

A full session was spent debugging CI pipeline failures reactively. The actual root causes were:

  1. An expired ArgoCD token (5-minute fix)
  2. Admin-UI pod in CrashLoopBackOff (missing .next build)

Instead of diagnosing from the architecture doc (production-deployment-architecture.md), the session produced:

  • A failed self-contained DAST approach (commit + revert). It was unnecessary because DAST was already correctly configured to scan the deployed app.
  • An incorrect "needs dedicated runner" conclusion. fort-knox-dev IS the dedicated development environment (8 cores / 16 threads, 64 GB RAM).
  • 6 deferrals of the deploy:staging fix as a "separate concern". The actual fix was a 5-minute ArgoCD token refresh.

The Rule

BEFORE diagnosing any infrastructure issue:

  1. Read ge-ops/wiki/docs/development/standards/production-deployment-architecture.md
  2. Check actual infrastructure state: kubectl get pods, service health endpoints
  3. Verify the deployment target is running (admin-ui, orchestrator, etc.)
  4. THEN look at CI job logs
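
Steps 2 and 3 can be scripted so that pod state is checked before anyone opens a CI log. The following is a minimal Python sketch, not the project's own tooling: it parses the text printed by kubectl get pods --no-headers and lists every pod that is not Running and fully Ready. The function names and the sample pod names are hypothetical, and it assumes kubectl is on the PATH with access to the cluster.

```python
import subprocess


def unhealthy_pods(kubectl_output: str) -> list[tuple[str, str]]:
    """Return (pod name, status) for every pod that looks unhealthy.

    Expects the output of `kubectl get pods --no-headers`, whose first
    three columns are NAME, READY (for example 1/1) and STATUS.
    """
    problems = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, ready, status = fields[0], fields[1], fields[2]
        if status == "Completed":
            continue  # finished Jobs are not a failure
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            problems.append((name, status))
    return problems


def pods_in_namespace(namespace: str) -> str:
    """Run kubectl and return its raw output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "--no-headers"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling unhealthy_pods(pods_in_namespace("ge-system")) returns an empty list when every pod is healthy. In a cluster state like the 2026-04-09 incident, it would be expected to report the Admin-UI pod with status CrashLoopBackOff.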

Verification Checklist

When a CI pipeline fails on deploy/integration/e2e:

  • [ ] Is the target app running? (kubectl get pods -n ge-system)
  • [ ] Is ArgoCD healthy? (kubectl get pods -n argocd)
  • [ ] Is the ArgoCD token valid? (check ARGOCD_AUTH_TOKEN in CI variables)
  • [ ] Is the app reachable? (curl http://admin-ui.ge.internal/api/system/health)
  • [ ] Is DNS working from pods? (check CoreDNS and its coredns-custom ConfigMap)
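
The token check in the list above can be done locally before any pipeline is re-run. The sketch below assumes the value of ARGOCD_AUTH_TOKEN is a JWT carrying a standard exp claim, which should be confirmed for this installation. It decodes the payload without verifying the signature, so it only answers "has this token expired by time?" and not "does the server still accept it?". The function names are hypothetical.

```python
import base64
import json
import time


def jwt_expiry(token: str):
    """Return the `exp` claim (Unix seconds) of a JWT, or None if absent."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT: expected three dot-separated parts")
    payload_b64 = parts[1]
    # JWTs strip base64 padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")


def token_is_expired(token: str, now=None) -> bool:
    """True if the token has an `exp` claim that is already in the past."""
    exp = jwt_expiry(token)
    if exp is None:
        return False  # no exp claim: the token does not expire by time
    current = time.time() if now is None else now
    return current >= exp
```

For example, token_is_expired(os.environ["ARGOCD_AUTH_TOKEN"]) could be run (after import os) in a shell where the variable is exported. A False result does not prove the token is valid, because a token can also be revoked or belong to a deleted account.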

Key Facts (read, don't guess)

  • fort-knox-dev = 8 cores / 16 threads, 64 GB RAM, 1 TB SSD
  • fort-knox-dev IS the development/staging environment — there is no separate staging
  • The CI runner runs ON this machine with concurrent = 10
  • The Admin-UI deployment uses a hostPath mount (/home/claude/ge-bootstrap/admin-ui/app)
  • That directory must contain a .next build on disk for npm start to work
  • DAST scans the deployed app at admin-ui.ge.internal, not a self-built server
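
The .next requirement above is cheap to verify on the host before investigating a CrashLoopBackOff. The sketch below is a minimal example with a hypothetical function name. It assumes that a completed Next.js production build leaves a BUILD_ID file inside .next, and that the check runs on the machine that owns the hostPath directory.

```python
from pathlib import Path


def has_next_build(app_dir: str) -> bool:
    """True if app_dir contains a .next directory with a BUILD_ID file."""
    return (Path(app_dir) / ".next" / "BUILD_ID").is_file()
```

On fort-knox-dev this would be called as has_next_build("/home/claude/ge-bootstrap/admin-ui/app"). A False result points to a missing or incomplete build rather than a CI problem.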

Captured from INC-20260409 session. Applies to all agents and future Claude sessions.