Pitfall: Reactive Debugging Without Architecture Context

The Anti-Pattern

Diagnosing CI/CD, deployment, or infrastructure failures by chasing error messages (fix error → next error → next error) instead of first consulting the architecture documentation to understand:
- What infrastructure actually exists
- What the deployment flow is designed to be
- What the machine capabilities are
- What services should be running

Incident: 2026-04-09

A full session was spent debugging CI pipeline failures reactively. The actual root causes were:
1. An expired ArgoCD token (5-minute fix)
2. Admin-UI pod in CrashLoopBackOff (missing .next build)

Instead of diagnosing from the architecture doc (production-deployment-architecture.md), the session produced:
- A failed self-contained DAST approach (commit + revert). It was unnecessary because DAST was already correctly configured to scan the deployed app.
- An incorrect "needs dedicated runner" conclusion. fort-knox-dev IS the dedicated development environment (8 cores / 16 threads, 64 GB RAM).
- Six deferrals of "deploy:staging fix — separate concern". The fix was a 5-minute ArgoCD token refresh.

The Rule

BEFORE diagnosing any infrastructure issue:

1. Read ge-ops/wiki/docs/development/standards/production-deployment-architecture.md
2. Check actual infrastructure state: kubectl get pods, service health endpoints
3. Verify the deployment target is running (admin-ui, orchestrator, etc.)
4. THEN look at CI job logs
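
The infrastructure-state check above can be made less error-prone with a small filter. The following is a minimal sketch, not part of the original runbook: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Sketch: read `kubectl get pods --no-headers` output on stdin and print
# every pod that is neither Completed nor Running with all containers ready.
# unhealthy_pods is a hypothetical helper, not an existing tool.
unhealthy_pods() {
  awk '{
    # READY looks like "1/1"; split it into ready and total counts.
    split($2, ready, "/")
    if ($3 == "Completed") next
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
  }'
}
```

Typical use would be `kubectl get pods -n ge-system --no-headers | unhealthy_pods`. Empty output suggests the deployment targets are up, so CI job logs become the next place to look; a line such as a pod in CrashLoopBackOff points at the infrastructure rather than the pipeline.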

Verification Checklist

When a CI pipeline fails on deploy/integration/e2e:

[ ] Is the target app running? (kubectl get pods -n ge-system)
[ ] Is ArgoCD healthy? (kubectl get pods -n argocd)
[ ] Is the ArgoCD token valid? (check ARGOCD_AUTH_TOKEN in CI variables)
[ ] Is the app reachable? (curl http://admin-ui.ge.internal/api/system/health)
[ ] Is DNS working from pods? (check the CoreDNS coredns-custom ConfigMap)
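
The token check in the list above can be done locally before touching the pipeline. The sketch below assumes ARGOCD_AUTH_TOKEN holds a standard three-part JWT with an `exp` claim and that `base64 -d` is available; `jwt_exp` and `token_is_valid` are hypothetical helper names. A token without an `exp` claim is reported as not valid here, which is a deliberate simplification.

```shell
# Sketch: print the exp claim (Unix seconds) of a JWT, or nothing if absent.
# Assumes a three-part JWT; jwt_exp is a hypothetical helper.
jwt_exp() {
  # Take the payload segment and map base64url characters to plain base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops '=' padding; restore it so base64 can decode.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d 2>/dev/null |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeeds only if the token has an exp claim that is still in the future.
token_is_valid() {
  exp=$(jwt_exp "$1")
  [ -n "$exp" ] && [ "$exp" -gt "$(date +%s)" ]
}
```

For example, `token_is_valid "$ARGOCD_AUTH_TOKEN" || echo "ArgoCD token expired or unreadable"`. This only inspects the expiry claim; it does not verify the signature or confirm that the server still accepts the token.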

Key Facts (read, don't guess)

fort-knox-dev = 8 cores / 16 threads, 64 GB RAM, 1 TB SSD
fort-knox-dev IS the development/staging environment — there is no separate staging
The CI runner runs ON this machine with concurrent = 10
Admin-UI deployment uses a hostPath mount (/home/claude/ge-bootstrap/admin-ui → /app)
A .next build must exist on disk under that host path for npm start to work
DAST scans the deployed app at admin-ui.ge.internal, not a self-built server
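
Because Admin-UI runs from a hostPath mount, a missing build can be detected on the host without inspecting the pod. This is a minimal sketch assuming a standard Next.js layout, in which a production build writes `.next/BUILD_ID`; `has_next_build` is a hypothetical helper name.

```shell
# Sketch: succeed only if the given directory holds a Next.js production build.
# Assumes the standard layout where the build writes .next/BUILD_ID.
# has_next_build is a hypothetical helper.
has_next_build() {
  [ -f "$1/.next/BUILD_ID" ]
}
```

For example, `has_next_build /home/claude/ge-bootstrap/admin-ui || echo "no .next build on disk"`. If the check fails, a CrashLoopBackOff on the Admin-UI pod is the expected symptom rather than a separate problem.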

Captured from INC-20260409 session. Applies to all agents and future Claude sessions.