Infrastructure Pitfalls¶
Known failure modes in infrastructure management. Every pitfall here has either happened in GE or is a known industry failure pattern. Some of these cost real money and real debugging hours.
1. Hot-Patching "Just This Once"¶
What happens: Production is broken. The fix is a one-line change.
Someone decides to kubectl cp the fixed file into the running pod
or SSH in and edit the config. The fix works. The pod restarts
next week and the fix is gone because it was never committed to code.
GE-specific: Python caches modules in sys.modules at startup.
Even if you kubectl cp a fixed Python file into a running executor
pod, the old module is still loaded in memory. The fix does not
take effect until the pod restarts — at which point the copied file
is gone (container filesystem is ephemeral).
This has happened in GE. The lesson was painful enough to become a hard rule: NEVER hot-patch. ALWAYS rebuild the container image.
Rule: There is no "just this once." The pipeline exists for emergencies too. A hotfix branch, a fast build, a fast deploy. The pipeline should be fast enough that bypassing it is never justified.
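The module-cache behaviour described above can be reproduced locally. A minimal sketch (the module name `hotfix_demo` is invented for illustration):

```python
import importlib
import pathlib
import sys
import tempfile

# Create a throwaway module on disk, a stand-in for code baked into an image.
workdir = pathlib.Path(tempfile.mkdtemp())
(workdir / "hotfix_demo.py").write_text("VALUE = 'old'\n")
sys.path.insert(0, str(workdir))

hotfix_demo = importlib.import_module("hotfix_demo")  # now cached in sys.modules

# Simulate `kubectl cp` of a "fixed" file into the running pod:
(workdir / "hotfix_demo.py").write_text("VALUE = 'new'\n")

# The running process still sees the old code; the cache wins until restart.
print(hotfix_demo.VALUE)                             # still 'old'
print(importlib.import_module("hotfix_demo").VALUE)  # also 'old': same cache entry
```

Re-importing returns the cached module object, so the copied file is invisible to the running process, exactly the failure mode described above.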
2. Container Image Tag Reuse¶
What happens: The CI pipeline builds an image tagged v1.2.3.
A bug is found. The developer fixes the bug, rebuilds, and pushes
a new image with the same tag v1.2.3. The old image is overwritten.
Now:
- Rollback to `v1.2.3` gives the new image, not the old one
- Audit trail shows `v1.2.3` deployed, but which `v1.2.3`?
- Kubernetes may not pull the new image because it already has `v1.2.3` cached (`imagePullPolicy: IfNotPresent`)
Prevention:
- Tag images with commit SHA, not semantic version
- Semantic version is metadata in the manifest, not the image tag
- Never overwrite a published image tag
- Use `imagePullPolicy: Always` only in development — staging and production use specific digests
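A manifest fragment sketching the digest-pinning convention (registry and image names are hypothetical; the digest is truncated for readability):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: executor
  labels:
    app.kubernetes.io/version: "1.2.3"   # semver lives in metadata, not the tag
spec:
  template:
    spec:
      containers:
        - name: executor
          # Pinned by digest, so the image can never be silently swapped:
          image: registry.example.com/ge/executor@sha256:3f1a...   # full digest in practice
          imagePullPolicy: IfNotPresent   # safe once pinned by digest
```

Because a digest identifies image content, not a mutable name, rollback and audit both become unambiguous.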
3. Missing Rollback Procedure¶
What happens: A deployment goes wrong. The team decides to roll
back. Nobody documented how to roll back this specific change.
The deployment included a database migration, an infrastructure
change, and a code change. Rolling back the code is easy
(kubectl rollout undo). Rolling back the migration requires
a down migration that was never written. Rolling back the
infrastructure change requires knowing what the previous Terraform
state was.
Prevention:
- Every deployment plan includes a rollback procedure
- Database migrations must include `down` migrations
- Rollback is tested in staging before production deploy
- Rutger does not approve production apply without a tested rollback procedure
4. Database Migration Not Reversible¶
What happens: A migration drops a column, renames a table,
or changes a column type with data loss. When rollback is needed,
the data is gone. The down migration cannot restore what was
destroyed.
GE pattern — two-phase migration:
- Phase 1: Add new column/table, deploy code that writes to both old and new, backfill new from old
- Phase 2 (separate deployment): Deploy code that reads from new only, drop old column/table
Each phase is independently reversible. Data is never destroyed in the same deployment that stops using it.
Rule: If a migration cannot be reversed without data loss, it must be split into reversible phases. Boris and Yoanna enforce this.
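The two-phase pattern can be sketched with an in-memory SQLite database. A hypothetical rename of `users.email` to `users.contact_email` (all names invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.execute("INSERT INTO users (email) VALUES ('a@example.com')")

# Phase 1: add the new column and backfill it from the old one.
# Reversing this phase (dropping contact_email) destroys nothing.
db.execute("ALTER TABLE users ADD COLUMN contact_email TEXT")
db.execute("UPDATE users SET contact_email = email")
# ...deploy code that writes BOTH columns and still reads the old one...

# Phase 2 is a SEPARATE deployment: switch reads to contact_email,
# verify in production, and only then run:
#   ALTER TABLE users DROP COLUMN email
# Until that final DROP, every step remains reversible.
print(db.execute("SELECT contact_email FROM users").fetchone()[0])  # a@example.com
```

The old column survives the entire first deployment, so a rollback at any point in phase 1 loses no data.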
5. Config Drift Between Zones¶
What happens: Staging and production configurations diverge. Staging uses one Redis port, production uses another. Staging has different resource limits. Staging has a feature flag enabled that production does not. Code works in staging, fails in production because of a configuration difference nobody noticed.
GE-specific: Redis port is 6381, not the default 6379.
This is defined in config/ports.yaml. If staging uses 6379 and
production uses 6381, integration tests pass in staging and fail
in production.
Prevention:
- Configuration structure is identical across zones — only values differ
- Zone-specific values are in environment variables or zone-specific ConfigMaps, never in code
- Thijmen verifies staging config matches production structure before approving staging
- Config diffs between zones are reviewed as part of deployment
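The "identical structure, differing values" rule can be checked mechanically. A sketch of a structural diff over config dicts (the example configs are invented; the real check would load the zone ConfigMaps):

```python
def structure_diff(a, b, path=""):
    """Return key paths present in one config but not the other.

    Values are deliberately ignored: only the key structure must match
    across zones, values are allowed to differ per zone.
    """
    diffs = []
    if isinstance(a, dict) and isinstance(b, dict):
        for key in set(a) | set(b):
            here = f"{path}.{key}" if path else key
            if key not in a or key not in b:
                diffs.append(here)
            else:
                diffs.extend(structure_diff(a[key], b[key], here))
    return diffs

staging = {"redis": {"port": 6379}, "features": {"new_ui": True}}
production = {"redis": {"port": 6381}}

print(structure_diff(staging, production))  # ['features']: structural drift
# The differing Redis port (6379 vs 6381) passes this structural check,
# which is exactly why the value diff is still reviewed at deployment.
```

A check like this can run in CI against rendered ConfigMaps so that drift is caught before approval, not after an incident.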
6. Secrets Not in Vault¶
What happens: A developer adds an API key as an environment variable in the Kubernetes manifest. The secret is now in version control. Anyone with repo access can read it. The secret cannot be rotated without a code change and redeployment.
Prevention:
- All secrets in HashiCorp Vault
- Vault injects secrets at runtime via sidecar or init container
- Kubernetes Secrets only contain Vault references, not actual values
- Repo scanning catches secrets in code (pre-commit hook)
- Vault audit log tracks who accessed which secret when
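A pod sketch using the Vault Agent injector annotations (the role name and secret path are hypothetical): the secret is materialized at runtime and never appears in the manifest or in git.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: executor
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "executor"
    vault.hashicorp.com/agent-inject-secret-api-key: "secret/data/ge/api-key"
spec:
  containers:
    - name: executor
      image: ge/executor
      # The injected secret appears as a file at /vault/secrets/api-key;
      # rotation in Vault requires no code change or redeploy.
```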
7. No Image Scanning¶
What happens: A container image is built from a base image with known vulnerabilities. The application code is fine. The operating system libraries in the base image have CVEs. The image is deployed to production with known security holes.
Prevention:
- Scan every image before deployment (CI step)
- Block deployment on critical/high vulnerabilities
- Pin base images to digest (not tag) for reproducibility
- Update base images on a regular schedule
- Use minimal base images (distroless, alpine) to reduce attack surface
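A CI gate sketch, assuming Trivy as the scanner (step name and registry are illustrative): a non-zero exit code on high or critical findings blocks the pipeline before the image can be pushed.

```yaml
- name: Scan image for known CVEs
  run: |
    trivy image --severity HIGH,CRITICAL --exit-code 1 \
      registry.example.com/ge/executor:${GITHUB_SHA}
```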
8. Manual Kubernetes Manifest Edits¶
What happens: Someone edits a Kubernetes manifest directly
with kubectl edit in production. The change works. A week later,
a deployment overwrites the change because the manifest in version
control does not contain it.
Worse: The edited manifest has a configuration that does not work with the next code version. The deployment fails, and nobody understands why because the manifest "has not changed" according to git.
Prevention:
- All manifests in version control
- `kubectl edit` is disabled in production (RBAC)
- Manifests are generated from templates (Helm, Kustomize)
- Drift detection alerts when cluster state diverges from git
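Because Kubernetes RBAC is allow-list only, "disabling" `kubectl edit` means granting production users a role without the write verbs it needs. A sketch (role name illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prod-read-only
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments", "configmaps", "services"]
    verbs: ["get", "list", "watch"]   # no patch/update/delete, so edit fails
```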
9. Ignoring Resource Limits¶
What happens: A container has no resource limits. It consumes all available memory on the node. Other containers on the same node are OOM-killed. The node becomes unresponsive. Kubernetes reschedules pods, but they land on a node that also has the memory-hungry container.
GE-specific: The executor pods run LLM operations that can be memory-intensive. Without limits, a single expensive session can starve the entire node.
Prevention:
- Every container has resource requests AND limits
- Requests ensure scheduling (pod gets guaranteed resources)
- Limits prevent resource starvation (pod cannot exceed)
- HPA capped at maxReplicas: 5 with scaleUp stabilization: 120s
- Monitoring alerts on containers approaching limits
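A sketch of the policy above. The request and limit values are illustrative; `maxReplicas: 5` and the 120s scale-up stabilization window match the stated policy.

```yaml
containers:
  - name: executor
    image: ge/executor
    resources:
      requests: { memory: "2Gi", cpu: "500m" }   # guaranteed at scheduling
      limits:   { memory: "4Gi", cpu: "2" }      # hard ceiling per pod
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: executor
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: executor }
  minReplicas: 1
  maxReplicas: 5
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # damps scale-up thrash
```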
10. Persistent Volume Assumptions¶
What happens: Code writes to the local filesystem expecting persistence. The pod restarts. The data is gone. The code expected a persistent volume but was using the ephemeral container filesystem.
GE-specific: This is why COMP-*.md files (agent completion records) need host cron sync to the database. The files are on the pod's filesystem and disappear when the pod restarts.
Prevention:
- Never rely on container filesystem for persistent data
- Use PersistentVolumeClaims for data that must survive restarts
- Better: write directly to the database (PostgreSQL is SSOT)
- Use Redis for ephemeral data that can be reconstructed
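Where a volume is genuinely needed, a PVC sketch (name and size illustrative). For the completion records above, the stronger fix remains writing to PostgreSQL directly.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: completion-records
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
# Mounted into the pod spec via volumes/volumeMounts, this survives restarts;
# anything written elsewhere in the container filesystem does not.
```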
11. Rolling Update Without Readiness Probes¶
What happens: Kubernetes starts the new pod and immediately routes traffic to it. The pod is still initializing — loading config, connecting to the database, warming caches. Requests fail. Kubernetes does not know because there is no readiness probe.
Prevention:
- Every pod has a readiness probe
- Readiness probe checks actual functionality (not just "process running")
- `minReadySeconds: 30` — pod must be ready for 30s before old pod is terminated
- Liveness probe restarts hung pods
- Startup probe gives slow-starting pods time to initialize
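A container-level sketch of the three probes (endpoint paths and port are hypothetical):

```yaml
readinessProbe:            # gates traffic: not ready means no requests routed
  httpGet: { path: /healthz/ready, port: 8080 }
  periodSeconds: 10
livenessProbe:             # restarts hung pods
  httpGet: { path: /healthz/live, port: 8080 }
  periodSeconds: 30
startupProbe:              # suppresses liveness until slow init completes
  httpGet: { path: /healthz/ready, port: 8080 }
  failureThreshold: 30
  periodSeconds: 10
# On the Deployment spec itself:
# minReadySeconds: 30
```

The readiness endpoint should verify real dependencies (database connection, config loaded), not merely that the process is alive.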
12. Terraform State Corruption¶
What happens: Two people run terraform apply simultaneously.
State file is corrupted. Terraform thinks resources exist that
do not, or thinks resources do not exist that do. Applying again
creates duplicates or destroys production resources.
Prevention:
- Remote state backend with locking (S3 + DynamoDB, or similar)
- State lock prevents concurrent applies
- `terraform plan` reviewed before every apply
- State backup before every apply
- Arjan is the only agent authorized to run Terraform against production
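A backend configuration sketch (bucket, key, region, and table names are hypothetical): S3 stores the state, and the DynamoDB table provides the lock that serializes concurrent applies.

```hcl
terraform {
  backend "s3" {
    bucket         = "ge-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "ge-terraform-locks"   # second apply blocks until the lock frees
    encrypt        = true
  }
}
```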
Summary¶
| Pitfall | Severity | How Caught |
|---|---|---|
| Hot-patching | Critical | Policy, code review, audit trail |
| Tag reuse | Critical | CI enforcement, digest pinning |
| Missing rollback | Critical | Leon's deployment plan review |
| Non-reversible migration | Critical | Boris/Yoanna review |
| Config drift | High | Thijmen staging verification |
| Secrets in code | Critical | Pre-commit scanning, Vault policy |
| No image scanning | High | CI pipeline gate |
| Manual manifest edits | High | RBAC, drift detection |
| Missing resource limits | High | Manifest validation in CI |
| PV assumptions | Medium | Architecture review |
| No readiness probes | High | Manifest validation in CI |
| State corruption | Critical | State locking, access control |
Ownership¶
| Role | Agent | Responsibility |
|---|---|---|
| Deployment Coordinator | Leon | Pipeline integrity, rollback procedures |
| Production Operations | Rutger | Production policy enforcement |
| Infrastructure | Arjan | Terraform safety, state management |
| Kubernetes Operator | Thijmen | Manifest validation, drift detection |
| Sysadmin | Gerco | Host-level security, access control |
| DBA (Alfa) | Boris | Migration reversibility |
| DBA (Bravo) | Yoanna | Migration reversibility |