Infrastructure Pitfalls¶
Known failure modes in infrastructure management. Every pitfall here has either happened in GE or is a known industry failure pattern. Some of these cost real money and real debugging hours.
1. Hot-Patching "Just This Once"¶
What happens: Production is broken. The fix is a one-line change.
Someone decides to kubectl cp the fixed file into the running pod
or SSH in and edit the config. The fix works. The pod restarts
next week and the fix is gone because it was never committed to code.
GE-specific: Python caches modules in sys.modules at startup.
Even if you kubectl cp a fixed Python file into a running executor
pod, the old module is still loaded in memory. The fix does not
take effect until the pod restarts — at which point the copied file
is gone (container filesystem is ephemeral).
This has happened in GE. The lesson was painful enough to become a hard rule: NEVER hot-patch. ALWAYS rebuild the container image.
Rule: There is no "just this once." The pipeline exists for emergencies too. A hotfix branch, a fast build, a fast deploy. The pipeline should be fast enough that bypassing it is never justified.
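The module-cache behaviour described above can be reproduced locally. A minimal sketch (the module name `hotfix_demo` is invented for illustration):

```python
import importlib
import pathlib
import sys
import tempfile

# Create a throwaway module on disk, a stand-in for code baked into an image.
workdir = pathlib.Path(tempfile.mkdtemp())
(workdir / "hotfix_demo.py").write_text("VALUE = 'old'\n")
sys.path.insert(0, str(workdir))

hotfix_demo = importlib.import_module("hotfix_demo")  # now cached in sys.modules

# Simulate `kubectl cp` of a "fixed" file into the running pod:
(workdir / "hotfix_demo.py").write_text("VALUE = 'new'\n")

# The running process still sees the old code; the cache wins until restart.
print(hotfix_demo.VALUE)                             # still 'old'
print(importlib.import_module("hotfix_demo").VALUE)  # also 'old': same cache entry
```

Re-importing returns the cached module object, so the copied file is invisible to the running process, exactly the failure mode described above.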
2. Container Image Tag Reuse¶
What happens: The CI pipeline builds an image tagged v1.2.3.
A bug is found. The developer fixes the bug, rebuilds, and pushes
a new image with the same tag v1.2.3. The old image is overwritten.
Now:
- Rollback to `v1.2.3` gives the new image, not the old one
- Audit trail shows `v1.2.3` deployed, but which `v1.2.3`?
- Kubernetes may not pull the new image because it already has `v1.2.3` cached (`imagePullPolicy: IfNotPresent`)
Prevention:
- Tag images with commit SHA, not semantic version
- Semantic version is metadata in the manifest, not the image tag
- Never overwrite a published image tag
- Use `imagePullPolicy: Always` only in development — staging and production use specific digests
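A manifest fragment sketching the digest-pinning convention (registry and image names are hypothetical; the digest is truncated for readability):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: executor
  labels:
    app.kubernetes.io/version: "1.2.3"   # semver lives in metadata, not the tag
spec:
  template:
    spec:
      containers:
        - name: executor
          # Pinned by digest, so the image can never be silently swapped:
          image: registry.example.com/ge/executor@sha256:3f1a...   # full digest in practice
          imagePullPolicy: IfNotPresent   # safe once pinned by digest
```

Because a digest identifies image content, not a mutable name, rollback and audit both become unambiguous.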
3. Missing Rollback Procedure¶
What happens: A deployment goes wrong. The team decides to roll
back. Nobody documented how to roll back this specific change.
The deployment included a database migration, an infrastructure
change, and a code change. Rolling back the code is easy
(kubectl rollout undo). Rolling back the migration requires
a down migration that was never written. Rolling back the
infrastructure change requires knowing what the previous Terraform
state was.
Prevention:
- Every deployment plan includes a rollback procedure
- Database migrations must include `down` migrations
- Rollback is tested in staging before production deploy
- Rutger does not approve production apply without a tested rollback procedure
4. Database Migration Not Reversible¶
What happens: A migration drops a column, renames a table,
or changes a column type with data loss. When rollback is needed,
the data is gone. The down migration cannot restore what was
destroyed.
GE pattern — two-phase migration:
- Phase 1: Add new column/table, deploy code that writes to both old and new, backfill new from old
- Phase 2 (separate deployment): Deploy code that reads from new only, drop old column/table
Each phase is independently reversible. Data is never destroyed in the same deployment that stops using it.
Rule: If a migration cannot be reversed without data loss, it must be split into reversible phases. Boris and Yoanna enforce this.
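The two-phase pattern can be sketched with an in-memory SQLite database. A hypothetical rename of `users.email` to `users.contact_email` (all names invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.execute("INSERT INTO users (email) VALUES ('a@example.com')")

# Phase 1: add the new column and backfill it from the old one.
# Reversing this phase (dropping contact_email) destroys nothing.
db.execute("ALTER TABLE users ADD COLUMN contact_email TEXT")
db.execute("UPDATE users SET contact_email = email")
# ...deploy code that writes BOTH columns and still reads the old one...

# Phase 2 is a SEPARATE deployment: switch reads to contact_email,
# verify in production, and only then run:
#   ALTER TABLE users DROP COLUMN email
# Until that final DROP, every step remains reversible.
print(db.execute("SELECT contact_email FROM users").fetchone()[0])  # a@example.com
```

The old column survives the entire first deployment, so a rollback at any point in phase 1 loses no data.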
5. Config Drift Between Zones¶
What happens: Staging and production configurations diverge. Staging uses one Redis port, production uses another. Staging has different resource limits. Staging has a feature flag enabled that production does not. Code works in staging, fails in production because of a configuration difference nobody noticed.
GE-specific: Redis port is 6381, not the default 6379.
This is defined in config/ports.yaml. If staging uses 6379 and
production uses 6381, integration tests pass in staging and fail
in production.
Prevention:
- Configuration structure is identical across zones — only values differ
- Zone-specific values are in environment variables or zone-specific ConfigMaps, never in code
- Thijmen verifies staging config matches production structure before approving staging
- Config diffs between zones are reviewed as part of deployment
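The "identical structure, differing values" rule can be checked mechanically. A sketch of a structural diff over config dicts (the example configs are invented; the real check would load the zone ConfigMaps):

```python
def structure_diff(a, b, path=""):
    """Return key paths present in one config but not the other.

    Values are deliberately ignored: only the key structure must match
    across zones, values are allowed to differ per zone.
    """
    diffs = []
    if isinstance(a, dict) and isinstance(b, dict):
        for key in set(a) | set(b):
            here = f"{path}.{key}" if path else key
            if key not in a or key not in b:
                diffs.append(here)
            else:
                diffs.extend(structure_diff(a[key], b[key], here))
    return diffs

staging = {"redis": {"port": 6379}, "features": {"new_ui": True}}
production = {"redis": {"port": 6381}}

print(structure_diff(staging, production))  # ['features']: structural drift
# The differing Redis port (6379 vs 6381) passes this structural check,
# which is exactly why the value diff is still reviewed at deployment.
```

A check like this can run in CI against rendered ConfigMaps so that drift is caught before approval, not after an incident.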
6. Secrets Not in Vault¶
What happens: A developer adds an API key as an environment variable in the Kubernetes manifest. The secret is now in version control. Anyone with repo access can read it. The secret cannot be rotated without a code change and redeployment.
Prevention:
- All secrets in HashiCorp Vault
- Vault injects secrets at runtime via sidecar or init container
- Kubernetes Secrets only contain Vault references, not actual values
- Repo scanning catches secrets in code (pre-commit hook)
- Vault audit log tracks who accessed which secret when
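A pod sketch using the Vault Agent injector annotations (the role name and secret path are hypothetical): the secret is materialized at runtime and never appears in the manifest or in git.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: executor
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "executor"
    vault.hashicorp.com/agent-inject-secret-api-key: "secret/data/ge/api-key"
spec:
  containers:
    - name: executor
      image: ge/executor
      # The injected secret appears as a file at /vault/secrets/api-key;
      # rotation in Vault requires no code change or redeploy.
```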
7. No Image Scanning¶
What happens: A container image is built from a base image with known vulnerabilities. The application code is fine. The operating system libraries in the base image have CVEs. The image is deployed to production with known security holes.
Prevention:
- Scan every image before deployment (CI step)
- Block deployment on critical/high vulnerabilities
- Pin base images to digest (not tag) for reproducibility
- Update base images on a regular schedule
- Use minimal base images (distroless, alpine) to reduce attack surface
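A CI gate sketch, assuming Trivy as the scanner (step name and registry are illustrative): a non-zero exit code on high or critical findings blocks the pipeline before the image can be pushed.

```yaml
- name: Scan image for known CVEs
  run: |
    trivy image --severity HIGH,CRITICAL --exit-code 1 \
      registry.example.com/ge/executor:${GITHUB_SHA}
```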
8. Manual Kubernetes Manifest Edits¶
What happens: Someone edits a Kubernetes manifest directly
with kubectl edit in production. The change works. A week later,
a deployment overwrites the change because the manifest in version
control does not contain it.
Worse: The edited manifest has a configuration that does not work with the next code version. The deployment fails, and nobody understands why because the manifest "has not changed" according to git.
Prevention:
- All manifests in version control
- `kubectl edit` is disabled in production (RBAC)
- Manifests are generated from templates (Helm, Kustomize)
- Drift detection alerts when cluster state diverges from git
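Because Kubernetes RBAC is allow-list only, "disabling" `kubectl edit` means granting production users a role without the write verbs it needs. A sketch (role name illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prod-read-only
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments", "configmaps", "services"]
    verbs: ["get", "list", "watch"]   # no patch/update/delete, so edit fails
```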
9. Ignoring Resource Limits¶
What happens: A container has no resource limits. It consumes all available memory on the node. Other containers on the same node are OOM-killed. The node becomes unresponsive. Kubernetes reschedules pods, but they land on a node that also has the memory-hungry container.
GE-specific: The executor pods run LLM operations that can be memory-intensive. Without limits, a single expensive session can starve the entire node.
Prevention:
- Every container has resource requests AND limits
- Requests ensure scheduling (pod gets guaranteed resources)
- Limits prevent resource starvation (pod cannot exceed)
- HPA capped at maxReplicas: 5 with scaleUp stabilization: 120s
- Monitoring alerts on containers approaching limits
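A sketch of the policy above. The request and limit values are illustrative; `maxReplicas: 5` and the 120s scale-up stabilization window match the stated policy.

```yaml
containers:
  - name: executor
    image: ge/executor
    resources:
      requests: { memory: "2Gi", cpu: "500m" }   # guaranteed at scheduling
      limits:   { memory: "4Gi", cpu: "2" }      # hard ceiling per pod
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: executor
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: executor }
  minReplicas: 1
  maxReplicas: 5
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # damps scale-up thrash
```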
10. Persistent Volume Assumptions¶
What happens: Code writes to the local filesystem expecting persistence. The pod restarts. The data is gone. The code expected a persistent volume but was using the ephemeral container filesystem.
GE-specific: This is why COMP-*.md files (agent completion records) need host cron sync to the database. The files are on the pod's filesystem and disappear when the pod restarts.
Prevention:
- Never rely on container filesystem for persistent data
- Use PersistentVolumeClaims for data that must survive restarts
- Better: write directly to the database (PostgreSQL is SSOT)
- Use Redis for ephemeral data that can be reconstructed
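Where a volume is genuinely needed, a PVC sketch (name and size illustrative). For the completion records above, the stronger fix remains writing to PostgreSQL directly.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: completion-records
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
# Mounted into the pod spec via volumes/volumeMounts, this survives restarts;
# anything written elsewhere in the container filesystem does not.
```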
11. Rolling Update Without Readiness Probes¶
What happens: Kubernetes starts the new pod and immediately routes traffic to it. The pod is still initializing — loading config, connecting to the database, warming caches. Requests fail. Kubernetes does not know because there is no readiness probe.
Prevention:
- Every pod has a readiness probe
- Readiness probe checks actual functionality (not just "process running")
- `minReadySeconds: 30` — pod must be ready for 30s before old pod is terminated
- Liveness probe restarts hung pods
- Startup probe gives slow-starting pods time to initialize
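A container-level sketch of the three probes (endpoint paths and port are hypothetical):

```yaml
readinessProbe:            # gates traffic: not ready means no requests routed
  httpGet: { path: /healthz/ready, port: 8080 }
  periodSeconds: 10
livenessProbe:             # restarts hung pods
  httpGet: { path: /healthz/live, port: 8080 }
  periodSeconds: 30
startupProbe:              # suppresses liveness until slow init completes
  httpGet: { path: /healthz/ready, port: 8080 }
  failureThreshold: 30
  periodSeconds: 10
# On the Deployment spec itself:
# minReadySeconds: 30
```

The readiness endpoint should verify real dependencies (database connection, config loaded), not merely that the process is alive.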
12. Terraform State Corruption¶
What happens: Two people run terraform apply simultaneously.
State file is corrupted. Terraform thinks resources exist that
do not, or thinks resources do not exist that do. Applying again
creates duplicates or destroys production resources.
Prevention:
- Remote state backend with locking (S3 + DynamoDB, or similar)
- State lock prevents concurrent applies
- `terraform plan` reviewed before every apply
- State backup before every apply
- Arjan is the only agent authorized to run Terraform against production
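A backend configuration sketch (bucket, key, region, and table names are hypothetical): S3 stores the state, and the DynamoDB table provides the lock that serializes concurrent applies.

```hcl
terraform {
  backend "s3" {
    bucket         = "ge-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "ge-terraform-locks"   # second apply blocks until the lock frees
    encrypt        = true
  }
}
```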
Summary¶
| Pitfall | Severity | How Caught |
|---|---|---|
| Hot-patching | Critical | Policy, code review, audit trail |
| Tag reuse | Critical | CI enforcement, digest pinning |
| Missing rollback | Critical | Leon's deployment plan review |
| Non-reversible migration | Critical | Boris/Yoanna review |
| Config drift | High | Thijmen staging verification |
| Secrets in code | Critical | Pre-commit scanning, Vault policy |
| No image scanning | High | CI pipeline gate |
| Manual manifest edits | High | RBAC, drift detection |
| Missing resource limits | High | Manifest validation in CI |
| PV assumptions | Medium | Architecture review |
| No readiness probes | High | Manifest validation in CI |
| State corruption | Critical | State locking, access control |
Ownership¶
| Role | Agent | Responsibility |
|---|---|---|
| Deployment Coordinator | Leon | Pipeline integrity, rollback procedures |
| Production Operations | Rutger | Production policy enforcement |
| Infrastructure | Arjan | Terraform safety, state management |
| Kubernetes Operator | Thijmen | Manifest validation, drift detection |
| Sysadmin | Gerco | Host-level security, access control |
| DBA (Alfa) | Boris | Migration reversibility |
| DBA (Bravo) | Yoanna | Migration reversibility |