Deployment Pipeline¶
The full deploy chain from merged code to running production. Every step is automated, auditable, and reversible.
Pipeline Overview¶
Code Merged (Marta/Iwona approval)
↓
Leon (Deployment Coordinator) — orchestrates entire chain
↓
Boris / Yoanna (DB Migration) — schema changes first
↓
Arjan (Infrastructure) — Terraform apply if infra changes
↓
Container Build (CI) — image built, tagged, scanned
↓
Thijmen (Staging Verify) — deploy to staging, verify
↓
Rutger (Production Apply) — deploy to production
↓
Stef (Network) — DNS, certificates, routing
↓
Karel (CDN) — edge cache invalidation, asset deploy
↓
Otto (Backup) — post-deploy backup verification
Step 1: Deployment Orchestration¶
Owner: Leon (Deployment Coordinator)
Input: Merged PR with approved changes
Output: Deployment plan with step sequence
What Leon does¶
Leon receives notification of a merged PR and creates a deployment plan. The plan determines:
- Which steps are needed (not every deploy needs DB migration or infra changes)
- The order of execution
- Rollback procedures for each step
- Success criteria for each step
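Leon's step selection can be sketched as a pure function of what the merged PR touches. This is an illustrative sketch, not the real plan schema: the flag names and step identifiers are assumptions.

```python
# Sketch: derive the step sequence from what the merged PR actually changes.
# Flag names and step identifiers are illustrative assumptions.
def plan_steps(has_migrations: bool, has_terraform: bool, has_network: bool,
               has_assets: bool) -> list[str]:
    steps: list[str] = []
    if has_migrations:
        steps.append("db_migration")        # Step 2: schema changes first
    if has_terraform:
        steps.append("infrastructure")      # Step 3: terraform apply
    steps += ["container_build", "staging_verify", "production_apply"]
    if has_network:
        steps.append("network")             # Step 7: DNS, certs, routing
    if has_assets:
        steps.append("cdn")                 # Step 8: edge cache, assets
    steps.append("backup_verify")           # Step 9: always runs
    return steps
```

A code-only change thus skips straight from build to staging verification, while a full-stack change walks the entire chain.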
What blocks this step¶
- PR not approved by Marta/Iwona (merge gate)
- Container image build failed
- Previous deployment still in progress
- Active incident in production
Rollback¶
Leon coordinates rollback across all steps. If any downstream step fails, Leon triggers rollback in reverse order.
Step 2: Database Migration¶
Owner: Boris (DBA, Alfa) / Yoanna (DBA, Bravo)
Input: Migration files from the merged PR
Output: Schema changes applied to the target environment
What Boris/Yoanna do¶
Database migrations run before application deployment. This ensures the new code deploys against the new schema. The migration process:
- Review migration SQL — verify it is reversible
- Run migration on staging — verify it succeeds
- Verify data integrity — run constraint checks
- Run migration on production — within a maintenance window if destructive
- Verify production schema — matches expected state
Migration rules¶
- Every migration must be reversible. Include both `up` and `down` SQL. If a migration cannot be reversed (e.g., dropping a column with data), document the data recovery procedure.
- Never run destructive migrations without a backup. Otto verifies that a backup exists before Boris/Yoanna proceed.
- Migrations are additive first. Add the new column, deploy new code, then remove the old column in a separate deployment. Never rename/remove and deploy simultaneously.
- Use Drizzle migrations. Migrations live in `drizzle/migrations/`. No hand-written SQL is applied directly.
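An additive, reversible migration under these rules might look like the following sketch. The table and column names are illustrative, not from the real schema.

```sql
-- up: add the new column alongside the existing data (additive; old code keeps working)
ALTER TABLE users ADD COLUMN display_name text;

-- down: reverse the change exactly
ALTER TABLE users DROP COLUMN display_name;
```

Because the change is additive, the old application version and the new one can both run against the migrated schema, which is what makes the separate "remove the old column" deployment safe.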
What blocks this step¶
- Migration is not reversible and no recovery procedure documented
- Backup not verified by Otto
- Migration fails on staging
- Data integrity check fails after migration
Rollback¶
Run the down migration. If the down migration fails, restore from the pre-migration backup (Otto provides it).
Step 3: Infrastructure Changes¶
Owner: Arjan (Infrastructure Provisioner)
Input: Terraform changes from the merged PR
Output: Infrastructure updated to match desired state
What Arjan does¶
When the deployment includes infrastructure changes (new services, scaling changes, network policy updates), Arjan applies them via Terraform:
- `terraform plan` — review the diff, verify no unintended changes
- Apply to staging — verify the infrastructure is functional
- Apply to production — after staging verification
Infrastructure-as-code rules¶
- All infrastructure is defined in Terraform. No ClickOps, no manual `kubectl apply`, no imperative commands.
- State is remote. Terraform state is stored centrally, not on anyone's machine.
- Plan before apply. Every `terraform apply` is preceded by a `terraform plan` review. No blind applies.
- Blast radius limits. Changes affecting more than 5 resources require human approval.
What blocks this step¶
- `terraform plan` shows unintended changes
- Staging infrastructure verification fails
- Blast radius exceeds limit without human approval
- State lock held by another process
Rollback¶
Re-apply the previous known-good Terraform configuration. Because state is stored remotely and versioned, rollback is deterministic.
Step 4: Container Build¶
Owner: CI pipeline (automated)
Input: Source code from the merged commit
Output: Tagged container image, security-scanned
What happens¶
- Build — Multi-stage Docker build from Dockerfile
- Tag — Image tagged with the commit SHA (never `latest`)
- Scan — Security scan for known vulnerabilities
- Import — Image imported into the k3s container runtime (`docker save | k3s ctr images import`)
Build rules¶
- Build runs in CI, not on developer machines
- Base images pinned to digest
- Build cache used for speed, but cache is never shipped
- Image size enforced (no build tools in runtime image)
- Scan must pass — no critical or high vulnerabilities
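A minimal multi-stage Dockerfile consistent with these rules might look like this sketch. It assumes a Node.js service; the paths and commands are illustrative, and a real build would pin the base image to a digest.

```dockerfile
# Build stage: toolchain and dev dependencies stay here and never ship.
# In practice, pin the base image to a digest (e.g., node:20-alpine@sha256:...).
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production dependencies and built output only — no build tools.
FROM node:20-alpine
ENV NODE_ENV=production
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
USER node
CMD ["node", "dist/server.js"]
```

The runtime stage never sees the compiler or dev dependencies, which is what keeps the image size down and the scan surface small.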
What blocks this step¶
- Dockerfile syntax error
- Build failure (dependency resolution, compilation)
- Security scan finds critical/high vulnerability
- Image exceeds size limit
Rollback¶
Not applicable — the image exists or it does not. If the build fails, deployment stops.
Step 5: Staging Verification¶
Owner: Thijmen (Kubernetes Operator)
Input: Container image, deployment manifests
Output: Staging environment running and verified
What Thijmen does¶
Thijmen deploys to the staging environment and runs verification:
- Deploy — Apply Kubernetes manifests to staging namespace
- Health check — All pods running, readiness probes passing
- Smoke tests — Core functionality works end-to-end
- Integration tests — Full test suite against staging
- Performance baseline — Response times within expected range
- Compare — Staging behavior matches previous staging deployment
Staging rules¶
- Staging mirrors production topology. Same services, same configuration structure (different values), same network policies.
- Staging uses production-equivalent data. Anonymized production data snapshot, not synthetic data.
- Staging runs for at least 15 minutes before production deploy is approved. Some issues only appear under sustained load.
What blocks this step¶
- Pod crash loops
- Readiness probe failures
- Smoke test failures
- Integration test failures
- Performance regression (> 20% slower than baseline)
Rollback¶
Redeploy previous staging manifests. Staging rollback is practice for production rollback.
Step 6: Production Apply¶
Owner: Rutger (Production Operations Engineer)
Input: Staging-verified container image and manifests
Output: Production running the new version
What Rutger does¶
Rutger is the final human gate. Before applying to production:
- Verify image match — Production image tag matches staging
- Check maintenance window — Deploy during low-traffic period for high-risk changes
- Verify rollback procedure — Tested in staging
- Apply — Rolling update to production
- Monitor — Watch health metrics for 15 minutes post-deploy
- Declare success — Or trigger rollback
Rolling update strategy¶
- maxSurge: 1 — At most 1 extra pod during update
- maxUnavailable: 0 — Never fewer pods than desired
- minReadySeconds: 30 — Pod must be healthy for 30s before old pod is terminated
- progressDeadlineSeconds: 300 — Fail if update takes > 5 min
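The strategy above maps directly onto a Kubernetes Deployment spec. A fragment (deployment name illustrative; selector and pod template omitted):

```yaml
# Rolling update settings matching the rules above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  minReadySeconds: 30            # pod must stay healthy 30s before an old pod is removed
  progressDeadlineSeconds: 300   # fail the rollout if it takes longer than 5 minutes
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # at most 1 extra pod during the update
      maxUnavailable: 0          # never fewer pods than desired
```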
What blocks this step¶
- Staging verification not passed
- Active incident in production
- No rollback procedure documented
- Deploy outside maintenance window for high-risk changes
- Otto has not confirmed backup exists
Rollback¶
Rollback to the previous ReplicaSet. The previous container image is still in the registry. Rollback takes < 60 seconds.
Step 7: Network Configuration¶
Owner: Stef (Network + DNS + Certificates Engineer)
Input: Network changes from the deployment plan
Output: DNS records, TLS certificates, routing rules updated
What Stef does¶
When the deployment includes network changes:
- DNS updates — New records, changed records, TTL adjustments
- TLS certificates — New certificates issued, renewals processed
- Ingress rules — Routing updates for new endpoints
- Network policies — Firewall rules for new services
- Load balancer — Backend pool updates
Network rules¶
- DNS through TransIP API — no manual DNS edits
- Certificates via Let's Encrypt — automated issuance and renewal
- All traffic TLS — no plaintext HTTP in production
- Network policies default-deny — explicit allow rules only
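The default-deny posture can be expressed as a single baseline NetworkPolicy per namespace (namespace name illustrative); each service then gets its own explicit allow rules on top.

```yaml
# Baseline: deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}        # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress             # no rules listed under either type, so all traffic is denied
```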
What blocks this step¶
- DNS propagation failure
- Certificate issuance failure
- Network policy blocks required traffic
- Load balancer health check failure
Rollback¶
Revert DNS records (propagation is TTL-dependent), revert ingress rules, and revert network policies. Certificate rollback is not needed; the previous certificate remains valid.
Step 8: CDN and Edge¶
Owner: Karel (Edge Platform Specialist)
Input: Static assets and cache invalidation requirements
Output: Edge caches updated, assets deployed
What Karel does¶
- Asset deployment — Push new static assets to CDN origin
- Cache invalidation — Purge stale cached content
- Edge rules — Update edge-side routing or transformation rules
- Verification — Confirm assets are served from edge with correct headers and content
CDN rules¶
- EU-only routing — bunny.net routing filter ensures traffic stays within EU (see EU Data Sovereignty)
- Cache-busting via content hash — filenames include hash, eliminating stale cache issues
- Immutable assets — Static files are never overwritten, only new versions are deployed
- Origin shield — Reduce origin load by routing through a shield PoP
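Cache-busting via content hash can be sketched as follows: the filename is derived from the file's contents, so any change produces a new name and a stale edge cache can never serve the old bytes under the new name. The file names and hash length here are illustrative conventions.

```python
# Sketch: content-hashed asset names make deployed files effectively immutable.
import hashlib
from pathlib import Path

def hashed_name(path: Path, length: int = 8) -> Path:
    """Return the path renamed to include a content-hash segment, e.g. app.1a2b3c4d.js."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()[:length]
    return path.with_name(f"{path.stem}.{digest}{path.suffix}")

asset = Path("app.js")
asset.write_text('console.log("hello");\n')  # stand-in for a real build artifact
asset.rename(hashed_name(asset))
```

Because the hashed name is unique to the content, old versions can stay on the origin forever, which is also what makes the CDN rollback in Step 8 trivial.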
What blocks this step¶
- Asset verification failure (wrong content, missing files)
- Cache invalidation incomplete
- Edge rules syntax error
Rollback¶
Deploy previous asset version. Since assets are content-hashed and never overwritten, the previous version is still on the origin.
Step 9: Backup Verification¶
Owner: Otto (Backup Guardian + BCP)
Input: Post-deployment production state
Output: Backup verification report
What Otto does¶
After every production deployment, Otto verifies:
- Pre-deploy backup exists — Taken before Step 6
- Post-deploy backup runs — Captures the new production state
- Backup restore test — Verifies the backup can be restored (monthly full test, per-deploy spot check)
- Retention compliance — Backup retention meets policy
- Cross-zone replication — Backup exists in a different availability zone
Backup rules¶
- Pre-deploy backup is mandatory. Rutger will not apply without Otto's confirmation.
- Database and persistent volumes are backed up
- Backup encryption — All backups encrypted at rest
- Backup location — EU-only storage (see EU Data Sovereignty)
- Retention — 30 daily, 12 monthly, 7 yearly
What blocks this step¶
- Backup creation failure
- Backup verification failure
- Restore test failure (monthly)
Rollback¶
Not applicable to this step — Otto verifies, does not deploy. If backup verification fails, the deployment is flagged for rollback review.
Three-Zone Separation¶
GE operates three zones with strict boundaries:
Development¶
- Local k3s cluster on developer machines
- Full stack runs locally
- No access to staging or production data
- Free to experiment, break things, iterate
Staging¶
- k3s cluster mirroring production topology
- Anonymized production data snapshot
- Same network policies as production
- Performance testing runs here
- No customer-facing traffic
Production¶
- k3s production cluster
- Real customer data, real traffic
- Changes only through deployment pipeline
- No direct access — all operations through tooling
- 24/7 monitoring and alerting
Zone boundaries¶
| What | Dev | Staging | Prod |
|---|---|---|---|
| Data | Synthetic | Anonymized prod | Real |
| Access | Open | Restricted | Pipeline only |
| Changes | Direct | Pipeline | Pipeline |
| Monitoring | Optional | Required | Required + alerting |
| Backups | None | Daily | Pre/post-deploy + daily |
Pipeline Timing¶
| Step | Owner | Typical Duration | Parallel |
|---|---|---|---|
| Orchestration | Leon | 1 min | No |
| DB Migration | Boris/Yoanna | 2-10 min | No |
| Infrastructure | Arjan | 3-15 min | No |
| Container Build | CI | 3-5 min | With Step 2-3 |
| Staging Verify | Thijmen | 15-30 min | No |
| Production Apply | Rutger | 5-10 min | No |
| Network | Stef | 2-5 min | With Step 6 |
| CDN | Karel | 2-5 min | With Step 6 |
| Backup Verify | Otto | 5-10 min | After Step 6 |
Total typical deployment time: 30-60 minutes for standard changes. Critical changes with extended staging verification: up to 2 hours.
Ownership¶
| Role | Agent | Responsibility |
|---|---|---|
| Deployment Coordinator | Leon | Pipeline orchestration, sequencing |
| Production Operations | Rutger | Production apply, monitoring, rollback |
| DBA (Alfa) | Boris | Database migrations |
| DBA (Bravo) | Yoanna | Database migrations |
| Infrastructure | Arjan | Terraform, infrastructure-as-code |
| Kubernetes Operator | Thijmen | Staging deploy and verification |
| Network Engineer | Stef | DNS, TLS, routing, network policies |
| Edge Specialist | Karel | CDN, edge caching, asset deployment |
| Backup Guardian | Otto | Pre/post-deploy backups, restore testing |
| Sysadmin | Gerco | Host-level operations, OS updates |