DOMAIN:INCIDENT_RESPONSE:HOTFIX_PROCEDURES
OWNER: sandro (backend), tobias (frontend)
UPDATED: 2026-03-24
SCOPE: production hotfixes during and after incidents
AGENTS: sandro, tobias, mira (approval)
HOTFIX:BRANCHING
RULE: hotfix branches from main (production), NEVER from develop
RULE: branch naming: hotfix/INC-YYYY-NNNN-short-description
RULE: after merge to main, ALSO merge to develop (prevent regression on next release)
FLOW:
main ──────┬──────────── merge hotfix ──────── main
           │                  ↑
           └── hotfix/INC-* ──┘
                              │
develop ────────────────── merge hotfix ────── develop
TOOL: create hotfix branch
RUN: git checkout main && git pull origin main && git checkout -b hotfix/INC-YYYY-NNNN-description
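EXAMPLE (sketch, not existing tooling): the naming rule and the branch-creation command above could be wrapped in a small guard so that a malformed branch name is rejected before any git command runs. The function names `is_valid_hotfix_branch` and `create_hotfix_branch` are hypothetical, and the pattern assumes the description part uses only lowercase letters, digits, and hyphens, which the naming rule itself does not state.

```shell
# Sketch only: function names are hypothetical, not existing project tooling.

# Succeeds if the name matches hotfix/INC-YYYY-NNNN-short-description.
# Assumption: the description uses lowercase letters, digits, and hyphens.
is_valid_hotfix_branch() {
  printf '%s\n' "$1" |
    grep -Eq '^hotfix/INC-[0-9]{4}-[0-9]{4}-[a-z0-9]+(-[a-z0-9]+)*$'
}

# Validates the name, then runs the documented branch-creation commands.
create_hotfix_branch() {
  if ! is_valid_hotfix_branch "$1"; then
    echo "invalid hotfix branch name: $1" >&2
    return 1
  fi
  git checkout main && git pull origin main && git checkout -b "$1"
}
```

USAGE: create_hotfix_branch hotfix/INC-2026-0042-fix-login-timeout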
TOOL: complete hotfix merge
RUN: git checkout main && git pull origin main && git merge --no-ff hotfix/INC-YYYY-NNNN-description && git push origin main
RUN: git checkout develop && git pull origin develop && git merge --no-ff hotfix/INC-YYYY-NNNN-description && git push origin develop
RUN: git branch -d hotfix/INC-YYYY-NNNN-description
ANTI_PATTERN: branching from develop for hotfix — includes unreleased features, unpredictable behavior
FIX: always branch from main — main IS production
ANTI_PATTERN: forgetting to merge hotfix back to develop
FIX: hotfix checklist includes develop merge as mandatory step
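EXAMPLE (sketch, not existing tooling): one way to guard against the forgotten develop merge is to refuse to delete the hotfix branch until its tip is contained in both main and develop. The sketch below uses `git merge-base --is-ancestor` for that check; the function names `hotfix_fully_merged` and `delete_hotfix_branch` are hypothetical, and it assumes local `main` and `develop` branches are up to date.

```shell
# Sketch only: function names are hypothetical, not existing project tooling.

# Succeeds only if the hotfix branch tip is already contained in both
# main and develop, i.e. both mandatory merges have happened.
hotfix_fully_merged() {
  local branch="$1" target
  for target in main develop; do
    if ! git merge-base --is-ancestor "$branch" "$target"; then
      echo "$branch is not merged into $target" >&2
      return 1
    fi
  done
}

# Deletes the local hotfix branch only after both merges are confirmed.
delete_hotfix_branch() {
  hotfix_fully_merged "$1" && git branch -d "$1"
}
```

USAGE: delete_hotfix_branch hotfix/INC-YYYY-NNNN-description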
HOTFIX:EXPEDITED_CODE_REVIEW
RULE: hotfixes MUST still be reviewed — but review scope is narrowed
MUST_CHECK (non-negotiable even under time pressure)
- [ ] Does the fix address the root cause (not just symptoms)?
- [ ] Does it introduce any security risk? (injection, auth bypass, data exposure)
- [ ] Does it affect data integrity? (migrations, writes, deletes)
- [ ] Is the change scoped tightly to the incident? (no unrelated changes bundled)
- [ ] Are there any obvious performance regressions? (N+1 queries, missing indexes)
MAY_SKIP (during SEV1/SEV2 only, must be addressed in follow-up)
- Code style / formatting nits
- Test coverage for edge cases (core path test still required)
- Documentation updates
- Refactoring suggestions
- Type completeness (if TypeScript, basic types still required)
REVIEW_FLOW
IF: SEV1 THEN: single reviewer (mira or senior dev), review during PR creation, 15min max
IF: SEV2 THEN: single reviewer, 30min max
IF: SEV3/SEV4 THEN: normal review process
RULE: expedited review generates a follow-up ticket for full review within 1 week
RULE: reviewer must explicitly acknowledge they performed expedited (not full) review
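EXAMPLE (sketch, not existing tooling): the review flow above is a small lookup from severity to review mode. The helper name `review_policy` and its output format are hypothetical; the reviewer count, time limits, and the one-week follow-up come from the rules above.

```shell
# Sketch only: review_policy is a hypothetical helper name.
# Prints the review mode, reviewer count, time limit in minutes, and the
# deadline in days for the follow-up full-review ticket.
review_policy() {
  case "$1" in
    SEV1) echo "expedited reviewers=1 max_minutes=15 follow_up_days=7" ;;
    SEV2) echo "expedited reviewers=1 max_minutes=30 follow_up_days=7" ;;
    SEV3 | SEV4) echo "normal" ;;
    *)
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}
```

USAGE: review_policy SEV1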
HOTFIX:EXPEDITED_TESTING
MANDATORY_TESTS (all severities)
- SMOKE_TEST: does the fix resolve the reported issue?
- REGRESSION_TEST: does the fix break the happy path of the affected feature?
- INTEGRATION_TEST: does the fix work with adjacent services? (API contracts, DB queries)
CONDITIONAL_TESTS
IF: database change THEN: test migration up AND down, verify data integrity
IF: API change THEN: test with actual client payloads (from incident logs)
IF: auth/security change THEN: test auth flow end-to-end, verify no bypass
IF: performance fix THEN: run targeted load test (see performance domain)
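EXAMPLE (sketch, not existing tooling): the mandatory and conditional tests above can be expressed as a function that turns the kinds of change in a hotfix into a test list. The function name `required_tests`, the change-type keywords, and the test labels it prints are all hypothetical; they only mirror the rules above.

```shell
# Sketch only: names and labels are hypothetical, mirroring the rules above.
# Arguments: zero or more change types (database, api, auth, performance).
# Prints one required test per line; the three mandatory tests always apply.
required_tests() {
  local change
  echo "smoke"
  echo "regression"
  echo "integration"
  for change in "$@"; do
    case "$change" in
      database)
        echo "migration-up-down"
        echo "data-integrity"
        ;;
      api) echo "client-payload-replay" ;;
      auth) echo "auth-flow-e2e" ;;
      performance) echo "targeted-load-test" ;;
      *)
        echo "unknown change type: $change" >&2
        return 1
        ;;
    esac
  done
}
```

USAGE: required_tests database api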
SKIP_WITH_JUSTIFICATION (SEV1 only)
- Full regression suite (run post-deploy instead)
- Cross-browser testing (run post-deploy)
- Accessibility testing (run post-deploy)
TOOL: run smoke test for specific feature
RUN: npm run test -- --grep "feature-name" --bail
TOOL: run affected integration tests
RUN: npm run test:integration -- --grep "feature-name"
ANTI_PATTERN: skipping ALL tests because "it's urgent"
FIX: smoke test takes < 2 minutes — always run it
ANTI_PATTERN: testing only in dev environment
FIX: if staging exists, test there — production-like environment catches config-dependent bugs
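EXAMPLE (sketch, not existing tooling): to make skipping the smoke test harder, the two documented test commands could be chained so that deployment is blocked on the first failure. The function name `run_hotfix_tests` is hypothetical, and the sketch assumes the project's test runner accepts the `--grep` and `--bail` flags exactly as shown in the RUN lines above.

```shell
# Sketch only: run_hotfix_tests is a hypothetical wrapper around the
# documented npm commands. It stops at the first failing step.
run_hotfix_tests() {
  local feature="$1"
  if [ -z "$feature" ]; then
    echo "usage: run_hotfix_tests <feature-name>" >&2
    return 2
  fi
  if ! npm run test -- --grep "$feature" --bail; then
    echo "smoke test failed for $feature: do not deploy" >&2
    return 1
  fi
  if ! npm run test:integration -- --grep "$feature"; then
    echo "integration tests failed for $feature: do not deploy" >&2
    return 1
  fi
  echo "mandatory hotfix tests passed for $feature"
}
```

USAGE: run_hotfix_tests feature-name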
HOTFIX:ROLLBACK_DECISIONS
ROLLBACK_VS_FORWARD_FIX
IF: root cause unknown AND impact ongoing THEN: ROLLBACK immediately
IF: root cause known AND fix can be implemented and deployed in < 30min THEN: forward-fix (faster than rollback + redeploy)
IF: rollback would cause data loss THEN: forward-fix (never lose data)
IF: rollback would break data migrations already applied THEN: forward-fix with migration fix
IF: multiple services affected AND unclear which caused it THEN: rollback all recent deploys
IF: incident has been active > 1 hour AND no fix in sight THEN: rollback
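EXAMPLE (sketch, not existing tooling): the rules above can overlap (for example, an unknown root cause with ongoing impact where a rollback would also lose data), so any encoding has to pick an order. The sketch below assumes the data-protecting rules win, then the multi-service rule, then the remaining rules in the order listed; that precedence is an assumption, not something the rules state. The function name `decide_action`, the `key=value` inputs, and the `undecided` fallback are hypothetical.

```shell
# Sketch only: decide_action and its key=value inputs are hypothetical.
# Assumed precedence: data-protecting rules first, then the rest in order.
decide_action() {
  local arg root_cause_known=no impact_ongoing=no fix_minutes=""
  local rollback_loses_data=no rollback_breaks_migrations=no
  local multi_service_unclear=no incident_minutes=0
  for arg in "$@"; do
    case "$arg" in
      root_cause_known=*) root_cause_known="${arg#*=}" ;;
      impact_ongoing=*) impact_ongoing="${arg#*=}" ;;
      fix_minutes=*) fix_minutes="${arg#*=}" ;;
      rollback_loses_data=*) rollback_loses_data="${arg#*=}" ;;
      rollback_breaks_migrations=*) rollback_breaks_migrations="${arg#*=}" ;;
      multi_service_unclear=*) multi_service_unclear="${arg#*=}" ;;
      incident_minutes=*) incident_minutes="${arg#*=}" ;;
      *) echo "unknown input: $arg" >&2; return 2 ;;
    esac
  done
  if [ "$rollback_loses_data" = yes ]; then
    echo "forward-fix"
  elif [ "$rollback_breaks_migrations" = yes ]; then
    echo "forward-fix-with-migration-fix"
  elif [ "$multi_service_unclear" = yes ]; then
    echo "rollback-all-recent-deploys"
  elif [ "$root_cause_known" = no ] && [ "$impact_ongoing" = yes ]; then
    echo "rollback"
  elif [ "$root_cause_known" = yes ] && [ -n "$fix_minutes" ] &&
    [ "$fix_minutes" -lt 30 ]; then
    echo "forward-fix"
  elif [ "$incident_minutes" -gt 60 ]; then
    echo "rollback"
  else
    echo "undecided" # no rule matched; a human decision is needed
  fi
}
```

USAGE: decide_action root_cause_known=yes fix_minutes=10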
ROLLBACK_PROCEDURE
TOOL: rollback k8s deployment
RUN: kubectl rollout undo deployment/<name> -n <namespace>
RUN: kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
TOOL: rollback with specific revision
RUN: kubectl rollout history deployment/<name> -n <namespace>
RUN: kubectl rollout undo deployment/<name> -n <namespace> --to-revision=<N>
TOOL: rollback database migration
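RUN: npx drizzle-kit drop (CAUTION: this only deletes a generated migration file from the migrations folder; it does not revert schema changes already applied to the database)
NOTE: to undo an applied change, prefer a new forward migration that reverses it over a destructive rollback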
CHECK: after rollback, verify the issue is resolved
CHECK: after rollback, verify no data was lost or corrupted
CHECK: after rollback, notify team that rollback was performed
RULE: every rollback must be documented with reason and verification
RULE: if rollback fails, escalate immediately — do not retry blindly
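EXAMPLE (sketch, not existing tooling): the kubectl rollback commands and the "do not retry blindly" rule can be combined in a wrapper that runs each step once and stops with an escalation message on the first failure. The function name `rollback_deployment` and its argument order are hypothetical; the kubectl invocations are the ones listed above.

```shell
# Sketch only: rollback_deployment is a hypothetical wrapper.
# Arguments: <name> <namespace> [revision]
# Runs each kubectl step once; on failure it reports and stops (no retry).
rollback_deployment() {
  local name="$1" namespace="$2" revision="${3:-}"
  if [ -z "$name" ] || [ -z "$namespace" ]; then
    echo "usage: rollback_deployment <name> <namespace> [revision]" >&2
    return 2
  fi
  if [ -n "$revision" ]; then
    kubectl rollout undo "deployment/$name" -n "$namespace" \
      --to-revision="$revision"
  else
    kubectl rollout undo "deployment/$name" -n "$namespace"
  fi || {
    echo "rollback of $name failed: escalate, do not retry" >&2
    return 1
  }
  kubectl rollout status "deployment/$name" -n "$namespace" --timeout=120s || {
    echo "rollback of $name did not become ready: escalate" >&2
    return 1
  }
  echo "rollback of $name complete: verify the issue and data integrity"
}
```

USAGE: rollback_deployment <name> <namespace> <N>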
HOTFIX:DEPLOYMENT_CHECKLIST
PRE_DEPLOY:
- [ ] Hotfix branch created from main
- [ ] Fix implemented and locally tested
- [ ] Expedited code review passed
- [ ] Smoke test passed
- [ ] Regression test passed (at minimum, happy path)
- [ ] Incident commander (mira) approved deployment
- [ ] Rollback plan documented (which revision to roll back to)
DEPLOY:
- [ ] Build new container image
- [ ] IF: executor change THEN: bash ge-ops/infrastructure/local/k3s/executor/build-executor.sh
- [ ] IF: admin-ui change THEN: rebuild admin-ui image
- [ ] IF: orchestrator change THEN: rebuild ge-orchestrator image
- [ ] Deploy to k8s: kubectl rollout restart deployment/<name> -n <namespace>
- [ ] Watch rollout: kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
- [ ] Verify pods are running: kubectl get pods -n <namespace> -l app=<name>
POST_DEPLOY:
- [ ] Verify fix resolves the incident (test from client perspective)
- [ ] Monitor error rates for 15 minutes
- [ ] Monitor performance metrics for 15 minutes
- [ ] IF: all clear THEN: update incident status to "mitigated" or "resolved"
- [ ] Merge hotfix branch to main AND develop
- [ ] Delete hotfix branch
- [ ] Run full regression suite (async, results reviewed within 24hr)
- [ ] Create follow-up ticket for any skipped review/test items
- [ ] Update incident record with resolution details
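EXAMPLE (sketch, not existing tooling): the three kubectl steps in the DEPLOY list can be chained so that the pod listing only runs after the rollout has completed. The function name `deploy_hotfix` is hypothetical; the sketch assumes the new image has already been built and that pods carry the `app=<name>` label used in the checklist.

```shell
# Sketch only: deploy_hotfix is a hypothetical wrapper around the DEPLOY steps.
# Arguments: <name> <namespace>. Assumes the image is already built.
deploy_hotfix() {
  local name="$1" namespace="$2"
  if [ -z "$name" ] || [ -z "$namespace" ]; then
    echo "usage: deploy_hotfix <name> <namespace>" >&2
    return 2
  fi
  kubectl rollout restart "deployment/$name" -n "$namespace" || return 1
  if ! kubectl rollout status "deployment/$name" -n "$namespace" \
    --timeout=120s; then
    echo "rollout of $name did not complete: follow the rollback plan" >&2
    return 1
  fi
  # List the pods so the operator can confirm they are Running.
  kubectl get pods -n "$namespace" -l "app=$name"
}
```

USAGE: deploy_hotfix <name> <namespace>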
HOTFIX:POST_VERIFICATION
DURATION: 24-hour stability window after hotfix deploy
HOUR_1:
- CHECK: error rate returned to baseline
- CHECK: response times returned to baseline
- CHECK: no new error patterns in logs
- CHECK: affected clients can use the system normally
HOUR_4:
- CHECK: no recurrence of the issue
- CHECK: full regression suite results reviewed
HOUR_24:
- CHECK: no delayed side effects
- CHECK: monitoring confirms sustained fix
- THEN: declare hotfix stable
- THEN: close incident if post-mortem is scheduled/complete
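EXAMPLE (sketch, not existing tooling): the stability window above can be tracked by computing the hours elapsed since the deploy and listing which checkpoints are due. The function names `hours_since` and `due_checkpoints` are hypothetical, and the sketch assumes the deploy time was recorded as a Unix timestamp and that `date +%s` is available.

```shell
# Sketch only: function names are hypothetical.

# Prints whole hours elapsed since a Unix timestamp (seconds).
hours_since() {
  echo $((($(date +%s) - $1) / 3600))
}

# Prints the checkpoints that are due after the given number of hours,
# or "none" during the first hour.
due_checkpoints() {
  local hours="$1" due=""
  if [ "$hours" -ge 1 ]; then due="HOUR_1"; fi
  if [ "$hours" -ge 4 ]; then due="$due HOUR_4"; fi
  if [ "$hours" -ge 24 ]; then due="$due HOUR_24"; fi
  echo "${due:-none}"
}
```

USAGE: due_checkpoints "$(hours_since <deploy-timestamp>)"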