DOMAIN:INCIDENT_RESPONSE:HOTFIX_PROCEDURES

OWNER: sandro (backend), tobias (frontend)
UPDATED: 2026-03-24
SCOPE: production hotfixes during and after incidents
AGENTS: sandro, tobias, mira (approval)

HOTFIX:BRANCHING

RULE: hotfix branches from main (production), NEVER from develop
RULE: branch naming: hotfix/INC-YYYY-NNNN-short-description
RULE: after merge to main, ALSO merge to develop (prevent regression on next release)

FLOW:

main ──────┬──────────── merge hotfix ──────── main
           │                ↑
           └── hotfix/INC-* ┘
                            │
develop ────────────────── merge hotfix ────── develop

TOOL: create hotfix branch
RUN: git checkout main && git pull origin main && git checkout -b hotfix/INC-YYYY-NNNN-description
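
EXAMPLE: a minimal sketch of the branch-naming rule, not existing tooling. hotfix_branch_name is a hypothetical helper. It assumes YYYY and NNNN are each exactly four digits and that the description is lowercase words joined by hyphens; adjust both patterns if the real convention differs.

```shell
# Sketch: build and validate a hotfix branch name before creating the branch.
# hotfix_branch_name is a hypothetical helper, not part of any existing tooling.
hotfix_branch_name() {
  incident="$1"      # example: INC-2026-0042
  description="$2"   # example: fix-login-timeout

  # Assumption: the incident ID is INC-, a four-digit year, and a four-digit number.
  if ! printf '%s\n' "$incident" | grep -Eq '^INC-[0-9]{4}-[0-9]{4}$'; then
    echo "invalid incident id: $incident" >&2
    return 1
  fi

  # Assumption: the description is lowercase letters and digits joined by hyphens.
  if ! printf '%s\n' "$description" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'; then
    echo "invalid description: $description" >&2
    return 1
  fi

  printf 'hotfix/%s-%s\n' "$incident" "$description"
}
```

USAGE: branch=$(hotfix_branch_name INC-2026-0042 fix-login-timeout) && git checkout main && git pull origin main && git checkout -b "$branch"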

TOOL: complete hotfix merge
RUN: git checkout main && git merge --no-ff hotfix/INC-YYYY-NNNN-description
RUN: git checkout develop && git merge --no-ff hotfix/INC-YYYY-NNNN-description
RUN: git branch -d hotfix/INC-YYYY-NNNN-description
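
EXAMPLE: a sketch that chains the merge steps so a failure (for example a merge conflict) stops the sequence before the hotfix branch is deleted. run and complete_hotfix_merge are hypothetical helpers, and DRY_RUN is an assumed convention for rehearsing the steps without touching the repository.

```shell
# Sketch: run the merge steps in order and stop at the first failure.
# run() and complete_hotfix_merge() are hypothetical helpers.
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

complete_hotfix_merge() {
  branch="$1"
  # The && chain aborts on the first non-zero exit, so the hotfix branch
  # is only deleted after both merges have succeeded.
  run git checkout main &&
    run git merge --no-ff "$branch" &&
    run git checkout develop &&
    run git merge --no-ff "$branch" &&
    run git branch -d "$branch"
}
```

USAGE: DRY_RUN=1 complete_hotfix_merge hotfix/INC-YYYY-NNNN-description prints the five git commands; without DRY_RUN it executes them.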

ANTI_PATTERN: branching from develop for hotfix — includes unreleased features, unpredictable behavior
FIX: always branch from main — main IS production

ANTI_PATTERN: forgetting to merge hotfix back to develop
FIX: hotfix checklist includes develop merge as mandatory step

HOTFIX:EXPEDITED_CODE_REVIEW

RULE: hotfixes MUST still be reviewed — but review scope is narrowed

MUST_CHECK (non-negotiable even under time pressure)

[ ] Does the fix address the root cause (not just symptoms)?
[ ] Does it introduce any security risk? (injection, auth bypass, data exposure)
[ ] Does it affect data integrity? (migrations, writes, deletes)
[ ] Is the change scoped tightly to the incident? (no unrelated changes bundled)
[ ] Are there any obvious performance regressions? (N+1 queries, missing indexes)

MAY_SKIP (during SEV1/SEV2 only, must be addressed in follow-up)

- Code style / formatting nits
- Test coverage for edge cases (core path test still required)
- Documentation updates
- Refactoring suggestions
- Type completeness (if TypeScript, basic types still required)

REVIEW_FLOW

IF: SEV1 THEN: single reviewer (mira or senior dev), review starts as soon as the PR is opened, 15min max
IF: SEV2 THEN: single reviewer, 30min max
IF: SEV3/SEV4 THEN: normal review process

RULE: expedited review generates a follow-up ticket for full review within 1 week
RULE: reviewer must explicitly acknowledge they performed expedited (not full) review
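
EXAMPLE: a sketch that encodes REVIEW_FLOW as a lookup, for example to print the applicable policy in a PR template or chat bot. review_policy is a hypothetical helper; the limits are copied from the rules above.

```shell
# Sketch: map incident severity to the review rules in REVIEW_FLOW.
# review_policy is a hypothetical helper, not part of any existing tooling.
review_policy() {
  case "$1" in
    SEV1) echo "expedited: single reviewer, 15min max, follow-up ticket required" ;;
    SEV2) echo "expedited: single reviewer, 30min max, follow-up ticket required" ;;
    SEV3|SEV4) echo "normal review process" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

USAGE: review_policy SEV2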

HOTFIX:EXPEDITED_TESTING

MANDATORY_TESTS (all severities)

SMOKE_TEST: does the fix resolve the reported issue?
REGRESSION_TEST: does the fix break the happy path of affected feature?
INTEGRATION_TEST: does the fix work with adjacent services? (API contracts, DB queries)

CONDITIONAL_TESTS

IF: database change THEN: test migration up AND down, verify data integrity
IF: API change THEN: test with actual client payloads (from incident logs)
IF: auth/security change THEN: test auth flow end-to-end, verify no bypass
IF: performance fix THEN: run targeted load test (see performance domain)

SKIP_WITH_JUSTIFICATION (SEV1 only)

- Full regression suite (run post-deploy instead)
- Cross-browser testing (run post-deploy)
- Accessibility testing (run post-deploy)

TOOL: run smoke test for specific feature
RUN: npm run test -- --grep "feature-name" --bail

TOOL: run affected integration tests
RUN: npm run test:integration -- --grep "feature-name"

ANTI_PATTERN: skipping ALL tests because "it's urgent"
FIX: smoke test takes < 2 minutes — always run it

ANTI_PATTERN: testing only in dev environment
FIX: if staging exists, test there — production-like environment catches config-dependent bugs

HOTFIX:ROLLBACK_DECISIONS

ROLLBACK_VS_FORWARD_FIX

IF: root cause unknown AND impact ongoing THEN: ROLLBACK immediately
IF: root cause known AND fix can be deployed in < 30min THEN: forward-fix (faster than rollback + redeploy)
IF: rollback would cause data loss THEN: forward-fix (never lose data)
IF: rollback would break data migrations already applied THEN: forward-fix with migration fix
IF: multiple services affected AND unclear which caused it THEN: rollback all recent deploys
IF: incident has been active > 1 hour AND no fix in sight THEN: rollback
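
EXAMPLE: a sketch of the decision rules as one function. The rules above do not state a precedence, so this sketch ASSUMES the two data-safety rules win over every rollback rule. It approximates "no fix in sight" as "no forward fix estimated under 30 minutes" and leaves out the multi-service rule. decide_rollback and its argument names are hypothetical.

```shell
# Sketch: encode ROLLBACK_VS_FORWARD_FIX as a single decision function.
# decide_rollback and all argument names are hypothetical.
decide_rollback() {
  root_cause_known="$1"   # yes | no
  impact_ongoing="$2"     # yes | no
  fix_eta_min="$3"        # estimated minutes until a forward fix is deployed
  data_loss="$4"          # yes if a rollback would lose data
  breaks_migrations="$5"  # yes if a rollback would break applied migrations
  active_min="$6"         # minutes since the incident started

  # Assumed precedence: data-safety rules are checked before any rollback rule.
  if [ "$data_loss" = "yes" ]; then
    echo "forward-fix"
  elif [ "$breaks_migrations" = "yes" ]; then
    echo "forward-fix-with-migration-fix"
  elif [ "$root_cause_known" = "no" ] && [ "$impact_ongoing" = "yes" ]; then
    echo "rollback"
  elif [ "$root_cause_known" = "yes" ] && [ "$fix_eta_min" -lt 30 ]; then
    echo "forward-fix"
  elif [ "$active_min" -gt 60 ]; then
    # Reached only when no quick forward fix exists (approximates "no fix in sight").
    echo "rollback"
  else
    echo "undecided"
  fi
}
```

USAGE: decide_rollback no yes 120 no no 10 prints rollback (unknown cause, ongoing impact, no data at risk).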

ROLLBACK_PROCEDURE

TOOL: rollback k8s deployment
RUN: kubectl rollout undo deployment/<name> -n <namespace>
RUN: kubectl rollout status deployment/<name> -n <namespace> --timeout=120s

TOOL: rollback with specific revision
RUN: kubectl rollout history deployment/<name> -n <namespace>
RUN: kubectl rollout undo deployment/<name> -n <namespace> --to-revision=<N>

TOOL: rollback database migration
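RUN: npx drizzle-kit drop (CAUTION: this removes a generated migration file from the migrations folder; it does NOT revert schema changes already applied to the database)
NOTE: to undo an applied change, write a forward migration that reverses it; prefer this over any destructive rollback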

CHECK: after rollback, verify the issue is resolved
CHECK: after rollback, verify no data was lost or corrupted
CHECK: after rollback, notify team that rollback was performed

RULE: every rollback must be documented with reason and verification
RULE: if rollback fails, escalate immediately — do not retry blindly
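
EXAMPLE: a sketch that wraps the two kubectl commands from ROLLBACK_PROCEDURE and stops with an escalation message on the first failure instead of retrying. rollback_deployment is a hypothetical helper, and the KUBECTL override is an assumed convention for rehearsing the flow with a stub.

```shell
# Sketch: roll back one deployment, wait for the rollout, and stop on failure.
# rollback_deployment is a hypothetical helper. KUBECTL can be overridden
# (for example with a stub function) to rehearse without touching a cluster.
KUBECTL="${KUBECTL:-kubectl}"

rollback_deployment() {
  name="$1"
  namespace="$2"

  if ! $KUBECTL rollout undo "deployment/$name" -n "$namespace"; then
    echo "ROLLBACK FAILED for $name: escalate, do not retry" >&2
    return 1
  fi

  # rollout status exits non-zero if the rollout has not completed by the timeout.
  if ! $KUBECTL rollout status "deployment/$name" -n "$namespace" --timeout=120s; then
    echo "ROLLBACK NOT HEALTHY for $name: escalate, do not retry" >&2
    return 1
  fi

  echo "rollback of $name complete: verify the issue and data integrity, then notify the team"
}
```

USAGE: rollback_deployment <name> <namespace>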

HOTFIX:DEPLOYMENT_CHECKLIST

PRE_DEPLOY:
- [ ] Hotfix branch created from main
- [ ] Fix implemented and locally tested
- [ ] Expedited code review passed
- [ ] Smoke test passed
- [ ] Regression test passed (at minimum, happy path)
- [ ] Incident commander (mira) approved deployment
- [ ] Rollback plan documented (which revision to undo to)

DEPLOY:
- [ ] Build new container image
- [ ] IF: executor change THEN: bash ge-ops/infrastructure/local/k3s/executor/build-executor.sh
- [ ] IF: admin-ui change THEN: rebuild admin-ui image
- [ ] IF: orchestrator change THEN: rebuild ge-orchestrator image
- [ ] Deploy to k8s: kubectl rollout restart deployment/<name> -n <namespace> (restart reuses the existing image reference, so confirm that tag now resolves to the new build)
- [ ] Watch rollout: kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
- [ ] Verify pods are running: kubectl get pods -n <namespace> -l app=<name>

POST_DEPLOY:
- [ ] Verify fix resolves the incident (test from client perspective)
- [ ] Monitor error rates for 15 minutes
- [ ] Monitor performance metrics for 15 minutes
- [ ] IF: all clear THEN: update incident status to "mitigated" or "resolved"
- [ ] Merge hotfix branch to main AND develop
- [ ] Delete hotfix branch
- [ ] Run full regression suite (async, results reviewed within 24hr)
- [ ] Create follow-up ticket for any skipped review/test items
- [ ] Update incident record with resolution details

HOTFIX:POST_VERIFICATION

DURATION: 24-hour stability window after hotfix deploy

HOUR_1:
- CHECK: error rate returned to baseline
- CHECK: response times returned to baseline
- CHECK: no new error patterns in logs
- CHECK: affected clients can use the system normally

HOUR_4:
- CHECK: no recurrence of the issue
- CHECK: full regression suite results reviewed

HOUR_24:
- CHECK: no delayed side effects
- CHECK: monitoring confirms sustained fix
- THEN: declare hotfix stable
- THEN: close incident if post-mortem is scheduled/complete
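
EXAMPLE: a sketch that lists which verification checkpoints are due, given whole hours elapsed since the hotfix deploy. checkpoints_due is a hypothetical helper; how the elapsed hours are obtained is left to the caller.

```shell
# Sketch: print every checkpoint of the 24-hour stability window that is due.
# checkpoints_due is a hypothetical helper; $1 is whole hours since deploy.
checkpoints_due() {
  hours="$1"
  if [ "$hours" -ge 1 ]; then echo "HOUR_1"; fi
  if [ "$hours" -ge 4 ]; then echo "HOUR_4"; fi
  if [ "$hours" -ge 24 ]; then echo "HOUR_24"; fi
  return 0
}
```

USAGE: checkpoints_due 5 lists HOUR_1 and HOUR_4; the hotfix is declared stable only after the HOUR_24 checks pass.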