DOMAIN:INCIDENT_RESPONSE¶
OWNER: mira
UPDATED: 2026-03-24
SCOPE: all client projects, all teams, GE platform itself
AGENTS: mira (commander+patterns), sandro (backend hotfix), tobias (frontend hotfix)
INCIDENT:LIFECYCLE¶
PHASES: detect → classify → respond → mitigate → resolve → post-mortem → pattern-feed
DETECT¶
SOURCES: monitoring alerts, client reports, automated health checks, agent self-reports
CHECK: is this a real incident or a false positive?
IF: alert from monitoring (Grafana/Loki/health-dump) THEN: verify with second signal before escalating
IF: client report THEN: treat as real until proven otherwise — client perception IS the incident
IF: agent self-report (executor failure, Redis timeout) THEN: check if isolated or systemic
RULE: detection-to-acknowledge target = 5 minutes (SEV1), 15 minutes (SEV2), 1 hour (SEV3), 4 hours (SEV4)
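The acknowledge targets above can be sketched as a simple lookup; `ACK_TARGETS` and `ack_deadline` are hypothetical names for illustration, not a platform API:

```python
from datetime import datetime, timedelta

# Detection-to-acknowledge targets per severity, as listed above.
# Hypothetical helper, not a real platform API.
ACK_TARGETS = {
    "SEV1": timedelta(minutes=5),
    "SEV2": timedelta(minutes=15),
    "SEV3": timedelta(hours=1),
    "SEV4": timedelta(hours=4),
}

def ack_deadline(detected_at: datetime, severity: str) -> datetime:
    """Latest acceptable acknowledgement time for an incident."""
    return detected_at + ACK_TARGETS[severity]
```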
CLASSIFY¶
SEVERITY: SEV1 — CRITICAL
- Platform-wide outage (all clients affected)
- Data loss or data corruption confirmed
- Security breach confirmed (unauthorized access to client data)
- Payment processing failure
- CRITERIA: revenue impact > 0 OR data integrity compromised OR security boundary violated
- SLA: acknowledge 5min, respond 15min, mitigate 1hr, resolve 4hr
- COMMUNICATION: immediate stakeholder notification, 15min status updates
SEVERITY: SEV2 — HIGH
- Single client fully blocked (cannot use their application)
- Core feature broken for subset of clients (>10%)
- Performance degradation >5x normal (p95 > 5s)
- Agent pipeline halted (no work being processed)
- CRITERIA: client work blocked OR significant degradation
- SLA: acknowledge 15min, respond 30min, mitigate 4hr, resolve 24hr
- COMMUNICATION: affected client notification, 30min status updates
SEVERITY: SEV3 — MODERATE
- Non-critical feature broken for single client
- Performance degradation 2-5x normal
- Single agent stuck or failing intermittently
- Non-blocking UI issues in production
- CRITERIA: degraded but workaround exists
- SLA: acknowledge 1hr, respond 4hr, mitigate 24hr, resolve 72hr
- COMMUNICATION: track in incident log, daily status update
SEVERITY: SEV4 — LOW
- Cosmetic issues
- Intermittent errors with automatic recovery
- Non-user-facing system warnings
- CRITERIA: no client impact, no data risk
- SLA: acknowledge 4hr, resolve 1 week
- COMMUNICATION: track in incident log, resolve in normal sprint
ANTI_PATTERN: classifying everything as SEV1 — creates alert fatigue, erodes trust
FIX: use the CRITERIA fields above — if it doesn't match, it's not that severity
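The CRITERIA fields can be applied as an ordered first-match check, highest severity first. A minimal sketch; the input flags (`revenue_impact`, `degradation_factor`, etc.) are assumed names, and a real classifier would derive them from monitoring data:

```python
# Illustrative severity classifier following the CRITERIA fields above.
# The keys of `facts` are hypothetical, not a real platform schema.
def classify_severity(facts: dict) -> str:
    # SEV1: revenue impact, data integrity compromised, or security boundary violated
    if (facts.get("revenue_impact")
            or facts.get("data_integrity_compromised")
            or facts.get("security_boundary_violated")):
        return "SEV1"
    # SEV2: client work blocked, or degradation >5x normal
    if facts.get("client_work_blocked") or facts.get("degradation_factor", 1) > 5:
        return "SEV2"
    # SEV3: degraded (2-5x normal) but a workaround exists
    if facts.get("degradation_factor", 1) >= 2 or facts.get("workaround_exists"):
        return "SEV3"
    # SEV4: no client impact, no data risk
    return "SEV4"
```

Defaulting to SEV4 matches the anti-pattern guidance: if the facts don't meet a severity's criteria, the incident isn't that severity.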
RESPOND (Incident Commander: Mira)¶
FIRST_5_MINUTES¶
- ACKNOWLEDGE incident in system (timestamp matters for SLA)
- CLASSIFY severity using criteria above
- IF: SEV1/SEV2 THEN: page Team Zulu (sandro + tobias)
- OPEN incident channel/thread — all communication goes here
- STATE the currently known facts: what's broken, who's affected, since when
- ASSIGN roles: commander (mira), backend (sandro), frontend (tobias), comms (mira)
FIRST_15_MINUTES¶
- GATHER data: error logs, metrics, recent deployments, recent config changes
- FORM hypothesis: what changed? deploy? config? external dependency?
- DECIDE: rollback vs forward-fix vs partial mitigation
- IF: cause unclear AND impact ongoing THEN: roll back the last deployment immediately
- COMMUNICATE: first status update to stakeholders with ETA for next update
FIRST_HOUR¶
- IF: mitigated THEN: document mitigation, shift to root cause analysis
- IF: not mitigated THEN: escalate — bring in additional expertise
- TRACK all actions taken with timestamps (compliance evidence)
- IF: SEV1 still active at 1hr THEN: notify Dirk-Jan directly
RULE: incident commander does NOT debug — commander coordinates, communicates, decides
RULE: every action and decision gets timestamped in the incident record
RULE: "I don't know" is a valid status — never guess in incident comms
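The timestamping rule can be enforced with a tiny append-only helper; `log_event` and the timeline entry shape are assumptions for illustration:

```python
from datetime import datetime, timezone

# Hypothetical helper enforcing "every action and decision gets timestamped":
# appends an event to the incident's timeline with a UTC timestamp.
def log_event(timeline: list, event: str) -> list:
    timeline.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "event": event,
    })
    return timeline
```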
MITIGATE¶
GOAL: stop the bleeding — reduce client impact to an acceptable level
MITIGATION_OPTIONS (fastest first):
1. Rollback to last known good (< 5 min if deployment was recent)
2. Feature flag disable (< 1 min if feature-flagged)
3. Traffic shift / DNS failover (< 5 min)
4. Scale out to absorb load (< 10 min)
5. Hotfix deploy (30 min - 2 hr, see hotfix procedures)
6. Data repair (time varies, requires careful planning)
RULE: mitigation that introduces new risk must be approved by incident commander
RULE: document what mitigation was applied — needed for post-mortem and compliance
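The "fastest first" ordering can be sketched as a first-applicable-option scan. The predicates (`recent_deploy`, `feature_flagged`, etc.) are hypothetical context flags, not a real API:

```python
# Sketch of "fastest first" mitigation selection. Option names mirror the
# numbered list above; the applicability predicates are assumptions.
MITIGATIONS = [
    ("rollback", lambda ctx: ctx.get("recent_deploy", False)),
    ("feature_flag_disable", lambda ctx: ctx.get("feature_flagged", False)),
    ("traffic_shift", lambda ctx: ctx.get("failover_available", False)),
    ("scale_out", lambda ctx: ctx.get("load_related", False)),
    ("hotfix", lambda ctx: ctx.get("code_fix_known", False)),
]

def pick_mitigation(ctx: dict) -> str:
    """Return the first (fastest) applicable mitigation for the incident context."""
    for name, applies in MITIGATIONS:
        if applies(ctx):
            return name
    return "data_repair"  # slowest option; requires careful planning
```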
RESOLVE¶
GOAL: root cause eliminated, system back to full normal operation
CHECK: is the fix permanent or temporary?
IF: temporary fix THEN: create follow-up ticket for permanent fix with deadline
IF: permanent fix THEN: verify with monitoring for 24hr stability window
THEN: close incident with resolution summary
THEN: schedule post-mortem within 48 hours
POST_MORTEM¶
FORMAT: blameless — focus on systems, not individuals
TEMPLATE:
# Post-Mortem: [INCIDENT-ID]
Date: [date]
Severity: [SEV level]
Duration: [detect to resolve]
Commander: [who]
## Summary
[2-3 sentences: what happened, who was affected, how long]
## Timeline
[HH:MM] — [event/action/decision]
(every significant event, timestamped)
## Root Cause
[technical root cause — be specific]
## Contributing Factors
- [factor 1 — why did this slip through?]
- [factor 2 — why wasn't it caught earlier?]
## Impact
- Clients affected: [count/names]
- Duration of impact: [time]
- Data impact: [none/description]
## What Went Well
- [thing that worked]
## What Went Wrong
- [thing that failed or was slow]
## Action Items
| Action | Owner | Deadline | Status |
|--------|-------|----------|--------|
| [specific action] | [agent] | [date] | open |
## Lessons Learned
[feed into wiki brain as learnings]
RULE: post-mortem must happen within 48 hours while memory is fresh
RULE: action items must have owners and deadlines — no orphan actions
RULE: every SEV1/SEV2 post-mortem produces at least one learning for the wiki brain
ANTI_PATTERN: "we'll be more careful next time" as an action item
FIX: systemic fix — add a test, add monitoring, add a gate, change a process
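The "no orphan actions" rule lends itself to an automated check before a post-mortem is closed. A minimal sketch; the field names (`owner`, `deadline`) follow the template table above but the function itself is hypothetical:

```python
# Minimal check for the "no orphan actions" rule: every post-mortem action
# item must carry an owner and a deadline. Field names match the template table.
def orphan_actions(action_items: list) -> list:
    """Return the action items missing an owner or a deadline."""
    return [a for a in action_items
            if not a.get("owner") or not a.get("deadline")]
```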
INCIDENT:COMMUNICATION¶
INTERNAL_COMMS¶
IF: SEV1/SEV2 THEN:
- Notify all active agents via Redis broadcast
- Post to incident channel with severity, impact, ETA
- Update every 15 minutes (SEV1) or 30 minutes (SEV2)
- Include: current status, actions in progress, next update time
IF: SEV3/SEV4 THEN:
- Track in incident log
- Notify relevant team only
- Daily summary updates
TEMPLATE_STATUS_UPDATE:
INCIDENT: [ID] | SEV: [level] | STATUS: [investigating|mitigating|resolved]
IMPACT: [who is affected and how]
CURRENT ACTION: [what we're doing right now]
NEXT UPDATE: [time]
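TEMPLATE_STATUS_UPDATE can be rendered mechanically so updates stay uniform across commanders. The dict keys below are hypothetical, chosen to mirror the template fields:

```python
# Illustrative renderer for TEMPLATE_STATUS_UPDATE above.
# The input keys are assumptions matching the template fields.
def render_status_update(u: dict) -> str:
    return (
        f"INCIDENT: {u['id']} | SEV: {u['severity']} | STATUS: {u['status']}\n"
        f"IMPACT: {u['impact']}\n"
        f"CURRENT ACTION: {u['current_action']}\n"
        f"NEXT UPDATE: {u['next_update']}"
    )
```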
CLIENT_COMMS¶
RULE: client-facing communication is factual, empathetic, and free of technical jargon
RULE: never blame the client, even if they triggered the issue
RULE: always include what we're doing about it and when they'll hear next
TEMPLATE_CLIENT_NOTIFICATION:
We've identified an issue affecting [description of impact].
Our team is actively working on a resolution.
We expect to have an update by [time].
We apologize for the inconvenience.
ANTI_PATTERN: radio silence during an incident
FIX: even "we're still investigating, no new info" is better than silence
INCIDENT:RECORD_FORMAT¶
PURPOSE: compliance evidence (ISO 27001 A.5.24, A.5.25, A.5.26, A.5.27)
REQUIRED_FIELDS:
- incident_id: unique identifier (INC-YYYY-NNNN)
- detected_at: ISO 8601 timestamp
- detected_by: source (monitoring/client/agent/manual)
- acknowledged_at: ISO 8601 timestamp
- severity: SEV1-4
- classification: outage | degradation | security | data | config
- affected_clients: list
- affected_services: list
- commander: agent name
- responders: list of agent names
- timeline: ordered list of timestamped events
- root_cause: text
- resolution: text
- resolved_at: ISO 8601 timestamp
- post_mortem_url: link to post-mortem
- action_items: list with owners and deadlines
- sla_met: boolean per SLA metric
STORAGE: PostgreSQL incidents table (SSOT), wiki post-mortem page (human-readable)
INCIDENT:SLA_MEASUREMENT¶
METRIC: time_to_acknowledge = acknowledged_at - detected_at
METRIC: time_to_respond = first_responder_active_at - detected_at
METRIC: time_to_mitigate = impact_reduced_at - detected_at
METRIC: time_to_resolve = resolved_at - detected_at
REPORT: monthly SLA compliance report
- % incidents acknowledged within SLA by severity
- % incidents mitigated within SLA by severity
- % incidents resolved within SLA by severity
- mean time to detect (MTTD)
- mean time to resolve (MTTR)
- incident count by severity trend
RULE: SLA clock starts at detection, not acknowledgement
RULE: SLA clock pauses only if explicitly waiting on client input (document the pause)
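The record-completeness requirement and the metric formulas above can be sketched together. `validate_record` and `sla_metrics` are hypothetical helpers; the field list is copied from REQUIRED_FIELDS, the ID pattern from INC-YYYY-NNNN, and the clock-pause rule is omitted for brevity:

```python
import re
from datetime import datetime

# Field list copied from REQUIRED_FIELDS in the record format above.
REQUIRED_FIELDS = {
    "incident_id", "detected_at", "detected_by", "acknowledged_at",
    "severity", "classification", "affected_clients", "affected_services",
    "commander", "responders", "timeline", "root_cause", "resolution",
    "resolved_at", "post_mortem_url", "action_items", "sla_met",
}

def validate_record(record: dict) -> list:
    """Return a list of problems; empty means the record is complete."""
    problems = [f"missing: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if not re.fullmatch(r"INC-\d{4}-\d{4}", record.get("incident_id", "")):
        problems.append("bad incident_id (expected INC-YYYY-NNNN)")
    return problems

def sla_metrics(rec: dict) -> dict:
    """Compute the four SLA metrics; all clocks start at detected_at."""
    detected = datetime.fromisoformat(rec["detected_at"])
    def delta(field):
        ts = rec.get(field)
        return (datetime.fromisoformat(ts) - detected) if ts else None
    return {
        "time_to_acknowledge": delta("acknowledged_at"),
        "time_to_respond": delta("first_responder_active_at"),
        "time_to_mitigate": delta("impact_reduced_at"),
        "time_to_resolve": delta("resolved_at"),
    }
```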