DOMAIN:INCIDENT_RESPONSE:PITFALLS¶
OWNER: mira
UPDATED: 2026-03-24
SCOPE: common incident response mistakes and how to avoid them
PITFALL:HERO_DEBUGGING¶
ANTI_PATTERN: one person goes dark, debugging alone, no communication for 30+ minutes
FIX: incident commander enforces 15-minute check-ins
FIX: share screen / share findings in incident channel continuously
RULE: if you haven't made progress in 15 minutes, say so — someone else may have insight
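EXAMPLE: a minimal sketch of how the 15-minute check-in could be enforced by tooling rather than memory. The helper name `silent_responders`, the `last_update` mapping, and the responder names are illustrative assumptions, not an existing tool.

```python
from datetime import datetime, timedelta

# Assumption: 15 minutes mirrors the check-in interval in the FIX above.
CHECK_IN_INTERVAL = timedelta(minutes=15)


def silent_responders(last_update: dict[str, datetime], now: datetime,
                      interval: timedelta = CHECK_IN_INTERVAL) -> list[str]:
    """Return responders whose last channel update is older than the interval."""
    return sorted(name for name, seen in last_update.items()
                  if now - seen > interval)


now = datetime(2026, 3, 24, 12, 0)
updates = {
    "alex": now - timedelta(minutes=5),   # checked in recently
    "sam": now - timedelta(minutes=32),   # gone dark
}
print(silent_responders(updates, now))
```

The incident commander would ping anyone this returns, even if all they have to report is "no progress yet".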
PITFALL:CHANGING_TWO_THINGS_AT_ONCE¶
ANTI_PATTERN: applying multiple fixes simultaneously; if it works, you don't know which one fixed it
FIX: change one thing, test, then change the next if needed
EXCEPTION: SEV1 with known compound cause — fix both, verify, document both
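EXAMPLE: the change-one-thing-then-test loop, sketched under assumptions. `apply_one_at_a_time`, the fix names, and the toy `state` dictionary are hypothetical stand-ins for real remediation steps and a real health check.

```python
from typing import Callable, Optional


def apply_one_at_a_time(
    fixes: list[tuple[str, Callable[[], None]]],
    is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Apply fixes in order, re-testing after each one.

    Returns the name of the fix after which the check first passed,
    or None if no fix restored health. Later fixes are not applied.
    """
    for name, apply_fix in fixes:
        apply_fix()
        if is_healthy():
            return name
    return None


# Toy system state standing in for a real service (hypothetical).
state = {"pool_size": 5, "worker_restarted": False}
applied = []


def raise_pool_size():
    state["pool_size"] = 20
    applied.append("raise pool size")


def restart_worker():
    state["worker_restarted"] = True
    applied.append("restart worker")


def clear_cache():
    applied.append("clear cache")


winner = apply_one_at_a_time(
    [("raise pool size", raise_pool_size),
     ("restart worker", restart_worker),
     ("clear cache", clear_cache)],
    is_healthy=lambda: state["worker_restarted"],
)
print(winner, applied)
```

Because the loop stops at the first passing check, the third fix is never applied and the record shows exactly which change preceded recovery.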
PITFALL:DEBUGGING_IN_PRODUCTION_WITHOUT_SAFETY¶
ANTI_PATTERN: running experimental queries/commands directly on production
FIX: use read-only connections for investigation
FIX: test fixes on staging first (even under time pressure, 5 extra minutes can save hours)
FIX: if production exec is necessary, pair with another person watching
RULE: NEVER run DELETE/UPDATE without a WHERE clause on production, EVER
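EXAMPLE: a minimal pre-flight guard for the RULE above. This is a textual check, not a SQL parser; the function name `is_unbounded_write` and the table names are assumptions for illustration. A `WHERE` that only appears inside a string literal or subquery would fool it, so treat it as a backstop to pairing and read-only connections, not a replacement.

```python
import re

# Statement starts with DELETE or UPDATE (case-insensitive).
_WRITE = re.compile(r"^\s*(DELETE|UPDATE)\b", re.IGNORECASE)
# Any WHERE keyword anywhere in the statement.
_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def is_unbounded_write(statement: str) -> bool:
    """True if the statement is a DELETE/UPDATE that contains no WHERE keyword."""
    return bool(_WRITE.match(statement)) and not _WHERE.search(statement)


for sql in (
    "DELETE FROM sessions",
    "UPDATE users SET active = 0",
    "delete from sessions where expires_at < now()",
    "SELECT * FROM users",
):
    verdict = "BLOCK" if is_unbounded_write(sql) else "allow"
    print(f"{verdict}: {sql}")
```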
PITFALL:FORGETTING_TO_COMMUNICATE¶
ANTI_PATTERN: resolving the incident but not telling anyone
FIX: status updates at regular intervals (every 15 min for SEV1, every 30 min for SEV2)
FIX: explicit "resolved" announcement with summary
FIX: client notification of resolution
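EXAMPLE: the update cadence expressed as data, so a reminder bot or checklist can compute when the next status update is due. `next_update_due` is a hypothetical helper; only the SEV1 and SEV2 intervals come from this document, so other severities deliberately raise a `KeyError`.

```python
from datetime import datetime, timedelta

# Intervals taken from the FIX above; no other severities are defined here.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return the time by which the next status update should be posted."""
    return last_update + UPDATE_INTERVAL[severity]


last = datetime(2026, 3, 24, 12, 0)
print(next_update_due("SEV1", last))
print(next_update_due("SEV2", last))
```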
PITFALL:SKIPPING_POST_MORTEM¶
ANTI_PATTERN: "we fixed it, let's move on" — no post-mortem, no learning
FIX: post-mortem is MANDATORY for SEV1/SEV2
FIX: schedule within 48 hours while memory is fresh
FIX: mira tracks post-mortem completion
PITFALL:BLAME_CULTURE¶
ANTI_PATTERN: "who did this?" in the incident channel
FIX: blameless culture — focus on systems, not individuals
FIX: ask "what allowed this to happen?" not "who caused this?"
FIX: post-mortem focuses on systemic improvements, not punishment
PITFALL:ALERT_FATIGUE¶
ANTI_PATTERN: hundreds of alerts firing, all ignored because "they always fire"
FIX: every alert must be actionable — if you can't act on it, delete it
FIX: tune thresholds based on actual incidents, not hypothetical ones
FIX: review alert signal-to-noise ratio monthly
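EXAMPLE: one possible way to put a number on signal-to-noise for the monthly review. It assumes you can export each alert firing together with whether anyone acted on it; the function names, the alert names, and the 10% threshold are illustrative assumptions.

```python
from collections import Counter


def actionable_ratio(firings: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each alert name to the fraction of its firings that were acted on."""
    total: Counter = Counter()
    acted: Counter = Counter()
    for name, was_acted_on in firings:
        total[name] += 1
        acted[name] += int(was_acted_on)
    return {name: acted[name] / total[name] for name in total}


def deletion_candidates(firings: list[tuple[str, bool]],
                        threshold: float = 0.1) -> list[str]:
    """Alerts acted on less often than the threshold (assumed cutoff)."""
    ratios = actionable_ratio(firings)
    return sorted(name for name, ratio in ratios.items() if ratio < threshold)


# Hypothetical month of firings: (alert name, was it acted on?)
history = [("disk_full", True)] * 3 + [("cpu_spike", False)] * 20
print(actionable_ratio(history))
print(deletion_candidates(history))
```

Alerts that land in the candidate list are the ones to delete or re-tune against actual incidents.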
PITFALL:NO_ROLLBACK_PLAN¶
ANTI_PATTERN: deploying a hotfix without knowing how to undo it
FIX: document rollback plan BEFORE deploying
FIX: know which k8s revision to roll back to
FIX: know if database migrations are reversible
RULE: if rollback is impossible, the deploy needs extra scrutiny
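EXAMPLE: a sketch of a rollback plan written down before the deploy. It assumes the workload is a Kubernetes Deployment rolled back with `kubectl rollout undo`; the `RollbackPlan` class, its field names, and the deployment name `api` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RollbackPlan:
    deployment: str
    previous_revision: Optional[int]        # e.g. read from `kubectl rollout history`
    migrations_reversible: Optional[bool]   # None means nobody has checked yet

    def gaps(self) -> list[str]:
        """List what is still unknown or risky; an empty list means ready."""
        problems = []
        if self.previous_revision is None:
            problems.append("rollback revision unknown")
        if self.migrations_reversible is None:
            problems.append("migration reversibility unknown")
        elif not self.migrations_reversible:
            problems.append("migrations irreversible: deploy needs extra scrutiny")
        return problems

    def undo_command(self) -> str:
        """Build the rollback command; refuses if the revision is unknown."""
        if self.previous_revision is None:
            raise ValueError("no known revision to roll back to")
        return (f"kubectl rollout undo deployment/{self.deployment} "
                f"--to-revision={self.previous_revision}")


plan = RollbackPlan("api", previous_revision=41, migrations_reversible=True)
print(plan.gaps())
print(plan.undo_command())
```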
PITFALL:INCIDENT_TUNNEL_VISION¶
ANTI_PATTERN: fixating on one hypothesis and ignoring contradicting evidence
FIX: after 15 minutes on one track, explicitly check whether the evidence supports your hypothesis
FIX: have someone else review your evidence independently
FIX: check the simple things: is the service running? is DNS resolving? is the cert valid?
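EXAMPLE: two of the "simple things" as quick checks using only the Python standard library. The function names are assumptions; `fetch_not_after` needs network access to the target host and is defined but not called here. Whether the service itself is running depends on your platform and is not covered.

```python
import socket
import ssl
import time
from typing import Optional


def dns_resolves(host: str) -> bool:
    """True if the local resolver returns at least one address for host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the server certificate's notAfter string over TLS (needs network)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def cert_days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until expiry; negative means the certificate has already expired."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


# Offline demonstration with a fixed timestamp string in getpeercert() format.
sample = "Jan  5 09:34:43 2018 GMT"
ten_days_before = ssl.cert_time_to_seconds(sample) - 10 * 86400
print(cert_days_remaining(sample, now=ten_days_before))
```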
PITFALL:LOSING_INCIDENT_CONTEXT¶
ANTI_PATTERN: debugging in Slack threads that disappear, no permanent record
FIX: start the incident record immediately (even if sparse)
FIX: log key findings and decisions with timestamps
FIX: post-mortem fills in the gaps while memory is fresh
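EXAMPLE: a minimal sketch of a timestamped incident record that outlives the chat thread. `IncidentRecord`, its entry kinds, and the sample entries are hypothetical; a real record would live in whatever durable store the team already uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentRecord:
    title: str
    # Each entry is (timestamp, kind, text); kind is e.g. "finding" or "decision".
    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def log(self, kind: str, text: str, at: Optional[datetime] = None) -> None:
        """Append an entry, stamping it with the current UTC time by default."""
        stamp = at if at is not None else datetime.now(timezone.utc)
        self.entries.append((stamp, kind, text))

    def timeline(self) -> str:
        """Render entries in chronological order, one per line."""
        return "\n".join(f"{ts.isoformat()} [{kind}] {text}"
                         for ts, kind, text in sorted(self.entries))


record = IncidentRecord("checkout errors")   # started immediately, still sparse
record.log("decision", "roll back to previous revision",
           at=datetime(2026, 3, 24, 12, 10, tzinfo=timezone.utc))
record.log("finding", "error rate rose after 11:55 deploy",
           at=datetime(2026, 3, 24, 12, 0, tzinfo=timezone.utc))
print(record.timeline())
```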
PITFALL:NOT_TESTING_THE_FIX¶
ANTI_PATTERN: "I think I fixed it" without verification
FIX: verify from the client's perspective (not just server-side)
FIX: verify with the same scenario that triggered the incident
FIX: monitor for 15 minutes post-fix minimum
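EXAMPLE: a sketch of what "verified" could mean in code: health samples must cover the full 15-minute window after the fix and every one must pass. `fix_verified` and the sample format are assumptions; ideally the samples come from a client-side check that replays the triggering scenario.

```python
from datetime import datetime, timedelta

# Minimum observation window from the FIX above.
MONITOR_WINDOW = timedelta(minutes=15)


def fix_verified(samples: list[tuple[datetime, bool]], fixed_at: datetime,
                 window: timedelta = MONITOR_WINDOW) -> bool:
    """True only if post-fix samples span the whole window and all are healthy."""
    after = [(ts, ok) for ts, ok in samples if ts >= fixed_at]
    if not after:
        return False
    covers_window = max(ts for ts, _ in after) - fixed_at >= window
    return covers_window and all(ok for _, ok in after)


fixed_at = datetime(2026, 3, 24, 12, 0)
# One healthy sample per minute for minutes 0 through 15 (hypothetical data).
healthy = [(fixed_at + timedelta(minutes=m), True) for m in range(16)]
print(fix_verified(healthy, fixed_at))
print(fix_verified(healthy[:10], fixed_at))   # stopped watching too early
```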
PITFALL:CASCADING_INCIDENT_RESPONSE¶
ANTI_PATTERN: incident response actions cause a second incident (e.g., restart that triggers data loss)
FIX: assess impact of response actions before executing
FIX: prefer reversible actions (rollback) over irreversible ones (data operations)
FIX: incident commander approves any action that could cause additional impact
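EXAMPLE: the assess-then-act rule as a small sketch. `ResponseAction`, its fields, and the sample actions are hypothetical; the point is only that reversibility and potential impact are written down before anything is executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseAction:
    description: str
    reversible: bool
    may_cause_impact: bool   # assessed by the responder before executing


def needs_commander_approval(action: ResponseAction) -> bool:
    """Any action that could cause additional impact needs explicit approval."""
    return action.may_cause_impact


def preferred_order(actions: list[ResponseAction]) -> list[ResponseAction]:
    """Reversible actions first; original order is kept within each group."""
    return sorted(actions, key=lambda a: not a.reversible)


options = [
    ResponseAction("purge corrupted rows", reversible=False, may_cause_impact=True),
    ResponseAction("roll back to previous revision", reversible=True, may_cause_impact=False),
]
for action in preferred_order(options):
    gate = "needs IC approval" if needs_commander_approval(action) else "ok to proceed"
    print(f"{action.description}: {gate}")
```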