DOMAIN:INCIDENT_RESPONSE:PATTERN_ANALYSIS

OWNER: mira
UPDATED: 2026-03-24
SCOPE: cross-client incident pattern detection, platform-wide root cause analysis
AGENTS: mira (primary), nessa (performance patterns)


PATTERNS:INCIDENT_DATABASE_STRUCTURE

PURPOSE: enable pattern detection across incidents, clients, and time periods

SCHEMA

TABLE: incidents
- incident_id (PK): INC-YYYY-NNNN
- detected_at: timestamp
- resolved_at: timestamp
- severity: enum(SEV1, SEV2, SEV3, SEV4)
- classification: enum(outage, degradation, security, data, config, performance)
- root_cause_category: enum (see below)
- error_fingerprint: text (hash of error signature)
- affected_clients: jsonb (array of client IDs)
- affected_services: jsonb (array of service names)
- affected_components: jsonb (array: frontend/backend/database/redis/infra)
- platform_pattern: boolean (true if 2+ clients affected by same root cause)
- resolution_type: enum(rollback, hotfix, config_change, data_repair, external_resolution)
- time_to_detect: interval
- time_to_mitigate: interval
- time_to_resolve: interval
- action_items: jsonb
- post_mortem_id: FK to post_mortems

ROOT_CAUSE_CATEGORIES:
- code_defect: bug in application code
- config_error: misconfiguration (env vars, feature flags, routing)
- dependency_failure: external service/API outage
- infrastructure: k8s, DNS, certificates, networking
- data_integrity: corrupt/missing/inconsistent data
- performance: resource exhaustion, slow queries, memory leaks
- security: unauthorized access, vulnerability exploitation
- human_error: manual action caused issue
- unknown: root cause not determined

INDEXES_FOR_PATTERN_QUERIES

INDEX: (error_fingerprint) — dedup and frequency analysis
INDEX: (root_cause_category, detected_at) — trend analysis by cause
INDEX: (classification, severity) — severity distribution
INDEX: (affected_services, detected_at) — per-service reliability
INDEX: (platform_pattern, detected_at) — platform issues over time


PATTERNS:ERROR_FINGERPRINTING

PURPOSE: deduplicate errors, detect recurring issues, track fix effectiveness

FINGERPRINT_ALGORITHM

INPUT: error type + normalized message + stack trace top 3 frames
NORMALIZATION:
1. Strip variable data: IDs, timestamps, UUIDs, IP addresses, session tokens
2. Strip line numbers (code changes shift them)
3. Keep: error class/type, function names, file paths (relative)
4. Hash with SHA-256, store first 16 chars as fingerprint

EXAMPLE:

RAW: "TypeError: Cannot read property 'email' of null at UserService.getProfile (user-service.ts:47) at async handler (route.ts:12)"
NORMALIZED: "TypeError:Cannot read property of null|UserService.getProfile|user-service.ts|handler|route.ts"
FINGERPRINT: "a3f2b891c4e7d012"
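A minimal Python sketch of the algorithm above. The exact regexes for stripping variable data are assumptions (quoted values like 'email' are treated as variable data so property names don't split fingerprints); order matters, since UUIDs and IPs must be stripped before the bare-number rule.

```python
import hashlib
import re

# Variable data that must not influence the fingerprint (patterns are assumptions).
_VARIABLE_DATA = [
    re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
               r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"),   # UUIDs
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),      # IPv4 addresses
    re.compile(r"'[^']*'"),                          # quoted values ('email', tokens)
    re.compile(r":\d+"),                             # line numbers (user-service.ts:47)
    re.compile(r"\b\d+\b"),                          # bare numeric IDs and timestamps
]

def fingerprint(error_type: str, message: str, frames: list[tuple[str, str]]) -> str:
    """frames: up to three (function_name, relative_file_path) pairs
    from the top of the stack trace."""
    text = message
    for pattern in _VARIABLE_DATA:
        text = pattern.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace left by stripping
    parts = [f"{error_type}:{text}"]
    for func, path in frames[:3]:
        parts += [func, path]
    # SHA-256 of the normalized signature, first 16 hex chars as the fingerprint
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]
```

Two errors that differ only in variable data (the property name, a line number) collapse to the same fingerprint, which is what the dedup rules below rely on.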

RULE: same fingerprint = same root cause (with high confidence)
RULE: different fingerprint but same root_cause_category = related but distinct issues
RULE: fingerprint reappearing after fix = fix was incomplete (escalate)

DEDUP_LOGIC

IF: new incident has same error_fingerprint as recent incident (< 7 days) THEN:
- CHECK: was the previous incident resolved?
- IF: yes THEN: flag as regression — fix didn't hold
- IF: no THEN: merge into existing incident — same ongoing issue

IF: new incident has same fingerprint as old incident (> 7 days) THEN:
- CREATE new incident but link to previous
- FLAG for pattern analysis — recurring issue needs systemic fix


PATTERNS:PLATFORM_PATTERN_IDENTIFICATION

PURPOSE: detect when multiple clients hit the same root cause

DETECTION_RULES

RULE: 2+ clients with same error_fingerprint within 24 hours = platform pattern
RULE: 2+ clients with same root_cause_category + same affected_service within 48 hours = probable platform pattern
RULE: 3+ incidents of same classification in same service within 1 week = service reliability pattern
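The first rule can be sketched as a sliding-window check over incidents grouped by fingerprint (the incident dict shape is an assumption; a single incident that already lists 2+ clients also qualifies):

```python
from collections import defaultdict
from datetime import datetime, timedelta  # detected_at timestamps are datetimes

def platform_patterns(incidents: list[dict],
                      window: timedelta = timedelta(hours=24)) -> set[str]:
    """Return fingerprints where 2+ distinct clients were hit within `window`."""
    by_fp = defaultdict(list)
    for inc in incidents:
        by_fp[inc["error_fingerprint"]].append(inc)
    flagged = set()
    for fp, group in by_fp.items():
        group.sort(key=lambda i: i["detected_at"])
        for i, first in enumerate(group):
            clients = set(first["affected_clients"])
            # accumulate clients from incidents inside the window
            for later in group[i + 1:]:
                if later["detected_at"] - first["detected_at"] > window:
                    break
                clients |= set(later["affected_clients"])
            if len(clients) >= 2:
                flagged.add(fp)
                break
    return flagged
```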

ANALYSIS_WORKFLOW

ON_PLATFORM_PATTERN_DETECTED:
1. ESCALATE to SEV2 minimum (platform patterns potentially affect all clients)
2. IDENTIFY all clients that could be affected (not just those who reported)
3. PROACTIVE CHECK: test other clients for same issue
4. ROOT CAUSE: must be in shared infrastructure/code, not client-specific config
5. FIX SCOPE: fix must be applied platform-wide, not per-client

PATTERN_CATEGORIES

CATEGORY: deployment_regression
IF: incidents cluster within 2 hours after a deployment THEN: the deployment is the likely cause
ACTION: correlate incident timestamps with deployment history
ACTION: identify which change caused the regression
ACTION: add regression test covering this case
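The first action (correlating incident timestamps with deployment history) can be sketched as follows; the dict shapes and the 2-hour default are taken from the rule above, everything else is an assumption:

```python
from datetime import datetime, timedelta  # timestamps are datetimes

def correlate_with_deploys(incidents: list[dict], deployments: list[dict],
                           window: timedelta = timedelta(hours=2)) -> dict[str, str]:
    """Map each incident to the most recent deployment within `window` before it."""
    suspects = {}
    for inc in incidents:
        prior = [d for d in deployments
                 if timedelta(0) <= inc["detected_at"] - d["deployed_at"] <= window]
        if prior:
            latest = max(prior, key=lambda d: d["deployed_at"])
            suspects[inc["incident_id"]] = latest["deploy_id"]
    return suspects
```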

CATEGORY: load_threshold
IF: incidents correlate with traffic peaks THEN: capacity issue
ACTION: identify the bottleneck (CPU, memory, connections, queries)
ACTION: set scaling thresholds below the failure point
ACTION: add load test covering this traffic level

CATEGORY: dependency_cascade
IF: incidents across services start within minutes of each other THEN: cascade failure
ACTION: identify the root dependency that failed
ACTION: add circuit breakers and fallbacks
ACTION: add health checks for the dependency

CATEGORY: configuration_drift
IF: same issue appears in different clients with different timing THEN: config drift
ACTION: audit all client configurations for the relevant setting
ACTION: add config validation at startup
ACTION: standardize the configuration

CATEGORY: temporal_pattern
IF: incidents correlate with time of day/week/month THEN: temporal trigger
ACTION: identify what happens at that time (cron jobs, backups, traffic patterns)
ACTION: adjust scheduled operations or scale preemptively
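One way to sketch temporal correlation is to bucket incidents by hour of day and flag concentration; the 50% threshold here is an assumption for illustration, not policy:

```python
from collections import Counter
from datetime import datetime  # detected_at timestamps are datetimes

def hourly_concentration(incidents: list[dict], threshold: float = 0.5):
    """Return the hour of day that accounts for more than `threshold`
    of incidents, or None if no single hour dominates."""
    hours = Counter(inc["detected_at"].hour for inc in incidents)
    total = sum(hours.values())
    if total == 0:
        return None
    hour, count = hours.most_common(1)[0]
    return hour if count / total > threshold else None
```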


PATTERNS:FEEDING_BACK_TO_DEVELOPMENT

PURPOSE: incidents inform specs, tests, and release gates — closing the loop

FEEDBACK_FLOW

INCIDENT → POST_MORTEM → ACTION_ITEMS → {spec_update, test_addition, gate_addition}

ON_RECURRING_PATTERN (same root_cause_category, 3+ times):
1. CREATE evolution entry: ge-ops/bin/evo-new.sh 'Pattern: <description>' high incident-response
2. UPDATE spec template to include check for this failure mode
3. ADD test case to regression suite covering this failure mode
4. IF: detectable pre-release THEN: add to release gate checklist
5. ADD to wiki brain as pitfall: ge-ops/wiki/docs/development/pitfalls/

SPEC_INTEGRATION

RULE: every post-mortem action item tagged "spec" must result in updated acceptance criteria
FORMAT: "GIVEN [precondition from incident] WHEN [trigger from incident] THEN [expected behavior, not the bug]"

TEST_INTEGRATION

RULE: every post-mortem action item tagged "test" must result in automated test
TEST_TYPES:
- Unit test: for code-level defects
- Integration test: for service interaction defects
- Load test scenario: for performance/capacity defects
- Chaos test: for infrastructure resilience defects

RELEASE_GATE_INTEGRATION

RULE: every post-mortem action item tagged "gate" must result in CI/CD check
GATE_TYPES:
- Pre-deploy: lint rule, static analysis rule, test assertion
- Post-deploy: smoke test, canary metric threshold
- Pre-release: performance budget check, security scan


PATTERNS:ISO_27001_A527_EVIDENCE

STANDARD: ISO 27001:2022 Annex A Control 5.27 — Learning from Information Security Incidents
REQUIREMENT: organization shall use knowledge gained from incidents to strengthen controls

EVIDENCE_FORMAT

DOCUMENT: Incident Learning Report (quarterly)
CONTENTS:
1. INCIDENT_SUMMARY: count by severity, classification, root_cause_category
2. TREND_ANALYSIS: comparison with previous quarter
3. PATTERN_IDENTIFICATION: recurring issues, platform patterns
4. CONTROL_IMPROVEMENTS: what was changed as a result of incidents
5. EFFECTIVENESS_MEASUREMENT: did the changes reduce recurrence?
6. RESIDUAL_RISK: known patterns without fixes yet (with justification)
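The INCIDENT_SUMMARY aggregation (item 1) reduces to counting over the structured incident records; a minimal sketch, assuming the incident dicts carry the schema fields above:

```python
from collections import Counter

def incident_summary(incidents: list[dict]) -> dict[str, Counter]:
    """Counts by severity, classification, and root_cause_category
    for the quarterly Incident Learning Report."""
    return {
        "by_severity": Counter(i["severity"] for i in incidents),
        "by_classification": Counter(i["classification"] for i in incidents),
        "by_root_cause": Counter(i["root_cause_category"] for i in incidents),
    }
```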

STORAGE: PostgreSQL (structured data) + wiki (human-readable quarterly report)

REQUIRED_EVIDENCE_PER_INCIDENT (SEV1/SEV2)

  • Incident record with all required fields (see index.md)
  • Post-mortem document
  • Action items with completion tracking
  • Before/after evidence for control improvements
  • Test results proving the fix

AUDIT_TRAIL

RULE: incident records are immutable once closed (append-only updates)
RULE: all incident actions are timestamped and attributed
RULE: post-mortem action items are tracked to completion
RULE: quarterly learning report is produced and reviewed by Dirk-Jan

ANTI_PATTERN: closing incidents without action items
FIX: every SEV1/SEV2 must have at least one systemic action item

ANTI_PATTERN: action items that stay open indefinitely
FIX: action items have deadlines, overdue items escalate automatically
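The overdue-escalation fix can be sketched as a simple filter over tracked action items (the dict shape with deadline / completed_at is an assumption):

```python
from datetime import date

def overdue_items(action_items: list[dict], today: date) -> list[dict]:
    """Open action items past their deadline; these escalate automatically."""
    return [a for a in action_items
            if a["completed_at"] is None and a["deadline"] < today]
```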