DOMAIN:MONITORING — ALERTING

OWNER: eltjo, annegreet
ALSO_USED_BY: ron, mira, nessa
UPDATED: 2026-03-26
SCOPE: alert severity, escalation chains, on-call patterns, alert fatigue prevention, SLO-based alerting, incident correlation


ALERTING:CORE_PRINCIPLE

PURPOSE: notify the right people about the right problems at the right time with enough context to act

RULE: every alert MUST be actionable — if you cannot act on it, it is a metric, not an alert
RULE: every alert MUST have a severity classification
RULE: every alert MUST link to a runbook or remediation steps
RULE: alerts target symptoms (user impact), not causes (internal metrics)

CHECK: did the last 10 alerts each result in a human taking action?
IF: < 80% actionable THEN: alert quality is poor — review and prune
IF: >= 80% actionable THEN: alert quality is healthy
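
The actionability check above can be sketched as a small review helper. The 80% threshold and the "last 10 alerts" window restate the rule in this section; the record field name `human_action_taken` is an assumption.

```python
# Sketch of the actionability check: did the last N alerts lead to action?
def actionable_ratio(recent_alerts):
    """Fraction of recent alerts that resulted in a human taking action."""
    if not recent_alerts:
        return 1.0  # no alerts fired: nothing to prune
    acted = sum(1 for a in recent_alerts if a.get("human_action_taken"))
    return acted / len(recent_alerts)

def alert_quality_verdict(recent_alerts, threshold=0.8):
    if actionable_ratio(recent_alerts) < threshold:
        return "poor: review and prune"
    return "healthy"
```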


ALERTING:SEVERITY_LEVELS

SEVERITY_DEFINITIONS

PURPOSE: classify alerts so that response urgency matches impact

SEVERITY_MATRIX:

CRITICAL (P1):  
  IMPACT: system-wide outage, data loss risk, security breach  
  RESPONSE: immediate, wake people up, all hands  
  NOTIFICATION: page on-call + backup + mira escalation  
  SLA: acknowledge within 5 minutes, resolve within 1 hour  
  EXAMPLES:  
    - PostgreSQL down (SSOT unavailable)  
    - Redis down (all task dispatch halted)  
    - secret exposed in logs  
    - cost_gate bypassed  
    - network policy removed  

HIGH (P2):  
  IMPACT: partial outage, degraded service, single critical component down  
  RESPONSE: within 15 minutes during business hours, within 1 hour off-hours  
  NOTIFICATION: notify ron + mira, create incident  
  SLA: acknowledge within 15 minutes, resolve within 4 hours  
  EXAMPLES:  
    - orchestrator single replica down (HA degraded)  
    - executor crashlooping (> 10 restarts/hour)  
    - Redis memory > 80%  
    - stream length exceeds MAXLEN  
    - agent status changed unexpectedly  

MEDIUM (P3):  
  IMPACT: non-critical degradation, performance issue, configuration concern  
  RESPONSE: during next business hours  
  NOTIFICATION: daily digest to ron, no page  
  SLA: acknowledge within 4 hours, resolve within 24 hours  
  EXAMPLES:  
    - resource limits differ from manifest  
    - CronJob schedule changed  
    - certificate expiry < 30 days  
    - error budget burn rate > 2x  
    - secret rotation overdue  

LOW (P4):  
  IMPACT: informational, cosmetic, optimization opportunity  
  RESPONSE: at convenience, batch with other work  
  NOTIFICATION: weekly summary only  
  SLA: acknowledge within 24 hours, resolve within 1 week  
  EXAMPLES:  
    - label/annotation drift  
    - log level changed  
    - metadata field mismatch  
    - stale learning detected  

RULE: when in doubt, start at higher severity — you can always downgrade
RULE: severity can be auto-escalated if not acknowledged within SLA
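
The severity matrix above can be encoded as a machine-readable table, which also makes the auto-escalation rule checkable. This is a sketch; the minute values restate the SLAs defined in this section, everything else is an assumption.

```python
# Severity matrix as data: priority label plus acknowledge/resolve SLAs.
SEVERITY = {
    "CRITICAL": {"priority": "P1", "ack_sla_min": 5,       "resolve_sla_min": 60},
    "HIGH":     {"priority": "P2", "ack_sla_min": 15,      "resolve_sla_min": 4 * 60},
    "MEDIUM":   {"priority": "P3", "ack_sla_min": 4 * 60,  "resolve_sla_min": 24 * 60},
    "LOW":      {"priority": "P4", "ack_sla_min": 24 * 60, "resolve_sla_min": 7 * 24 * 60},
}

def ack_overdue(severity, minutes_since_fired, acknowledged):
    """True when the acknowledge SLA has lapsed -- trigger auto-escalation."""
    return (not acknowledged
            and minutes_since_fired > SEVERITY[severity]["ack_sla_min"])
```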


ALERTING:ESCALATION_CHAINS

GE_ESCALATION_PATH

PURPOSE: define who gets notified and in what order

ESCALATION_CHAIN:

TIER 1 — Automated Response:  
  WHO: ge-orchestrator health loop, cost_gate.py  
  WHAT: auto-remediate LOW, suppress known false positives  
  TIMEOUT: immediate  

TIER 2 — System Monitor:  
  WHO: ron (System Integrity Monitor)  
  WHAT: investigate, classify, remediate or escalate  
  TIMEOUT: 15 minutes for CRITICAL, 1 hour for HIGH  

TIER 3 — Escalation Manager:  
  WHO: mira (Escalation Manager)  
  WHAT: coordinate response, notify human if needed  
  TIMEOUT: 30 minutes for CRITICAL, 4 hours for HIGH  

TIER 4 — Human:  
  WHO: Dirk-Jan (system owner)  
  WHAT: final decision authority, approve risky remediations  
  TIMEOUT: varies (human availability)  

RULE: CRITICAL alerts skip Tier 1 and go directly to Tier 2 + Tier 3 simultaneously
RULE: escalation happens automatically on timeout — do not wait for manual escalation
RULE: every escalation includes full context (what, when, impact, attempted fixes)
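
The automatic-escalation rule above can be sketched as a timeout check: on SLA lapse the alert moves to the next tier with no manual step. Tier numbers and timeouts restate this section; the function shape is an assumption.

```python
# Per-tier acknowledgment timeouts in minutes, from the chain above.
TIMEOUT_MIN = {
    "CRITICAL": {2: 15, 3: 30},   # CRITICAL skips Tier 1 entirely
    "HIGH":     {2: 60, 3: 240},
}

def next_tier(severity, current_tier, minutes_unacknowledged):
    """Escalate automatically when the current tier's timeout lapses."""
    timeout = TIMEOUT_MIN.get(severity, {}).get(current_tier)
    if timeout is not None and minutes_unacknowledged >= timeout:
        return current_tier + 1  # escalate, carrying full context with it
    return current_tier
```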

CHECK: is the escalation chain tested regularly?
IF: never tested THEN: it may be broken when you need it — test monthly

ANTI_PATTERN: all alerts go to the same person regardless of severity
FIX: tiered escalation ensures the right expertise and authority level

ANTI_PATTERN: escalation requires manual intervention to trigger
FIX: automatic escalation on SLA timeout — humans forget under pressure

ESCALATION_VIA_DISCUSSIONS

TOOL: admin-ui Discussions API
RUN: curl -X POST http://admin-ui.ge-system.svc.cluster.local/api/discussions \
  -H "Authorization: Bearer $INTERNAL_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title": "CRITICAL: Redis memory > 90%", "severity": "critical", "initiator": "ron", "participants": ["mira"], "context": {...}}'

RULE: CRITICAL and HIGH alerts create a discussion for audit trail
RULE: discussion includes: symptom, impact assessment, attempted remediation, recommended action
RULE: human approval required before destructive remediation (rollback, scale-down, halt)


ALERTING:ON_CALL_PATTERNS

AGENT_ON_CALL_ROTATION

PURPOSE: define which monitoring agent handles alerts at any given time

GE_ON_CALL_MODEL:

PRIMARY: ron (System Integrity Monitor) — always on-call for system checks  
SECONDARY: annegreet (Knowledge Curator) — on-call for learning pipeline issues  
ESCALATION: mira (Escalation Manager) — handles all escalations  
OBSERVER: eltjo (Cross-Session Analyst) — monitors patterns, not incidents  
INCIDENT: nessa (Incident Response) — activated for P1 incidents  

RULE: primary on-call handles all initial alerts
RULE: secondary on-call handles domain-specific alerts only
RULE: escalation manager is always available as backup

CHECK: is the on-call agent currently active (status != unavailable)?
IF: unavailable THEN: alerts have no handler — escalate to next tier immediately

HANDOFF_PROTOCOL

RULE: when an agent goes to maintenance, transfer active incidents to next in chain
RULE: handoff includes: active incidents, recent alerts, context for open investigations
RULE: log handoff to session_learnings for audit


ALERTING:ALERT_FATIGUE_PREVENTION

DEDUPLICATION

PURPOSE: prevent the same condition from generating multiple alerts

RULE: same check failing on consecutive runs = 1 alert, not N alerts
RULE: dedup key = check_name + resource_identifier
RULE: dedup window = 2x the check interval

TECHNIQUE:

1. alert fires: check dedup store for matching key  
2. if key exists and within window: suppress (increment count only)  
3. if key missing or window expired: emit alert, store key with TTL  
4. when condition resolves: emit resolution alert, clear key  

ANTI_PATTERN: every health check cycle produces a new alert for the same failure
FIX: dedup — one alert per condition, with a count of consecutive failures
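
The four-step technique above can be sketched with an in-memory TTL store. In production the store would typically be something like Redis; this stand-in only illustrates the key, window, and count semantics defined above.

```python
import time

class AlertDeduper:
    """One alert per condition; consecutive failures only bump a counter."""

    def __init__(self, check_interval_s):
        self.window_s = 2 * check_interval_s   # dedup window = 2x check interval
        self._store = {}                       # dedup_key -> (expires_at, count)

    def should_emit(self, check_name, resource_id, now=None):
        now = time.time() if now is None else now
        key = f"{check_name}:{resource_id}"    # dedup key per the rule above
        expires_at, count = self._store.get(key, (0.0, 0))
        if now < expires_at:
            # key exists and within window: suppress, increment count only
            self._store[key] = (expires_at, count + 1)
            return False
        # key missing or window expired: emit, store key with fresh TTL
        self._store[key] = (now + self.window_s, 1)
        return True

    def resolve(self, check_name, resource_id):
        """Condition cleared: emit a resolution alert upstream, clear the key."""
        self._store.pop(f"{check_name}:{resource_id}", None)
```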

FLAPPING_DETECTION

PURPOSE: detect and suppress alerts that fire and resolve repeatedly

RULE: if alert fires + resolves > 3 times in 1 hour, it is flapping
RULE: flapping alerts are suppressed and converted to a single "flapping" alert
RULE: investigate root cause of flapping — usually a threshold too close to normal value

CHECK: any alerts that fired and resolved > 3 times today?
IF: yes THEN: flapping — adjust threshold or add hysteresis

TECHNIQUE: hysteresis (different thresholds for fire and resolve)

FIRE when: redis_memory_percent > 80  
RESOLVE when: redis_memory_percent < 70  
(10% gap prevents flapping around the threshold)  
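
The hysteresis technique above can be sketched as a tiny state machine: fire above 80, resolve only below 70, so a value oscillating near 80 cannot flap. The thresholds are the ones from this section.

```python
class HysteresisAlert:
    """Separate fire/resolve thresholds; the gap between them kills flapping."""

    def __init__(self, fire_above=80.0, resolve_below=70.0):
        assert resolve_below < fire_above  # the gap is the whole point
        self.fire_above = fire_above
        self.resolve_below = resolve_below
        self.firing = False

    def observe(self, value):
        """Returns 'fire', 'resolve', or None for each new sample."""
        if not self.firing and value > self.fire_above:
            self.firing = True
            return "fire"
        if self.firing and value < self.resolve_below:
            self.firing = False
            return "resolve"
        return None
```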

ALERT_CORRELATION

PURPOSE: group related alerts into a single incident

RULE: if Redis goes down, suppress all alerts that depend on Redis
RULE: correlation rule: if infrastructure alert fires, suppress downstream application alerts
RULE: present correlated alerts as one incident with a root cause

CORRELATION_RULES:

IF: redis_down THEN: suppress stream_length, consumer_lag, task_dispatch alerts  
IF: postgres_down THEN: suppress registry_sync, admin_ui_health, team_routing alerts  
IF: node_disk_full THEN: suppress all pod alerts on that node  
IF: network_policy_removed THEN: suppress specific connectivity alerts  

ANTI_PATTERN: 20 alerts fire simultaneously because Redis is down
FIX: root cause correlation — emit 1 alert for Redis, suppress the 19 symptoms
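
The correlation rules above can be sketched as a suppression map: when a root-cause alert is active, its downstream symptoms collapse into one incident. The rule table restates this section; the node-scoped and policy-scoped rules are simplified away here.

```python
# Root-cause alert -> downstream alerts it suppresses (from the rules above).
SUPPRESSES = {
    "redis_down":    {"stream_length", "consumer_lag", "task_dispatch"},
    "postgres_down": {"registry_sync", "admin_ui_health", "team_routing"},
}

def correlate(active_alerts):
    """Split active alerts into root causes and suppressed symptoms."""
    suppressed = set()
    for root in active_alerts:
        if root in SUPPRESSES:
            suppressed |= SUPPRESSES[root] & set(active_alerts)
    roots = [a for a in active_alerts if a not in suppressed]
    return roots, sorted(suppressed)
```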

COOLDOWN_PERIODS

RULE: after an alert fires, minimum 15 minutes before it can re-fire
RULE: after an incident is resolved, minimum 30 minutes cooldown
RULE: cooldown prevents alert storms during recovery


ALERTING:SLO_BASED_ALERTING

ERROR_BUDGET_BURN_RATE

PURPOSE: alert based on how fast the error budget is being consumed, not on individual failures

FORMULA:

burn_rate = (error_rate_in_window / (1 - SLO_target))  

Example (99.5% SLO):  
  if error_rate_in_window = 1%:  
  burn_rate = 0.01 / 0.005 = 2x  
  meaning: budget will exhaust in 15 days instead of 30  
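
The formula and worked example above, as code. Days-to-exhaustion assumes a 30-day error-budget window, as in the example.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1.0 - slo_target)

def days_to_exhaustion(rate, window_days=30):
    """At a steady burn rate, when does the window's budget run out?"""
    return window_days / rate

# 99.5% SLO with a 1% error rate: 0.01 / 0.005 = 2x -> exhausts in 15 days
```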

ALERT_THRESHOLDS:

FAST_BURN (CRITICAL):  
  window: 5 minutes  
  burn_rate: > 14.4x (budget exhausts in ~2 days)  
  action: page immediately  

MEDIUM_BURN (HIGH):  
  window: 30 minutes  
  burn_rate: > 6x (budget exhausts in ~5 days)  
  action: alert during business hours  

SLOW_BURN (MEDIUM):  
  window: 6 hours  
  burn_rate: > 2x (budget exhausts in ~15 days)  
  action: daily digest, prioritize reliability work  

NORMAL:  
  burn_rate: <= 1x  
  action: no alert, budget on track  

CHECK: what is the current burn rate?
IF: > 14.4x THEN: CRITICAL — budget exhausting rapidly, likely active incident
IF: > 6x THEN: HIGH — budget under pressure, investigate
IF: > 2x THEN: MEDIUM — trend is concerning, plan reliability work
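
The threshold table and check above can be sketched as a single classifier; the cut-offs restate this section.

```python
def classify_burn(rate):
    """Map a measured burn rate to the alert thresholds above."""
    if rate > 14.4:
        return "CRITICAL"  # fast burn: budget gone in ~2 days, page now
    if rate > 6:
        return "HIGH"      # medium burn: ~5 days, business-hours alert
    if rate > 2:
        return "MEDIUM"    # slow burn: ~15 days, daily digest
    return None            # at or below 2x: no alert
```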

MULTI_WINDOW_VALIDATION

PURPOSE: reduce false positives by requiring the condition in both short and long windows

TECHNIQUE:

alert fires ONLY when BOTH conditions are true:  
  1. short_window (5 min) burn_rate exceeds threshold  
  2. long_window (1 hour) burn_rate exceeds threshold * 0.5  

This ensures:  
  - short spikes alone do not page (could be transient)  
  - sustained degradation does page (even at lower rate)  

ANTI_PATTERN: alerting on 1-minute error rate spikes
FIX: multi-window validation filters transient spikes from real problems
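
The multi-window rule above can be sketched in a few lines: page only when both the short and long windows confirm the problem. The 0.5 long-window factor restates this section; the default threshold reuses the fast-burn value as an illustration.

```python
def should_page(short_burn_rate, long_burn_rate, threshold=14.4):
    """Fire only when BOTH windows agree, per the multi-window rule."""
    short_ok = short_burn_rate > threshold        # 5-minute window: spike
    long_ok = long_burn_rate > threshold * 0.5    # 1-hour window: sustained
    return short_ok and long_ok
```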


ALERTING:RUNBOOK_LINKED_ALERTS

RUNBOOK_STRUCTURE

PURPOSE: every alert links to a runbook that tells the responder what to do

REQUIRED_RUNBOOK_SECTIONS:

1. SYMPTOM: what the alert means in plain language  
2. IMPACT: what is affected and how severely  
3. INVESTIGATION: steps to diagnose root cause  
4. REMEDIATION: steps to fix the problem  
5. ESCALATION: when and how to escalate  
6. VERIFICATION: how to confirm the fix worked  
7. PREVENTION: what to change to prevent recurrence  

RULE: every CRITICAL and HIGH alert MUST have a runbook
RULE: runbooks live in wiki at wiki/docs/development/runbooks/
RULE: update runbooks after every incident — incorporate lessons learned

CHECK: does the alert include a link to its runbook?
IF: no THEN: responder wastes time figuring out what to do — add the link
IF: yes THEN: responder can start remediation immediately

ANTI_PATTERN: runbooks that say "investigate and fix"
FIX: runbooks must have specific, step-by-step instructions

ANTI_PATTERN: runbooks that are never updated
FIX: post-incident review MUST include runbook update if steps were wrong or missing


ALERTING:INCIDENT_CORRELATION

INCIDENT_LIFECYCLE

PURPOSE: track alerts from detection through resolution

LIFECYCLE:

1. DETECTED:  alert fires, incident created  
2. ACKNOWLEDGED: responder claims the incident  
3. INVESTIGATING: responder is diagnosing root cause  
4. MITIGATING: responder is applying fix  
5. RESOLVED: fix confirmed, alert cleared  
6. POST_MORTEM: root cause documented, prevention planned  
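
The lifecycle above can be sketched as a transition table; rejecting illegal jumps is one way to ensure an incident cannot be closed without passing through every state. This is an assumption about enforcement, not a description of an existing implementation.

```python
# Legal lifecycle transitions, in order, from the list above.
NEXT = {
    "DETECTED": "ACKNOWLEDGED",
    "ACKNOWLEDGED": "INVESTIGATING",
    "INVESTIGATING": "MITIGATING",
    "MITIGATING": "RESOLVED",
    "RESOLVED": "POST_MORTEM",
}

def advance(state):
    """Move an incident to its next state; POST_MORTEM is terminal."""
    if state not in NEXT:
        raise ValueError(f"{state} is terminal or unknown")
    return NEXT[state]
```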

RULE: every P1 and P2 incident completes the full lifecycle
RULE: every P1 incident gets a post-mortem within 48 hours
RULE: post-mortem is blameless — focus on system improvements
RULE: post-mortem findings feed back to learning pipeline (session_learnings)

CHECK: are incidents being closed without post-mortem?
IF: yes THEN: learning opportunity missed — enforce post-mortem for P1/P2

TIMELINE_RECONSTRUCTION

PURPOSE: build a complete timeline of what happened during an incident

TECHNIQUE:

1. collect all alerts that fired within the incident window  
2. collect all logs with matching correlation_ids  
3. collect all metrics anomalies in the incident window  
4. order chronologically  
5. identify: triggering event → cascade → detection → response → resolution  

RULE: timeline includes both automated events (alerts, auto-remediation) and human actions
RULE: timeline is stored with the incident record for future reference
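
The five steps above can be sketched as a merge of the three event streams into one chronological record. The event shape (`{"ts": ...}`) and stream names are assumptions for illustration.

```python
def build_timeline(alerts, logs, anomalies, window_start, window_end):
    """Merge alerts, correlated logs, and metric anomalies chronologically."""
    events = []
    for source, stream in (("alert", alerts), ("log", logs), ("metric", anomalies)):
        for e in stream:
            if window_start <= e["ts"] <= window_end:
                events.append({"ts": e["ts"], "source": source, "event": e})
    # Order chronologically; a human then walks the result to label
    # triggering event -> cascade -> detection -> response -> resolution.
    return sorted(events, key=lambda e: e["ts"])
```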

ANTI_PATTERN: post-mortem based on memory instead of data
FIX: reconstruct timeline from telemetry — memory is unreliable during incidents