DOMAIN:MONITORING — ALERTING¶
OWNER: eltjo, annegreet
ALSO_USED_BY: ron, mira, nessa
UPDATED: 2026-03-26
SCOPE: alert severity, escalation chains, on-call patterns, alert fatigue prevention, SLO-based alerting, incident correlation
ALERTING:CORE_PRINCIPLE¶
PURPOSE: notify the right people about the right problems at the right time with enough context to act
RULE: every alert MUST be actionable — if you cannot act on it, it is a metric, not an alert
RULE: every alert MUST have a severity classification
RULE: every alert MUST link to a runbook or remediation steps
RULE: alerts target symptoms (user impact), not causes (internal metrics)
CHECK: did the last 10 alerts each result in a human taking action?
IF: < 80% actionable THEN: alert quality is poor — review and prune
IF: >= 80% actionable THEN: alert quality is healthy
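The actionability check above can be sketched as a small review helper. The alert records and the `human_action_taken` field are illustrative assumptions, not an existing schema; only the last-10 window and 80% threshold come from the rules above.

```python
# Sketch of the actionability review: given recent alerts, compute the
# fraction of the last 10 that led to a human action and classify alert
# quality per the 80% threshold. Field names are illustrative.

def alert_quality(alerts: list[dict], threshold: float = 0.8) -> str:
    """Return 'healthy' or 'poor' based on the actionable fraction."""
    recent = alerts[-10:]  # last 10 alerts
    if not recent:
        return "healthy"  # nothing to judge yet
    actionable = sum(1 for a in recent if a.get("human_action_taken"))
    return "healthy" if actionable / len(recent) >= threshold else "poor"
```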
ALERTING:SEVERITY_LEVELS¶
SEVERITY_DEFINITIONS¶
PURPOSE: classify alerts so that response urgency matches impact
SEVERITY_MATRIX:
CRITICAL (P1):
IMPACT: system-wide outage, data loss risk, security breach
RESPONSE: immediate, wake people up, all hands
NOTIFICATION: page on-call + backup + mira escalation
SLA: acknowledge within 5 minutes, resolve within 1 hour
EXAMPLES:
- PostgreSQL down (SSOT unavailable)
- Redis down (all task dispatch halted)
- secret exposed in logs
- cost_gate bypassed
- network policy removed
HIGH (P2):
IMPACT: partial outage, degraded service, single critical component down
RESPONSE: within 15 minutes during business hours, within 1 hour off-hours
NOTIFICATION: notify ron + mira, create incident
SLA: acknowledge within 15 minutes, resolve within 4 hours
EXAMPLES:
- orchestrator single replica down (HA degraded)
- executor crashlooping (> 10 restarts/hour)
- Redis memory > 80%
- stream length exceeds MAXLEN
- agent status changed unexpectedly
MEDIUM (P3):
IMPACT: non-critical degradation, performance issue, configuration concern
RESPONSE: during next business hours
NOTIFICATION: daily digest to ron, no page
SLA: acknowledge within 4 hours, resolve within 24 hours
EXAMPLES:
- resource limits differ from manifest
- CronJob schedule changed
- certificate expiry < 30 days
- error budget burn rate > 2x
- secret rotation overdue
LOW (P4):
IMPACT: informational, cosmetic, optimization opportunity
RESPONSE: at convenience, batch with other work
NOTIFICATION: weekly summary only
SLA: acknowledge within 24 hours, resolve within 1 week
EXAMPLES:
- label/annotation drift
- log level changed
- metadata field mismatch
- stale learning detected
RULE: when in doubt, start at higher severity — you can always downgrade
RULE: severity can be auto-escalated if not acknowledged within SLA
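The SLA columns of the matrix, plus the auto-escalation rule, can be encoded as a small lookup. This is a sketch under the assumption that escalation moves exactly one level per missed ack SLA; the function and table names are illustrative.

```python
# Ack/resolve SLAs per severity level, mirroring the matrix above, and the
# auto-escalation rule: an alert unacknowledged past its ack SLA moves up
# one level (P1 is the ceiling).
from datetime import timedelta

SEVERITY_SLA = {
    "P1": {"ack": timedelta(minutes=5),  "resolve": timedelta(hours=1)},
    "P2": {"ack": timedelta(minutes=15), "resolve": timedelta(hours=4)},
    "P3": {"ack": timedelta(hours=4),    "resolve": timedelta(hours=24)},
    "P4": {"ack": timedelta(hours=24),   "resolve": timedelta(weeks=1)},
}

ESCALATE = {"P4": "P3", "P3": "P2", "P2": "P1", "P1": "P1"}

def maybe_escalate(severity: str, unacked_for: timedelta) -> str:
    """Auto-escalate one level when the ack SLA has been missed."""
    if unacked_for > SEVERITY_SLA[severity]["ack"]:
        return ESCALATE[severity]
    return severity
```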
ALERTING:ESCALATION_CHAINS¶
GE_ESCALATION_PATH¶
PURPOSE: define who gets notified and in what order
ESCALATION_CHAIN:
TIER 1 — Automated Response:
WHO: ge-orchestrator health loop, cost_gate.py
WHAT: auto-remediate LOW, suppress known false positives
TIMEOUT: immediate
TIER 2 — System Monitor:
WHO: ron (System Integrity Monitor)
WHAT: investigate, classify, remediate or escalate
TIMEOUT: 15 minutes for CRITICAL, 1 hour for HIGH
TIER 3 — Escalation Manager:
WHO: mira (Escalation Manager)
WHAT: coordinate response, notify human if needed
TIMEOUT: 30 minutes for CRITICAL, 4 hours for HIGH
TIER 4 — Human:
WHO: Dirk-Jan (system owner)
WHAT: final decision authority, approve risky remediations
TIMEOUT: varies (human availability)
RULE: CRITICAL alerts skip Tier 1 and go directly to Tier 2 + Tier 3 simultaneously
RULE: escalation happens automatically on timeout — do not wait for manual escalation
RULE: every escalation includes full context (what, when, impact, attempted fixes)
CHECK: is the escalation chain tested regularly?
IF: never tested THEN: it may be broken when you need it — test monthly
ANTI_PATTERN: all alerts go to the same person regardless of severity
FIX: tiered escalation ensures the right expertise and authority level
ANTI_PATTERN: escalation requires manual intervention to trigger
FIX: automatic escalation on SLA timeout — humans forget under pressure
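The timeout-driven escalation for the CRITICAL column can be sketched as follows. This is a simplification limited to the CRITICAL timeouts above (15 and 30 minutes); the function name and the set-based representation are illustrative, not the real notification path.

```python
# Which tiers should have been notified by now, given how long a CRITICAL
# alert has gone unacknowledged. CRITICAL skips Tier 1 and pages Tiers 2+3
# simultaneously; each missed tier timeout automatically adds the next tier.

TIER_TIMEOUTS_CRITICAL_MIN = {2: 15, 3: 30}  # minutes before auto-escalating

def tiers_to_notify(severity: str, minutes_unacknowledged: float) -> set[int]:
    """Return the set of tiers that should be engaged at this point."""
    if severity == "CRITICAL":
        notified = {2, 3}  # skip Tier 1, page Tiers 2 and 3 together
    else:
        notified = {1}  # other severities start at automated response
    # escalate automatically once a tier's timeout elapses unacknowledged
    for tier, timeout in sorted(TIER_TIMEOUTS_CRITICAL_MIN.items()):
        if tier in notified and minutes_unacknowledged > timeout:
            notified.add(tier + 1)
    return notified
```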
ESCALATION_VIA_DISCUSSIONS¶
TOOL: admin-ui Discussions API
RUN: curl -X POST http://admin-ui.ge-system.svc.cluster.local/api/discussions \
  -H "Authorization: Bearer $INTERNAL_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title": "CRITICAL: Redis memory > 90%", "severity": "critical", "initiator": "ron", "participants": ["mira"], "context": {...}}'
RULE: CRITICAL and HIGH alerts create a discussion for audit trail
RULE: discussion includes: symptom, impact assessment, attempted remediation, recommended action
RULE: human approval required before destructive remediation (rollback, scale-down, halt)
ALERTING:ON_CALL_PATTERNS¶
AGENT_ON_CALL_ROTATION¶
PURPOSE: define which monitoring agent handles alerts at any given time
GE_ON_CALL_MODEL:
PRIMARY: ron (System Integrity Monitor) — always on-call for system checks
SECONDARY: annegreet (Knowledge Curator) — on-call for learning pipeline issues
ESCALATION: mira (Escalation Manager) — handles all escalations
OBSERVER: eltjo (Cross-Session Analyst) — monitors patterns, not incidents
INCIDENT: nessa (Incident Response) — activated for P1 incidents
RULE: primary on-call handles all initial alerts
RULE: secondary on-call handles domain-specific alerts only
RULE: escalation manager is always available as backup
CHECK: is the on-call agent currently active (status != unavailable)?
IF: unavailable THEN: alerts have no handler — escalate to next tier immediately
HANDOFF_PROTOCOL¶
RULE: when an agent goes to maintenance, transfer active incidents to next in chain
RULE: handoff includes: active incidents, recent alerts, context for open investigations
RULE: log handoff to session_learnings for audit
ALERTING:ALERT_FATIGUE_PREVENTION¶
DEDUPLICATION¶
PURPOSE: prevent the same condition from generating multiple alerts
RULE: same check failing on consecutive runs = 1 alert, not N alerts
RULE: dedup key = check_name + resource_identifier
RULE: dedup window = 2x the check interval
TECHNIQUE:
1. alert fires: check dedup store for matching key
2. if key exists and within window: suppress (increment count only)
3. if key missing or window expired: emit alert, store key with TTL
4. when condition resolves: emit resolution alert, clear key
ANTI_PATTERN: every health check cycle produces a new alert for the same failure
FIX: dedup — one alert per condition, with a count of consecutive failures
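The four-step technique above can be sketched with an in-memory TTL store standing in for a real backing store such as Redis. The class and method names are illustrative; the dedup key (check_name + resource) and windowing behavior follow the rules above.

```python
# Sketch of alert deduplication: one alert per condition within the window,
# with a count of consecutive failures, cleared on resolution.
import time

class Deduplicator:
    def __init__(self, window_seconds: float):
        self.window = window_seconds  # rule of thumb: 2x the check interval
        self.store: dict[str, dict] = {}  # dedup key -> {first_seen, count}

    def should_emit(self, check_name: str, resource: str) -> bool:
        """Steps 1-3: emit on first failure, suppress repeats in-window."""
        key = f"{check_name}:{resource}"
        now = time.monotonic()
        entry = self.store.get(key)
        if entry and now - entry["first_seen"] < self.window:
            entry["count"] += 1  # step 2: suppress, increment count only
            return False
        self.store[key] = {"first_seen": now, "count": 1}  # step 3: emit
        return True

    def resolve(self, check_name: str, resource: str) -> int:
        """Step 4: condition cleared; drop the key, return failure count."""
        entry = self.store.pop(f"{check_name}:{resource}", {"count": 0})
        return entry["count"]
```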
FLAPPING_DETECTION¶
PURPOSE: detect and suppress alerts that fire and resolve repeatedly
RULE: if alert fires + resolves > 3 times in 1 hour, it is flapping
RULE: flapping alerts are suppressed and converted to a single "flapping" alert
RULE: investigate root cause of flapping — usually a threshold set too close to the normal operating value
CHECK: any alerts that fired and resolved > 3 times today?
IF: yes THEN: flapping — adjust threshold or add hysteresis
TECHNIQUE: hysteresis (different thresholds for fire and resolve)
FIRE when: redis_memory_percent > 80
RESOLVE when: redis_memory_percent < 70
(10% gap prevents flapping around the threshold)
ALERT_CORRELATION¶
PURPOSE: group related alerts into a single incident
RULE: if Redis goes down, suppress all alerts that depend on Redis
RULE: if an infrastructure alert fires, suppress its downstream application alerts
RULE: present correlated alerts as one incident with a root cause
CORRELATION_RULES:
IF: redis_down THEN: suppress stream_length, consumer_lag, task_dispatch alerts
IF: postgres_down THEN: suppress registry_sync, admin_ui_health, team_routing alerts
IF: node_disk_full THEN: suppress all pod alerts on that node
IF: network_policy_removed THEN: suppress specific connectivity alerts
ANTI_PATTERN: 20 alerts fire simultaneously because Redis is down
FIX: root cause correlation — emit 1 alert for Redis, suppress the 19 symptoms
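The correlation rules above reduce to a mapping from infrastructure root causes to the downstream alerts they suppress. A sketch using the first two rule entries (the table is illustrative and would be extended with the remaining rules):

```python
# Root-cause correlation: split the set of firing alerts into root causes
# to emit and downstream symptoms to suppress.

SUPPRESSED_BY = {
    "redis_down":    {"stream_length", "consumer_lag", "task_dispatch"},
    "postgres_down": {"registry_sync", "admin_ui_health", "team_routing"},
}

def correlate(firing: set[str]) -> tuple[set[str], set[str]]:
    """Return (alerts to emit, alerts suppressed as downstream symptoms)."""
    suppressed: set[str] = set()
    for root in firing & SUPPRESSED_BY.keys():
        suppressed |= SUPPRESSED_BY[root] & firing
    return firing - suppressed, suppressed
```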
COOLDOWN_PERIODS¶
RULE: after an alert fires, minimum 15 minutes before it can re-fire
RULE: after an incident is resolved, minimum 30 minutes cooldown
RULE: cooldown prevents alert storms during recovery
ALERTING:SLO_BASED_ALERTING¶
ERROR_BUDGET_BURN_RATE¶
PURPOSE: alert based on how fast the error budget is being consumed, not on individual failures
FORMULA:
burn_rate = (error_rate_in_window / (1 - SLO_target))
Example (99.5% SLO):
if error_rate_in_window = 1%:
burn_rate = 0.01 / 0.005 = 2x
meaning: budget will exhaust in 15 days instead of 30
ALERT_THRESHOLDS:
FAST_BURN (CRITICAL):
window: 5 minutes
burn_rate: > 14.4x (budget exhausts in ~2 days)
action: page immediately
MEDIUM_BURN (HIGH):
window: 30 minutes
burn_rate: > 6x (budget exhausts in ~5 days)
action: alert during business hours
SLOW_BURN (MEDIUM):
window: 6 hours
burn_rate: > 2x (budget exhausts in ~15 days)
action: daily digest, prioritize reliability work
NORMAL:
burn_rate: <= 1x
action: no alert, budget on track
CHECK: what is the current burn rate?
IF: > 14.4x THEN: CRITICAL — budget exhausting rapidly, likely active incident
IF: > 6x THEN: HIGH — budget under pressure, investigate
IF: > 2x THEN: MEDIUM — trend is concerning, plan reliability work
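The formula and threshold table above can be expressed directly, assuming a 30-day SLO window (consistent with the "15 days instead of 30" example). Function names are illustrative.

```python
# burn_rate = error_rate / (1 - SLO): at 1x the budget lasts exactly the
# window; at 2x it exhausts in half the window.

def burn_rate(error_rate: float, slo_target: float) -> float:
    return error_rate / (1.0 - slo_target)

def classify(rate: float) -> str:
    """Map a burn rate to the alert severity from the threshold table."""
    if rate > 14.4:
        return "CRITICAL"  # budget exhausts in ~2 days
    if rate > 6:
        return "HIGH"      # ~5 days
    if rate > 2:
        return "MEDIUM"    # ~15 days
    return "NORMAL"

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    return window_days / rate
```

For the 99.5% SLO example: a 1% error rate gives a 2x burn, so the budget exhausts in 15 of the 30 days.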
MULTI_WINDOW_VALIDATION¶
PURPOSE: reduce false positives by requiring the condition in both short and long windows
TECHNIQUE:
alert fires ONLY when BOTH conditions are true:
1. short_window (5 min) burn_rate exceeds threshold
2. long_window (1 hour) burn_rate exceeds threshold * 0.5
This ensures:
- short spikes alone do not page (could be transient)
- sustained degradation does page (even at lower rate)
ANTI_PATTERN: alerting on 1-minute error rate spikes
FIX: multi-window validation filters transient spikes from real problems
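The two-condition guard above is a one-line predicate. A sketch using the FAST_BURN threshold as the default (parameter names are illustrative):

```python
# Multi-window validation: page only when the short window exceeds the
# threshold AND the long window exceeds half of it, so transient spikes
# alone do not page but sustained degradation does.

def should_page(short_burn: float, long_burn: float,
                threshold: float = 14.4) -> bool:
    return short_burn > threshold and long_burn > threshold * 0.5
```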
ALERTING:RUNBOOK_LINKED_ALERTS¶
RUNBOOK_STRUCTURE¶
PURPOSE: every alert links to a runbook that tells the responder what to do
REQUIRED_RUNBOOK_SECTIONS:
1. SYMPTOM: what the alert means in plain language
2. IMPACT: what is affected and how severely
3. INVESTIGATION: steps to diagnose root cause
4. REMEDIATION: steps to fix the problem
5. ESCALATION: when and how to escalate
6. VERIFICATION: how to confirm the fix worked
7. PREVENTION: what to change to prevent recurrence
RULE: every CRITICAL and HIGH alert MUST have a runbook
RULE: runbooks live in wiki at wiki/docs/development/runbooks/
RULE: update runbooks after every incident — incorporate lessons learned
CHECK: does the alert include a link to its runbook?
IF: no THEN: responder wastes time figuring out what to do — add the link
IF: yes THEN: responder can start remediation immediately
ANTI_PATTERN: runbooks that say "investigate and fix"
FIX: runbooks must have specific, step-by-step instructions
ANTI_PATTERN: runbooks that are never updated
FIX: post-incident review MUST include runbook update if steps were wrong or missing
ALERTING:INCIDENT_CORRELATION¶
INCIDENT_LIFECYCLE¶
PURPOSE: track alerts from detection through resolution
LIFECYCLE:
1. DETECTED: alert fires, incident created
2. ACKNOWLEDGED: responder claims the incident
3. INVESTIGATING: responder is diagnosing root cause
4. MITIGATING: responder is applying fix
5. RESOLVED: fix confirmed, alert cleared
6. POST_MORTEM: root cause documented, prevention planned
RULE: every P1 and P2 incident completes the full lifecycle
RULE: every P1 incident gets a post-mortem within 48 hours
RULE: post-mortem is blameless — focus on system improvements
RULE: post-mortem findings feed back to learning pipeline (session_learnings)
CHECK: are incidents being closed without post-mortem?
IF: yes THEN: learning opportunity missed — enforce post-mortem for P1/P2
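The six-stage lifecycle above can be enforced as a linear state machine, which makes "closed without post-mortem" structurally impossible for tracked incidents. A sketch under the assumption that stages may not be skipped (function name illustrative):

```python
# Incident lifecycle as a strict linear progression; jumping straight from
# DETECTED to RESOLVED is rejected, so every stage is recorded.

STAGES = ["DETECTED", "ACKNOWLEDGED", "INVESTIGATING",
          "MITIGATING", "RESOLVED", "POST_MORTEM"]

def advance(current: str, target: str) -> str:
    """Allow only the next stage in the lifecycle; reject skips."""
    if STAGES.index(target) != STAGES.index(current) + 1:
        raise ValueError(f"cannot jump from {current} to {target}")
    return target
```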
TIMELINE_RECONSTRUCTION¶
PURPOSE: build a complete timeline of what happened during an incident
TECHNIQUE:
1. collect all alerts that fired within the incident window
2. collect all logs with matching correlation_ids
3. collect all metrics anomalies in the incident window
4. order chronologically
5. identify: triggering event → cascade → detection → response → resolution
RULE: timeline includes both automated events (alerts, auto-remediation) and human actions
RULE: timeline is stored with the incident record for future reference
ANTI_PATTERN: post-mortem based on memory instead of data
FIX: reconstruct timeline from telemetry — memory is unreliable during incidents
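Steps 1-4 of the technique above can be sketched as a merge of the three event sources. The event dicts and the `ts` epoch-timestamp field are illustrative assumptions about the telemetry shape:

```python
# Timeline reconstruction: collect alerts, correlated log lines, and metric
# anomalies from the incident window, tag each with its source, and order
# everything chronologically.

def build_timeline(alerts: list[dict], logs: list[dict],
                   anomalies: list[dict]) -> list[dict]:
    """Merge the three event sources into one chronological timeline."""
    events = (
        [{**a, "source": "alert"} for a in alerts]
        + [{**entry, "source": "log"} for entry in logs]
        + [{**m, "source": "metric"} for m in anomalies]
    )
    return sorted(events, key=lambda e: e["ts"])
```

Step 5, identifying the triggering event and cascade, is then a read of the ordered list rather than an exercise in memory.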