DOMAIN:MONITORING — PITFALLS¶
OWNER: eltjo, annegreet
ALSO_USED_BY: ron, mira, nessa
UPDATED: 2026-03-26
SCOPE: cardinality explosion, log volume cost, alert storms, meta-monitoring, sampling bias
PITFALLS:CARDINALITY_EXPLOSION¶
WHAT_IS_CARDINALITY_EXPLOSION¶
PURPOSE: understand why high-cardinality labels destroy metric storage and query performance
PROBLEM: every unique combination of label values creates a new time series
EXAMPLE: metric with labels {agent, work_type, status_code} — 54 agents x 20 work_types x 10 status_codes = 10,800 series from ONE metric
EXAMPLE: adding request_id as a label — millions of unique values = millions of time series
RULE: label cardinality MUST be bounded — no unbounded labels
RULE: maximum combined cardinality per metric: 10,000 series
RULE: labels with > 100 unique values are HIGH RISK — review before adding
CHECK: what is the cardinality of each metric?
IF: > 10,000 series THEN: cardinality explosion — remove or bucket high-cardinality labels
IF: growing unbounded THEN: CRITICAL — storage will fill, queries will timeout
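The cardinality rule above can be checked with a short sketch. Worst-case series count is simply the product of unique values per label; the label names and counts below are the document's example figures, and `estimated_series` is an illustrative helper, not part of any real exporter:

```python
from math import prod

MAX_SERIES = 10_000  # combined-cardinality cap from the rule above

def estimated_series(label_values: dict) -> int:
    """Worst-case series count: product of unique values per label."""
    return prod(label_values.values())

# The example metric from above: {agent, work_type, status_code}
series = estimated_series({"agent": 54, "work_type": 20, "status_code": 10})
print(series)  # 10800 — already over the 10,000-series cap
assert series > MAX_SERIES  # this metric needs bucketing or label removal
```

Running this as a pre-merge check against planned label sets catches explosions before they reach storage.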
HIGH_RISK_LABELS¶
LABELS_TO_NEVER_USE:
request_id: unique per request — unbounded
session_id: unique per session — unbounded
trace_id: unique per trace — unbounded
user_id: grows with user base — use logs instead
error_message: near-infinite variations — use error_class
file_path: too many unique values — use service name
stack_trace_hash: too many unique values — use error fingerprint
LABELS_SAFE_TO_USE:
service: bounded (< 10 GE services)
agent: bounded (54 agents)
work_type: bounded (< 30 types)
status_code: bounded (< 20 codes)
severity: bounded (4 levels)
team: bounded (alfa, bravo, zulu, shared)
provider: bounded (anthropic, openai, google)
ANTI_PATTERN: adding customer_id as a metric label
FIX: use logs for per-customer data — metrics should aggregate across customers
ANTI_PATTERN: creating a new metric per agent instead of using labels
FIX: ge_tasks_total{agent="boris"} not ge_boris_tasks_total
REMEDIATION¶
TECHNIQUE: reduce cardinality by bucketing
Instead of: {status_code="200"}, {status_code="201"}, {status_code="204"}
Use: {status_class="2xx"}, {status_class="4xx"}, {status_class="5xx"}
Instead of: {duration_ms="142"}, {duration_ms="289"}, {duration_ms="1042"}
Use: histogram buckets [0.1, 0.5, 1.0, 5.0, 10.0] seconds
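The status-code bucketing above is a one-liner; a minimal sketch (function name is illustrative):

```python
def status_class(code: int) -> str:
    """Bucket a raw HTTP status code into a bounded status_class label."""
    return f"{code // 100}xx"

print(status_class(204))  # 2xx
print(status_class(503))  # 5xx
```

This collapses dozens of possible status_code values into at most five status_class values.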
TECHNIQUE: drop unused label combinations
If metric is only queried by {service, status_class}:
drop {agent, work_type} labels — they are not used and waste storage
PITFALLS:LOG_VOLUME_COST¶
COST_DRIVERS¶
PROBLEM: log storage, indexing, and querying consume resources proportional to volume
COST_COMPONENTS:
INGESTION: processing each log line (CPU, network)
STORAGE: disk space for log data (compressed)
INDEXING: building search indices (CPU, memory, disk)
QUERYING: searching through log data (CPU, memory)
RETENTION: keeping old logs around (disk)
ESTIMATION:
avg_line_size: 500 bytes (structured JSON)
lines_per_second: varies by service
ge-orchestrator: ~10 lines/sec (routing decisions, health checks)
ge-executor: ~50 lines/sec during execution, ~1 line/sec idle
admin-ui: ~5 lines/sec (API requests, UI rendering)
redis: ~2 lines/sec (slow log only)
total: ~70 lines/sec * 500 bytes = 35 KB/sec = ~3 GB/day uncompressed
compressed (10:1): ~300 MB/day
30-day retention: ~9 GB compressed
CHECK: does actual log volume match the estimate?
IF: > 2x estimate THEN: something is logging too much — find and fix
IF: < 0.5x estimate THEN: logs may be lost — check collector health
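The estimate above can be reproduced with the document's own figures. Note the per-service rates sum to 67 lines/sec, which the text rounds to ~70, so the exact numbers land slightly below the quoted ~3 GB/day and ~9 GB retained:

```python
# Per-service log rates from the estimation table above (lines/sec).
RATES = {"ge-orchestrator": 10, "ge-executor": 50, "admin-ui": 5, "redis": 2}
AVG_LINE_BYTES = 500
COMPRESSION = 10        # 10:1 compression ratio
RETENTION_DAYS = 30

lines_per_sec = sum(RATES.values())                      # 67, rounded to ~70 in the text
bytes_per_day = lines_per_sec * AVG_LINE_BYTES * 86_400
gb_per_day = bytes_per_day / 1e9                         # ~2.9 GB/day uncompressed
retained_gb = gb_per_day / COMPRESSION * RETENTION_DAYS  # ~8.7 GB compressed
print(f"{gb_per_day:.1f} GB/day uncompressed, {retained_gb:.1f} GB retained")
```

Re-running this with measured rates gives the baseline for the 2x / 0.5x checks above.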
VOLUME_REDUCTION_TECHNIQUES¶
TECHNIQUE: log level enforcement
PRODUCTION: INFO and above only (no DEBUG, no TRACE)
STAGING: DEBUG allowed
DEVELOPMENT: TRACE allowed
RULE: DEBUG/TRACE in production is the #1 cause of log volume spikes
CHECK: is any production service running at DEBUG level?
IF: yes THEN: reduce to INFO immediately — DEBUG can 10x log volume
TECHNIQUE: sampling for high-frequency events
Instead of logging every health check (every 10 seconds):
log every 10th health check (every 100 seconds)
log ALL failed health checks (never sample errors)
RULE: NEVER sample error logs — every error is potentially important
RULE: sample INFO logs for high-frequency repetitive events
TECHNIQUE: message deduplication
If same log message repeats > 10 times in 1 minute:
emit once with count: "Message X repeated 47 times in last minute"
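The dedup technique above can be sketched with a sliding one-minute window per message; the class and thresholds are illustrative, not a specific library's API:

```python
from collections import defaultdict

WINDOW_S, THRESHOLD = 60, 10  # suppress after 10 repeats per minute

class Deduper:
    def __init__(self):
        self.seen = defaultdict(list)  # message -> recent timestamps

    def should_emit(self, msg: str, now: float) -> bool:
        """True while under threshold; False once the message is suppressed."""
        times = self.seen[msg]
        times[:] = [t for t in times if now - t < WINDOW_S]  # drop stale entries
        times.append(now)
        # Once suppressed, a real implementation would emit one summary
        # line per window: "Message X repeated N times in last minute".
        return len(times) <= THRESHOLD
```

Usage: the first 10 occurrences within a minute emit normally; the 11th onward is suppressed until the window slides past.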
ANTI_PATTERN: logging inside tight loops
FIX: log before/after the loop, not inside — 1 million iterations = 1 million log lines
ANTI_PATTERN: logging full request/response bodies
FIX: log metadata (size, status, duration) — bodies go to dedicated audit log if needed
ANTI_PATTERN: logging stack traces for expected errors (e.g., validation failures)
FIX: stack traces only for unexpected errors — validation failures need message only
PITFALLS:ALERT_STORM_CASCADES¶
WHAT_IS_AN_ALERT_STORM¶
PROBLEM: a single infrastructure failure triggers dozens or hundreds of alerts
EXAMPLE:
TRIGGER: Redis goes down
ALERTS THAT FIRE:
1. redis_health_check CRITICAL
2. stream_length_check FAILED (cannot connect)
3. consumer_lag_check FAILED (cannot connect)
4. orchestrator_health DEGRADED (cannot route)
5. executor_health DEGRADED (cannot consume)
6. admin_ui_health DEGRADED (cannot dispatch)
7. task_dispatch_latency HIGH (tasks not moving)
8. agent_status_check FAILED (cannot read stream)
... (potentially 20+ alerts)
RESULT: operator receives 20+ notifications in 2 minutes — cannot identify root cause in the noise
PREVENTION_STRATEGIES¶
TECHNIQUE: dependency-aware alert suppression
IF: redis_down alert fires
THEN: suppress ALL alerts tagged with dependency:redis
KEEP: only the root cause alert (redis_down)
EMIT: single incident with correlated symptoms
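The suppression technique above reduces to a small correlation step. A sketch, assuming alerts tagged with their dependency and root-cause alerts named `*_down` (the tag map and naming convention are illustrative):

```python
# Hypothetical dependency tags, mirroring the cascade rules in this section.
DEPENDS_ON = {
    "stream_length_check": "redis",
    "consumer_lag_check": "redis",
    "orchestrator_routing": "redis",
    "registry_sync_check": "postgres",
}

def correlate(firing: list) -> dict:
    """Keep root-cause alerts for notification; fold dependents in as symptoms."""
    roots = {a.removesuffix("_down") for a in firing if a.endswith("_down")}
    incident = {"root_causes": sorted(roots), "symptoms": [], "notify": []}
    for alert in firing:
        if DEPENDS_ON.get(alert) in roots:
            incident["symptoms"].append(alert)  # still logged, not notified
        else:
            incident["notify"].append(alert)
    return incident

print(correlate(["redis_down", "stream_length_check", "consumer_lag_check"]))
# only redis_down reaches the notify list; the rest become symptoms
```

The operator then receives one incident instead of 20+ notifications.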
TECHNIQUE: alert grouping by time window
1. collect all alerts that fire within a 2-minute window
2. group by probable root cause (shared dependency)
3. emit single incident per group
4. list individual alerts as symptoms within the incident
TECHNIQUE: hierarchical alert evaluation
1. evaluate infrastructure alerts first (Redis, PostgreSQL, network)
2. if infrastructure alert fires: mark dependent alerts as suppressed
3. evaluate application alerts only for non-suppressed components
4. evaluate data alerts only for non-suppressed data stores
RULE: alert evaluation order matches dependency graph
RULE: suppressed alerts are still logged — just not notified
RULE: when root cause resolves, re-evaluate suppressed alerts
ANTI_PATTERN: flat alert evaluation (all alerts independent)
FIX: dependency graph determines evaluation order and suppression
ANTI_PATTERN: suppressing too aggressively (masking secondary failures)
FIX: re-evaluate after root cause resolves — some symptoms may indicate additional problems
GE_SPECIFIC_CASCADE_RULES¶
CASCADES:
redis_down → suppress:
- stream_length_check
- consumer_lag_check
- orchestrator_routing
- executor_consumption
- task_dispatch_latency
postgres_down → suppress:
- registry_sync_check
- admin_ui_api_health
- team_routing_check
- sla_measurement
- learning_pipeline
node_pressure → suppress:
- all pod health alerts on affected node
- resource limit drift on affected pods
PITFALLS:MONITORING_THE_MONITORING_SYSTEM¶
THE_META_MONITORING_PROBLEM¶
PROBLEM: if the monitoring system fails silently, problems go undetected
EXAMPLES:
1. log collector pod crashes — logs stop flowing, no alerts fire
2. health dump cron job fails — admin-ui shows stale data
3. alert routing broken — alerts generated but never delivered
4. metrics storage full — new metrics dropped silently
5. drift detection script has a bug — reports no drift when drift exists
RULE: monitoring systems MUST emit heartbeats
RULE: heartbeat absence = monitoring failure = alert via independent channel
DEAD_MANS_SWITCH¶
PURPOSE: alert when a periodic process STOPS running (absence of signal)
TECHNIQUE:
1. monitoring job completes: POST to dead man's switch URL
2. dead man's switch expects POST every N minutes
3. if POST is missed: switch triggers alert via independent channel
4. independent channel = NOT the same system being monitored
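The switch side of the technique above is a freshness comparison; a minimal sketch, with the 15-minute window as an illustrative choice:

```python
MAX_AGE_S = 15 * 60  # expected heartbeat interval plus slack (illustrative)

def heartbeat_stale(last_post_ts: float, now: float,
                    max_age: float = MAX_AGE_S) -> bool:
    """True when the monitored job has missed its heartbeat window."""
    return (now - last_post_ts) > max_age

# The switch runs OUTSIDE the monitored system; on staleness it alerts
# via an independent channel (e.g. a separate paging provider), never
# through the same pipeline whose heartbeat just went missing.
```

The function is trivial by design: all of the reliability comes from where it runs and which channel it alerts through.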
CHECK: does every critical monitoring job have a dead man's switch?
IF: no THEN: that monitoring job can fail silently — add a heartbeat
GE_HEARTBEATS:
health-dump cron: write timestamp to /tmp/health-dump-heartbeat
drift-detection cron: write timestamp to Redis key with TTL
alert-routing: send test alert every hour (verify delivery)
log-collector: emit "collector_alive" log every 60 seconds
RULE: check heartbeat freshness as part of the monitoring loop
RULE: heartbeat check uses a DIFFERENT path than the monitored system
ANTI_PATTERN: monitoring system checks its own health and reports "healthy"
FIX: external verification — something OUTSIDE the system verifies the system is working
HOST_CRON_VS_K8S_CRONJOB¶
RULE: critical monitoring runs on host cron, not k8s CronJobs
REASON: k8s CronJobs can fail to schedule if cluster is under pressure
REASON: health-dump.sh already uses host cron (known working pattern)
CHECK: which monitoring jobs are k8s CronJobs?
IF: critical monitoring is a CronJob THEN: consider migrating to host cron
IF: non-critical monitoring is a CronJob THEN: acceptable, but add heartbeat
PITFALLS:SAMPLING_BIAS¶
WHAT_IS_SAMPLING_BIAS¶
PROBLEM: when you sample telemetry data, the sample may not represent the full population
EXAMPLES:
1. trace sampling at 10%: rare errors may never be sampled
2. log sampling by service: high-traffic services dominate the sample
3. metric aggregation by hour: sub-minute spikes are invisible
4. only monitoring agent sessions: non-agent traffic is invisible
SAMPLING_STRATEGIES¶
TECHNIQUE: head-based sampling (decide at request start)
PROS: simple, consistent trace completeness
CONS: rare events may be missed
USE WHEN: high-volume, uniform traffic
TECHNIQUE: tail-based sampling (decide at request end)
PROS: always captures errors and slow requests
CONS: requires buffering all spans until decision
USE WHEN: you need 100% error capture with reduced normal traffic
TECHNIQUE: error-biased sampling
RULE: sample 100% of errors (never miss a failure)
RULE: sample 10% of successful requests (reduce volume)
RULE: sample 100% of requests > p99 latency (capture outliers)
RULE: never drop errors through sampling — 100% error capture is mandatory
RULE: success sampling rate depends on traffic volume
RULE: adjust sampling rate dynamically based on error rate
CHECK: are errors being sampled (dropped)?
IF: yes THEN: CRITICAL — fix immediately, error visibility is non-negotiable
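The error-biased sampling rules above combine into one keep/drop decision per request. A sketch with illustrative thresholds (the p99 value would come from recent traffic in practice):

```python
import random

SUCCESS_RATE = 0.10   # keep 10% of successful requests (per the rule above)
P99_LATENCY_S = 2.0   # hypothetical p99 from recent traffic

def should_keep(error: bool, latency_s: float, rng=random.random) -> bool:
    """Keep 100% of errors and slow outliers; downsample normal traffic."""
    if error:
        return True                # never drop errors
    if latency_s > P99_LATENCY_S:
        return True                # capture latency outliers
    return rng() < SUCCESS_RATE    # sample the healthy majority
```

Dynamic adjustment (the last rule above) would vary `SUCCESS_RATE` based on observed volume and error rate rather than hard-coding it.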
AGGREGATION_BIAS¶
PROBLEM: aggregating metrics hides important details
EXAMPLE: p50 latency is 100ms, p99 is 5000ms — average (250ms) hides the tail
EXAMPLE: hourly error count is 10, but all 10 happened in a 2-minute burst
EXAMPLE: agent "boris" has 0% error rate, but only processed 1 task (not statistically significant)
RULE: always report percentiles (p50, p95, p99), not just averages
RULE: always report sample size alongside rates — low sample = unreliable rate
RULE: for low-traffic agents, require minimum 10 samples before computing rates
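The minimum-sample rule above can be enforced at the point where rates are computed; a sketch (returning None for "not enough data" is one possible convention):

```python
MIN_SAMPLES = 10  # minimum sample size from the rule above

def error_rate(errors: int, total: int):
    """Return a rate only when the sample is large enough to be meaningful."""
    if total < MIN_SAMPLES:
        return None  # report the sample size instead of a misleading rate
    return errors / total

print(error_rate(1, 2))    # None — 50% would be misleading
print(error_rate(5, 100))  # 0.05
```

Alerting logic then treats None as "insufficient data", never as 0% or 100%.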
ANTI_PATTERN: alerting on error rate for agents with < 10 tasks
FIX: require a minimum sample size before alerting — 1 error out of 2 tasks = 50% error rate, but not statistically meaningful
ANTI_PATTERN: reporting only averages for latency
FIX: averages hide tail latency — p99 matters more for user experience
ANTI_PATTERN: comparing metrics across agents with vastly different traffic volumes
FIX: normalize by traffic volume or use per-request metrics