DOMAIN:MONITORING — PITFALLS¶
OWNER: eltjo, annegreet
ALSO_USED_BY: ron, mira, nessa
UPDATED: 2026-03-26
SCOPE: cardinality explosion, log volume cost, alert storms, meta-monitoring, sampling bias
PITFALLS:CARDINALITY_EXPLOSION¶
WHAT_IS_CARDINALITY_EXPLOSION¶
PURPOSE: understand why high-cardinality labels destroy metric storage and query performance
PROBLEM: every unique combination of label values creates a new time series
EXAMPLE: metric with labels {agent, work_type, status_code} — 54 agents x 20 work_types x 10 status_codes = 10,800 series from ONE metric
EXAMPLE: adding request_id as a label — millions of unique values = millions of time series
RULE: label cardinality MUST be bounded — no unbounded labels
RULE: maximum combined cardinality per metric: 10,000 series
RULE: labels with > 100 unique values are HIGH RISK — review before adding
CHECK: what is the cardinality of each metric?
IF: > 10,000 series THEN: cardinality explosion — remove or bucket high-cardinality labels
IF: growing unbounded THEN: CRITICAL — storage will fill, queries will timeout
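The cardinality rule above can be checked with a short sketch. Worst-case series count is simply the product of unique values per label; the label names and counts below are the document's example figures, and `estimated_series` is an illustrative helper, not part of any real exporter:

```python
from math import prod

MAX_SERIES = 10_000  # combined-cardinality cap from the rule above

def estimated_series(label_values: dict) -> int:
    """Worst-case series count: product of unique values per label."""
    return prod(label_values.values())

# The example metric from above: {agent, work_type, status_code}
series = estimated_series({"agent": 54, "work_type": 20, "status_code": 10})
print(series)  # 10800 — already over the 10,000-series cap
assert series > MAX_SERIES  # this metric needs bucketing or label removal
```

Running this as a pre-merge check against planned label sets catches explosions before they reach storage.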
HIGH_RISK_LABELS¶
LABELS_TO_NEVER_USE:
request_id: unique per request — unbounded
session_id: unique per session — unbounded
trace_id: unique per trace — unbounded
user_id: grows with user base — use logs instead
error_message: near-infinite variations — use error_class
file_path: too many unique values — use service name
stack_trace_hash: too many unique values — use error fingerprint
LABELS_SAFE_TO_USE:
service: bounded (< 10 GE services)
agent: bounded (54 agents)
work_type: bounded (< 30 types)
status_code: bounded (< 20 codes)
severity: bounded (4 levels)
team: bounded (alfa, bravo, zulu, shared)
provider: bounded (anthropic, openai, google)
ANTI_PATTERN: adding customer_id as a metric label
FIX: use logs for per-customer data — metrics should aggregate across customers
ANTI_PATTERN: creating a new metric per agent instead of using labels
FIX: ge_tasks_total{agent="boris"} not ge_boris_tasks_total
REMEDIATION¶
TECHNIQUE: reduce cardinality by bucketing
Instead of: {status_code="200"}, {status_code="201"}, {status_code="204"}
Use: {status_class="2xx"}, {status_class="4xx"}, {status_class="5xx"}
Instead of: {duration_ms="142"}, {duration_ms="289"}, {duration_ms="1042"}
Use: histogram buckets [0.1, 0.5, 1.0, 5.0, 10.0] seconds
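The status-code bucketing above is a one-liner; a minimal sketch (function name is illustrative):

```python
def status_class(code: int) -> str:
    """Bucket a raw HTTP status code into a bounded status_class label."""
    return f"{code // 100}xx"

print(status_class(204))  # 2xx
print(status_class(503))  # 5xx
```

This collapses dozens of possible status_code values into at most five status_class values.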
TECHNIQUE: drop unused label combinations
If metric is only queried by {service, status_class}:
drop {agent, work_type} labels — they are not used and waste storage
PITFALLS:LOG_VOLUME_COST¶
COST_DRIVERS¶
PROBLEM: log storage, indexing, and querying consume resources proportional to volume
COST_COMPONENTS:
INGESTION: processing each log line (CPU, network)
STORAGE: disk space for log data (compressed)
INDEXING: building search indices (CPU, memory, disk)
QUERYING: searching through log data (CPU, memory)
RETENTION: keeping old logs around (disk)
ESTIMATION:
avg_line_size: 500 bytes (structured JSON)
lines_per_second: varies by service
ge-orchestrator: ~10 lines/sec (routing decisions, health checks)
ge-executor: ~50 lines/sec during execution, ~1 line/sec idle
admin-ui: ~5 lines/sec (API requests, UI rendering)
redis: ~2 lines/sec (slow log only)
total: ~70 lines/sec * 500 bytes = 35 KB/sec = ~3 GB/day uncompressed
compressed (10:1): ~300 MB/day
30-day retention: ~9 GB compressed
CHECK: does actual log volume match the estimate?
IF: > 2x estimate THEN: something is logging too much — find and fix
IF: < 0.5x estimate THEN: logs may be lost — check collector health
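The estimate above can be reproduced with the document's own figures. Note the per-service rates sum to 67 lines/sec, which the text rounds to ~70, so the exact numbers land slightly below the quoted ~3 GB/day and ~9 GB retained:

```python
# Per-service log rates from the estimation table above (lines/sec).
RATES = {"ge-orchestrator": 10, "ge-executor": 50, "admin-ui": 5, "redis": 2}
AVG_LINE_BYTES = 500
COMPRESSION = 10        # 10:1 compression ratio
RETENTION_DAYS = 30

lines_per_sec = sum(RATES.values())                      # 67, rounded to ~70 in the text
bytes_per_day = lines_per_sec * AVG_LINE_BYTES * 86_400
gb_per_day = bytes_per_day / 1e9                         # ~2.9 GB/day uncompressed
retained_gb = gb_per_day / COMPRESSION * RETENTION_DAYS  # ~8.7 GB compressed
print(f"{gb_per_day:.1f} GB/day uncompressed, {retained_gb:.1f} GB retained")
```

Re-running this with measured rates gives the baseline for the 2x / 0.5x checks above.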
VOLUME_REDUCTION_TECHNIQUES¶
TECHNIQUE: log level enforcement
PRODUCTION: INFO and above only (no DEBUG, no TRACE)
STAGING: DEBUG allowed
DEVELOPMENT: TRACE allowed
RULE: DEBUG/TRACE in production is the #1 cause of log volume spikes
CHECK: is any production service running at DEBUG level?
IF: yes THEN: reduce to INFO immediately — DEBUG can 10x log volume
TECHNIQUE: sampling for high-frequency events
Instead of logging every health check (every 10 seconds):
log every 10th health check (every 100 seconds)
log ALL failed health checks (never sample errors)
RULE: NEVER sample error logs — every error is potentially important
RULE: sample INFO logs for high-frequency repetitive events
TECHNIQUE: message deduplication
If same log message repeats > 10 times in 1 minute:
emit once with count: "Message X repeated 47 times in last minute"
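The dedup technique above can be sketched with a sliding one-minute window per message; the class and thresholds are illustrative, not a specific library's API:

```python
from collections import defaultdict

WINDOW_S, THRESHOLD = 60, 10  # suppress after 10 repeats per minute

class Deduper:
    def __init__(self):
        self.seen = defaultdict(list)  # message -> recent timestamps

    def should_emit(self, msg: str, now: float) -> bool:
        """True while under threshold; False once the message is suppressed."""
        times = self.seen[msg]
        times[:] = [t for t in times if now - t < WINDOW_S]  # drop stale entries
        times.append(now)
        # Once suppressed, a real implementation would emit one summary
        # line per window: "Message X repeated N times in last minute".
        return len(times) <= THRESHOLD
```

Usage: the first 10 occurrences within a minute emit normally; the 11th onward is suppressed until the window slides past.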
ANTI_PATTERN: logging inside tight loops
FIX: log before/after the loop, not inside — 1 million iterations = 1 million log lines
ANTI_PATTERN: logging full request/response bodies
FIX: log metadata (size, status, duration) — bodies go to dedicated audit log if needed
ANTI_PATTERN: logging stack traces for expected errors (e.g., validation failures)
FIX: stack traces only for unexpected errors — validation failures need message only
PITFALLS:ALERT_STORM_CASCADES¶
WHAT_IS_AN_ALERT_STORM¶
PROBLEM: a single infrastructure failure triggers dozens or hundreds of alerts
EXAMPLE:
TRIGGER: Redis goes down
ALERTS THAT FIRE:
1. redis_health_check CRITICAL
2. stream_length_check FAILED (cannot connect)
3. consumer_lag_check FAILED (cannot connect)
4. orchestrator_health DEGRADED (cannot route)
5. executor_health DEGRADED (cannot consume)
6. admin_ui_health DEGRADED (cannot dispatch)
7. task_dispatch_latency HIGH (tasks not moving)
8. agent_status_check FAILED (cannot read stream)
... (potentially 20+ alerts)
RESULT: operator receives 20+ notifications in 2 minutes — cannot identify root cause in the noise
PREVENTION_STRATEGIES¶
TECHNIQUE: dependency-aware alert suppression
IF: redis_down alert fires
THEN: suppress ALL alerts tagged with dependency:redis
KEEP: only the root cause alert (redis_down)
EMIT: single incident with correlated symptoms
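The suppression technique above reduces to a small correlation step. A sketch, assuming alerts tagged with their dependency and root-cause alerts named `*_down` (the tag map and naming convention are illustrative):

```python
# Hypothetical dependency tags, mirroring the cascade rules in this section.
DEPENDS_ON = {
    "stream_length_check": "redis",
    "consumer_lag_check": "redis",
    "orchestrator_routing": "redis",
    "registry_sync_check": "postgres",
}

def correlate(firing: list) -> dict:
    """Keep root-cause alerts for notification; fold dependents in as symptoms."""
    roots = {a.removesuffix("_down") for a in firing if a.endswith("_down")}
    incident = {"root_causes": sorted(roots), "symptoms": [], "notify": []}
    for alert in firing:
        if DEPENDS_ON.get(alert) in roots:
            incident["symptoms"].append(alert)  # still logged, not notified
        else:
            incident["notify"].append(alert)
    return incident

print(correlate(["redis_down", "stream_length_check", "consumer_lag_check"]))
# only redis_down reaches the notify list; the rest become symptoms
```

The operator then receives one incident instead of 20+ notifications.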
TECHNIQUE: alert grouping by time window
1. collect all alerts that fire within a 2-minute window
2. group by probable root cause (shared dependency)
3. emit single incident per group
4. list individual alerts as symptoms within the incident
TECHNIQUE: hierarchical alert evaluation
1. evaluate infrastructure alerts first (Redis, PostgreSQL, network)
2. if infrastructure alert fires: mark dependent alerts as suppressed
3. evaluate application alerts only for non-suppressed components
4. evaluate data alerts only for non-suppressed data stores
RULE: alert evaluation order matches dependency graph
RULE: suppressed alerts are still logged — just not notified
RULE: when root cause resolves, re-evaluate suppressed alerts
ANTI_PATTERN: flat alert evaluation (all alerts independent)
FIX: dependency graph determines evaluation order and suppression
ANTI_PATTERN: suppressing too aggressively (masking secondary failures)
FIX: re-evaluate after root cause resolves — some symptoms may indicate additional problems
GE_SPECIFIC_CASCADE_RULES¶
CASCADES:
redis_down → suppress:
- stream_length_check
- consumer_lag_check
- orchestrator_routing
- executor_consumption
- task_dispatch_latency
postgres_down → suppress:
- registry_sync_check
- admin_ui_api_health
- team_routing_check
- sla_measurement
- learning_pipeline
node_pressure → suppress:
- all pod health alerts on affected node
- resource limit drift on affected pods
PITFALLS:MONITORING_THE_MONITORING_SYSTEM¶
THE_META_MONITORING_PROBLEM¶
PROBLEM: if the monitoring system fails silently, problems go undetected
EXAMPLES:
1. log collector pod crashes — logs stop flowing, no alerts fire
2. health dump cron job fails — admin-ui shows stale data
3. alert routing broken — alerts generated but never delivered
4. metrics storage full — new metrics dropped silently
5. drift detection script has a bug — reports no drift when drift exists
RULE: monitoring systems MUST emit heartbeats
RULE: heartbeat absence = monitoring failure = alert via independent channel
DEAD_MANS_SWITCH¶
PURPOSE: alert when a periodic process STOPS running (absence of signal)
TECHNIQUE:
1. monitoring job completes: POST to dead man's switch URL
2. dead man's switch expects POST every N minutes
3. if POST is missed: switch triggers alert via independent channel
4. independent channel = NOT the same system being monitored
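The switch side of the technique above is a freshness comparison; a minimal sketch, with the 15-minute window as an illustrative choice:

```python
MAX_AGE_S = 15 * 60  # expected heartbeat interval plus slack (illustrative)

def heartbeat_stale(last_post_ts: float, now: float,
                    max_age: float = MAX_AGE_S) -> bool:
    """True when the monitored job has missed its heartbeat window."""
    return (now - last_post_ts) > max_age

# The switch runs OUTSIDE the monitored system; on staleness it alerts
# via an independent channel (e.g. a separate paging provider), never
# through the same pipeline whose heartbeat just went missing.
```

The function is trivial by design: all of the reliability comes from where it runs and which channel it alerts through.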
CHECK: does every critical monitoring job have a dead man's switch?
IF: no THEN: that monitoring job can fail silently — add a heartbeat
GE_HEARTBEATS:
health-dump cron: write timestamp to /tmp/health-dump-heartbeat
drift-detection cron: write timestamp to Redis key with TTL
alert-routing: send test alert every hour (verify delivery)
log-collector: emit "collector_alive" log every 60 seconds
RULE: check heartbeat freshness as part of the monitoring loop
RULE: heartbeat check uses a DIFFERENT path than the monitored system
ANTI_PATTERN: monitoring system checks its own health and reports "healthy"
FIX: external verification — something OUTSIDE the system verifies the system is working
HOST_CRON_VS_K8S_CRONJOB¶
RULE: critical monitoring runs on host cron, not k8s CronJobs
REASON: k8s CronJobs can fail to schedule if cluster is under pressure
REASON: health-dump.sh already uses host cron (known working pattern)
CHECK: which monitoring jobs are k8s CronJobs?
IF: critical monitoring is a CronJob THEN: consider migrating to host cron
IF: non-critical monitoring is a CronJob THEN: acceptable, but add heartbeat
PITFALLS:SAMPLING_BIAS¶
WHAT_IS_SAMPLING_BIAS¶
PROBLEM: when you sample telemetry data, the sample may not represent the full population
EXAMPLES:
1. trace sampling at 10%: rare errors may never be sampled
2. log sampling by service: high-traffic services dominate the sample
3. metric aggregation by hour: sub-minute spikes are invisible
4. only monitoring agent sessions: non-agent traffic is invisible
SAMPLING_STRATEGIES¶
TECHNIQUE: head-based sampling (decide at request start)
PROS: simple, consistent trace completeness
CONS: rare events may be missed
USE WHEN: high-volume, uniform traffic
TECHNIQUE: tail-based sampling (decide at request end)
PROS: always captures errors and slow requests
CONS: requires buffering all spans until decision
USE WHEN: you need 100% error capture with reduced normal traffic
TECHNIQUE: error-biased sampling
RULE: sample 100% of errors (never miss a failure)
RULE: sample 10% of successful requests (reduce volume)
RULE: sample 100% of requests > p99 latency (capture outliers)
RULE: never drop errors through sampling — 100% error capture is mandatory
RULE: success sampling rate depends on traffic volume
RULE: adjust sampling rate dynamically based on error rate
CHECK: are errors being sampled (dropped)?
IF: yes THEN: CRITICAL — fix immediately, error visibility is non-negotiable
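The error-biased sampling rules above combine into one keep/drop decision per request. A sketch with illustrative thresholds (the p99 value would come from recent traffic in practice):

```python
import random

SUCCESS_RATE = 0.10   # keep 10% of successful requests (per the rule above)
P99_LATENCY_S = 2.0   # hypothetical p99 from recent traffic

def should_keep(error: bool, latency_s: float, rng=random.random) -> bool:
    """Keep 100% of errors and slow outliers; downsample normal traffic."""
    if error:
        return True                # never drop errors
    if latency_s > P99_LATENCY_S:
        return True                # capture latency outliers
    return rng() < SUCCESS_RATE    # sample the healthy majority
```

Dynamic adjustment (the last rule above) would vary `SUCCESS_RATE` based on observed volume and error rate rather than hard-coding it.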
AGGREGATION_BIAS¶
PROBLEM: aggregating metrics hides important details
EXAMPLE: p50 latency is 100ms, p99 is 5000ms — average (250ms) hides the tail
EXAMPLE: hourly error count is 10, but all 10 happened in a 2-minute burst
EXAMPLE: agent "boris" has 0% error rate, but only processed 1 task (not statistically significant)
RULE: always report percentiles (p50, p95, p99), not just averages
RULE: always report sample size alongside rates — low sample = unreliable rate
RULE: for low-traffic agents, require minimum 10 samples before computing rates
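The minimum-sample rule above can be enforced at the point where rates are computed; a sketch (returning None for "not enough data" is one possible convention):

```python
MIN_SAMPLES = 10  # minimum sample size from the rule above

def error_rate(errors: int, total: int):
    """Return a rate only when the sample is large enough to be meaningful."""
    if total < MIN_SAMPLES:
        return None  # report the sample size instead of a misleading rate
    return errors / total

print(error_rate(1, 2))    # None — 50% would be misleading
print(error_rate(5, 100))  # 0.05
```

Alerting logic then treats None as "insufficient data", never as 0% or 100%.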
ANTI_PATTERN: alerting on error rate for agents with < 10 tasks
FIX: require a minimum sample size before alerting — 1 error out of 2 tasks = 50% error rate, but not statistically meaningful
ANTI_PATTERN: reporting only averages for latency
FIX: averages hide tail latency — p99 matters more for user experience
ANTI_PATTERN: comparing metrics across agents with vastly different traffic volumes
FIX: normalize by traffic volume or use per-request metrics