DOMAIN:SYSTEM_INTEGRITY — PITFALLS¶
OWNER: ron
ALSO_USED_BY: gerco, thijmen, rutger, annegreet
UPDATED: 2026-03-26
SCOPE: false positive drift, alert fatigue, check ordering, race conditions, monitoring overhead
PITFALLS:FALSE_POSITIVE_DRIFT¶
EXPECTED_CHANGES_VS_REAL_DRIFT¶
PURPOSE: distinguish between intentional state changes and unauthorized drift
PROBLEM: not all differences between declared and actual state are drift
EXAMPLE: HPA scales replicas from 2 to 4 — this is expected, not drift
EXAMPLE: k8s adds default labels/annotations — not drift, just defaults
EXAMPLE: resource requests adjusted by VPA — expected if VPA is enabled
RULE: maintain a list of EXPECTED_VARIANCE fields that are NOT drift
EXPECTED_VARIANCE:
- spec.replicas (HPA manages this)
- metadata.resourceVersion (k8s internal)
- metadata.generation (k8s internal)
- metadata.creationTimestamp (immutable after creation)
- status.* (all status fields are runtime state)
- metadata.annotations["kubectl.kubernetes.io/*"] (kubectl bookkeeping)
- metadata.annotations["deployment.kubernetes.io/*"] (rollout tracking)
- metadata.managedFields (server-side apply tracking)
CHECK: is the detected drift in an EXPECTED_VARIANCE field?
IF: yes THEN: suppress — this is normal k8s behavior
IF: no THEN: real drift — classify by severity and act
ANTI_PATTERN: alerting on every kubectl diff output line
FIX: filter EXPECTED_VARIANCE fields before classifying drift
ANTI_PATTERN: suppressing too many fields, hiding real drift
FIX: review suppression list quarterly — remove entries that no longer apply
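The suppression step above can be sketched as a filter over drifted field paths. This is a minimal illustration, assuming drift is reported as dot-separated field paths; the exact path syntax from your diff tool may differ, and the pattern list mirrors the EXPECTED_VARIANCE entries above.

```python
# Sketch: filter detected drift fields against EXPECTED_VARIANCE before
# classifying. Field-path syntax and patterns are illustrative, not a schema.
from fnmatch import fnmatch

# Glob patterns for fields whose variance is normal k8s behavior (list above).
EXPECTED_VARIANCE = [
    "spec.replicas",                  # HPA manages this
    "metadata.resourceVersion",       # k8s internal
    "metadata.generation",            # k8s internal
    "metadata.creationTimestamp",     # immutable after creation
    "status.*",                       # all runtime state
    "metadata.annotations.kubectl.kubernetes.io/*",      # kubectl bookkeeping
    "metadata.annotations.deployment.kubernetes.io/*",   # rollout tracking
    "metadata.managedFields*",        # server-side apply tracking
]

def is_expected_variance(field_path: str) -> bool:
    """True if the drifted field matches a suppression pattern."""
    return any(fnmatch(field_path, pat) for pat in EXPECTED_VARIANCE)

def classify_drift(drifted_fields):
    """Split raw diff output into real drift and suppressed noise."""
    real = [f for f in drifted_fields if not is_expected_variance(f)]
    suppressed = [f for f in drifted_fields if is_expected_variance(f)]
    return real, suppressed
```

Reviewing the suppression list quarterly then means reviewing one pattern list in one place.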
DEPLOYMENT_WINDOW_FALSE_POSITIVES¶
PROBLEM: during a rolling deployment, old and new state coexist
RULE: drift detection MUST account for in-progress deployments
CHECK: is a rollout currently in progress?
RUN: kubectl rollout status deployment/<name> -n <ns> --watch=false
NOTE: --watch=false returns the current status without blocking; --timeout=0 means "no timeout" and waits indefinitely
IF: rollout in progress THEN: suppress drift alerts for that deployment
IF: rollout complete THEN: resume drift detection
ANTI_PATTERN: drift alerts during every deployment
FIX: pause drift detection for deploying resources, resume after rollout completes
RULE: deployment window suppression has a timeout — if rollout takes > 10 minutes, alert anyway
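The suppression-with-timeout rule can be sketched as follows. This assumes you track when the rollout started (e.g. from the deployment's rollout events); the kubectl invocation and the 10-minute cutoff follow the lines above, and `rollout_in_progress` is an illustrative wrapper, not a fixed API.

```python
# Sketch: suppress drift alerts during a rollout, but only up to a cutoff
# (10 minutes, per the rule above), so a stuck rollout still alerts.
import subprocess
import time

ROLLOUT_SUPPRESSION_TIMEOUT = 10 * 60  # seconds

def rollout_in_progress(name: str, namespace: str) -> bool:
    """Non-blocking check: does kubectl report the rollout as complete?"""
    out = subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{name}",
         "-n", namespace, "--watch=false"],
        capture_output=True, text=True,
    )
    return "successfully rolled out" not in out.stdout

def should_suppress(rollout_started_at, in_progress, now=None):
    """Suppress drift alerts for a deploying resource, unless stuck."""
    if not in_progress:
        return False                   # rollout complete: resume detection
    now = time.time() if now is None else now
    # Stuck rollout: stop suppressing once the timeout elapses.
    return (now - rollout_started_at) < ROLLOUT_SUPPRESSION_TIMEOUT
```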
CONFIG_RELOAD_DELAY¶
PROBLEM: after a ConfigMap update, pods may still use the old config until restart
CHECK: was the pod created before the ConfigMap's last update?
IF: yes THEN: pod may be running stale config — this is expected until restart
IF: restart policy is rolling THEN: pods will pick up new config gradually
RULE: do not flag ConfigMap drift if a rolling restart is scheduled
RULE: flag ConfigMap drift if no restart is scheduled within 1 hour of update
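The decision above can be sketched as a timestamp comparison. A minimal sketch, assuming you have the pod's creation time and the ConfigMap's last-update time (e.g. parsed from `kubectl ... -o json`); the function name and return labels are illustrative.

```python
# Sketch: decide whether post-ConfigMap-update drift should be flagged.
# Follows the rules above: stale config is expected if a restart is scheduled,
# and flagged only if no restart is scheduled within 1 hour of the update.
from datetime import datetime, timedelta

RESTART_GRACE = timedelta(hours=1)  # flag if no restart within 1h of update

def config_drift_status(pod_created: datetime, cm_updated: datetime,
                        restart_scheduled: bool, now: datetime) -> str:
    if pod_created >= cm_updated:
        return "fresh"               # pod started after the update
    if restart_scheduled:
        return "stale_expected"      # rolling restart will pick up new config
    if now - cm_updated > RESTART_GRACE:
        return "flag"                # stale > 1h with no restart planned
    return "stale_expected"          # still within the grace window
```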
PITFALLS:ALERT_FATIGUE¶
CAUSES_OF_ALERT_FATIGUE¶
PROBLEM: too many alerts cause operators to ignore all alerts, including critical ones
COMMON_CAUSES:
1. alerting on EXPECTED_VARIANCE (HPA scaling, rollout state)
2. alerting on LOW severity without aggregation
3. duplicate alerts from overlapping checks
4. alerts that fire and auto-resolve repeatedly (flapping)
5. alerts without severity classification (everything looks the same)
6. alerts on symptoms AND root cause simultaneously
CHECK: how many alerts fired in the last 24 hours?
IF: > 50 THEN: alert fatigue risk — review and deduplicate
IF: > 100 THEN: alert fatigue is active — operators are ignoring alerts
PREVENTION_STRATEGIES¶
RULE: every alert MUST have a severity classification (CRITICAL, HIGH, MEDIUM, LOW)
RULE: LOW severity = log only, NEVER page or notify
RULE: MEDIUM severity = daily digest, not real-time notification
RULE: only CRITICAL and HIGH generate real-time notifications
RULE: aggregate related alerts into a single incident
EXAMPLE: if 5 executor pods fail readiness simultaneously, that is 1 incident, not 5 alerts
RULE: deduplicate alerts — same check failing on consecutive runs = 1 alert, not N
RULE: set cooldown periods — after alert fires, suppress re-fire for 15 minutes minimum
RULE: flapping detection — if alert fires and clears > 3 times in 1 hour, suppress and investigate
ANTI_PATTERN: every check produces its own independent alert
FIX: correlate checks — if Redis is down, suppress all downstream Redis-dependent alerts
ANTI_PATTERN: alerts with no actionable remediation
FIX: if you cannot act on it, it is not an alert — it is a metric
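The deduplication, cooldown, and flapping rules above can be combined into one gate in front of the notifier. A minimal in-memory sketch; a real system would persist this state, and each call counts one fire attempt toward the flap window, which approximates fire/clear cycles. All names are illustrative.

```python
# Sketch: alert gate combining dedup/cooldown (15 min) and flapping
# suppression (> 3 fires per hour, per the rules above). In-memory state.
import time
from collections import defaultdict, deque

COOLDOWN = 15 * 60     # suppress re-fire for 15 min after an alert
FLAP_WINDOW = 60 * 60  # look at the last hour of fire attempts
FLAP_LIMIT = 3         # more than this in the window means flapping

class AlertGate:
    def __init__(self):
        self.last_fired = {}                   # alert key -> last notify time
        self.fires = defaultdict(deque)        # alert key -> fire timestamps

    def should_notify(self, key, now=None):
        now = time.time() if now is None else now
        events = self.fires[key]
        while events and now - events[0] > FLAP_WINDOW:
            events.popleft()                   # drop fires outside the window
        events.append(now)
        if len(events) > FLAP_LIMIT:
            return False                       # flapping: suppress, investigate
        if now - self.last_fired.get(key, float("-inf")) < COOLDOWN:
            return False                       # in cooldown: deduplicate
        self.last_fired[key] = now
        return True
```

Correlation and severity routing (log-only for LOW, digest for MEDIUM) would sit in front of this gate.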
ALERT_REVIEW_CADENCE¶
RULE: review all alerts weekly — remove or recalibrate those that are noise
RULE: track signal-to-noise ratio: actionable_alerts / total_alerts (target: > 80%)
RULE: if an alert has not been actionable in 30 days, demote or remove it
PITFALLS:CHECK_ORDERING_DEPENDENCIES¶
DEPENDENCY_AWARE_CHECK_EXECUTION¶
PROBLEM: some checks depend on other checks passing first
EXAMPLE: checking Redis stream lengths is pointless if Redis is down
EXAMPLE: checking agent registry sync is pointless if PostgreSQL is unreachable
EXAMPLE: checking certificate expiry is pointless if the TLS secret does not exist
RULE: order checks by dependency — infrastructure first, then application, then data
CHECK_ORDER:
PHASE 1 (infrastructure):
- network connectivity (can we reach Redis, PostgreSQL, k8s API?)
- DNS resolution (do service names resolve?)
- disk space (is there room for logs and data?)
PHASE 2 (platform):
- Redis health (PING, memory, clients)
- PostgreSQL health (connection, query, disk)
- k8s API health (can we list resources?)
PHASE 3 (application):
- pod health (running, ready, restart count)
- stream lengths (within MAXLEN bounds)
- config drift (manifests vs live state)
PHASE 4 (data):
- registry sync (JSON vs DB)
- secret rotation status
- certificate expiry
CHECK: did Phase 1 pass?
IF: no THEN: skip Phase 2-4 — results would be misleading
IF: yes THEN: proceed to Phase 2
ANTI_PATTERN: running all checks in parallel without dependency awareness
FIX: gate later phases on earlier ones; if infrastructure is down, skip application checks rather than letting them fail with misleading errors
ANTI_PATTERN: treating infrastructure check failure as an application problem
FIX: root cause attribution — a Redis timeout is infra:network, not app:stream_error
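The phased execution above can be sketched as an ordered runner that stops at the first failing phase. This is a minimal illustration; check functions are assumed to return True on pass, and the phase names mirror the CHECK_ORDER list.

```python
# Sketch: dependency-ordered check execution. Run phases in order; once a
# phase fails, skip the rest so downstream results are not misleading.

def run_phased_checks(phases):
    """phases: ordered list of (phase_name, [check_fn, ...]).
    Returns (results_by_phase, skipped_phase_names)."""
    results, skipped = {}, []
    for i, (name, checks) in enumerate(phases):
        outcomes = [check() for check in checks]  # run every check in phase
        results[name] = all(outcomes)
        if not results[name]:
            # Root cause is in this phase: later phases would be misleading.
            skipped = [n for n, _ in phases[i + 1:]]
            break
    return results, skipped
```

Attribution falls out of the structure: a failure in phase 1 is reported as infrastructure, not as N application errors.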
CIRCULAR_DEPENDENCY_IN_CHECKS¶
PROBLEM: monitoring system depends on the system it monitors
EXAMPLE: health check writes results to PostgreSQL — if PostgreSQL is down, health check fails to report
EXAMPLE: alert system sends via Redis — if Redis is down, alerts cannot be sent
RULE: health checks MUST have a fallback reporting path
RULE: if DB is unreachable, write health results to local file as fallback
RULE: if Redis is unreachable, use direct HTTP notification as fallback
RULE: the monitoring system should have the fewest possible dependencies
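The fallback rules above amount to trying reporting paths in order of preference. A minimal sketch, assuming writer callables for each path (DB insert, HTTP POST, local file append); the names and payload shape are illustrative.

```python
# Sketch: report health results through a chain of fallbacks so the monitor
# can still speak when its primary reporting path (DB, Redis) is down.
import json
import time

def report_health(result: dict, writers) -> str:
    """writers: ordered list of (path_name, write_fn). Try each in order and
    return the name of the path that succeeded. The last writer should be a
    local file append, which almost never fails."""
    payload = json.dumps({"ts": time.time(), **result})
    for name, write in writers:
        try:
            write(payload)
            return name
        except Exception:
            continue              # this path is down: fall through to the next
    return "lost"                 # every path failed
```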
PITFALLS:RACE_CONDITIONS_IN_VALIDATION¶
TIME_OF_CHECK_TIME_OF_USE¶
PROBLEM: state can change between when you check it and when you act on it
EXAMPLE: drift check reads deployment state → rollout starts → drift check reports "no drift" on now-stale data
EXAMPLE: secret rotation check says "compliant" → secret expires 1 minute later
EXAMPLE: stream length check says "within bounds" → burst of XADDs exceeds MAXLEN before next check
RULE: drift detection results have a TTL — stale results are not trustworthy
RULE: TTL = check_interval / 2 (e.g., 15-min check interval → results valid for 7.5 min)
RULE: when acting on a check result, re-verify before remediation
TECHNIQUE: optimistic locking for remediation
1. detect drift (first check)
2. plan remediation
3. re-check drift (second check, immediately before remediation)
4. if drift still present: apply remediation
5. if drift resolved: skip remediation (something else fixed it)
ANTI_PATTERN: remediating based on a 15-minute-old check result
FIX: re-verify immediately before any state-changing remediation
CONCURRENT_CHECK_INTERFERENCE¶
PROBLEM: multiple drift detection instances running simultaneously can interfere
EXAMPLE: two orchestrator replicas both detect same drift and both attempt remediation
EXAMPLE: drift check and deployment overlap — check sees partial state
RULE: use distributed locking for remediation actions
RULE: drift detection can run concurrently (read-only), remediation must be serialized
RULE: use Redis SETNX or similar for remediation locks with TTL
ANTI_PATTERN: multiple replicas all running the same drift remediation
FIX: leader election or distributed lock before any remediation action
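The SETNX-with-TTL rule can be sketched on Redis's atomic `SET key value NX EX ttl`. A minimal sketch: `r` is any client exposing redis-py-style `set()`/`eval()`; the key name and TTL are illustrative, and the Lua compare-and-delete ensures a replica releases only a lock it still owns.

```python
# Sketch: a remediation lock with TTL so only one replica acts and a crashed
# holder cannot block forever. Uses the atomic "SET ... NX EX" form.
import uuid

LOCK_KEY = "drift:remediation:lock"
LOCK_TTL = 120  # seconds; longer than any single remediation should take

RELEASE_SCRIPT = (  # compare-and-delete: release only if we still own it
    "if redis.call('get', KEYS[1]) == ARGV[1] then "
    "return redis.call('del', KEYS[1]) else return 0 end"
)

def try_remediate(r, remediate) -> bool:
    """Acquire the lock; remediate only if we got it. Returns True if we acted."""
    token = str(uuid.uuid4())                    # identifies this holder
    if not r.set(LOCK_KEY, token, nx=True, ex=LOCK_TTL):
        return False                             # another replica holds the lock
    try:
        remediate()
        return True
    finally:
        r.eval(RELEASE_SCRIPT, 1, LOCK_KEY, token)
```

Drift *detection* needs none of this: reads can run concurrently, so the lock wraps only the state-changing remediation.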
PITFALLS:RESOURCE_OVERHEAD_OF_MONITORING¶
MONITORING_COST¶
PROBLEM: monitoring itself consumes CPU, memory, network, and sometimes money
COST_SOURCES:
kubectl commands: k8s API server load, RBAC evaluation
redis-cli commands: Redis CPU, network bandwidth
sha256sum on large files: CPU spikes during checksum computation
DB queries for metrics: PostgreSQL connection pool, query CPU
LLM calls for analysis: direct cost ($) — NEVER in monitoring hot path
RULE: monitoring overhead MUST be < 5% of monitored system resources
RULE: NEVER call LLM APIs from health check or drift detection code
RULE: batch k8s API calls — one kubectl get pods -A is better than 50 individual gets
RULE: cache check results — do not re-query if cache is within TTL
CHECK: monitoring processes using > 5% of any resource?
IF: yes THEN: monitoring is too heavy — reduce frequency or batch operations
ANTI_PATTERN: running kubectl get in a tight loop
FIX: run on schedule with appropriate intervals, cache results between runs
ANTI_PATTERN: health check endpoint that computes everything on every request
FIX: compute health on schedule, serve cached result on request
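The compute-on-schedule, serve-from-cache fix can be sketched as a TTL'd wrapper around the expensive computation. `compute_health` is an injected callable and the TTL is illustrative.

```python
# Sketch: cache health results with a TTL so the endpoint serves a cached
# result instead of recomputing everything on every request.
import time

class CachedHealth:
    def __init__(self, compute_health, ttl=30):
        self.compute = compute_health
        self.ttl = ttl
        self.cached = None
        self.cached_at = float("-inf")   # force a compute on first request

    def get(self, now=None):
        now = time.time() if now is None else now
        if now - self.cached_at >= self.ttl:
            self.cached = self.compute()  # recompute only when cache is stale
            self.cached_at = now
        return self.cached
```

The same pattern covers the "do not re-query if cache is within TTL" rule for kubectl and redis-cli results.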
MONITORING_THE_MONITOR¶
PROBLEM: if the monitoring system fails silently, drift goes undetected
RULE: monitoring jobs MUST emit a heartbeat — absence of heartbeat = monitor failure
RULE: use a dead man's switch — external service that alerts if it does NOT receive a ping
RULE: host cron is more reliable than in-cluster CronJobs for critical monitoring
CHECK: when did the monitoring job last run successfully?
IF: > 2x the expected interval THEN: monitoring job has failed — investigate
TECHNIQUE: heartbeat-based meta-monitoring
1. monitoring job completes: write timestamp to Redis key with TTL
2. meta-monitor checks: does the key exist?
3. if key expired (TTL elapsed without refresh): monitoring job is dead
4. meta-monitor alerts via independent channel (not the same Redis)
ANTI_PATTERN: monitoring system that depends on the same infrastructure it monitors
FIX: meta-monitoring should use an independent path (host cron, external service)
ANTI_PATTERN: assuming CronJobs always run
FIX: CronJobs can be suspended, nodes can be overloaded, pods can fail to schedule — always verify
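The heartbeat technique above can be sketched with an expiring key. To keep the sketch testable without a server, `store` is a dict-like stand-in for a Redis key with TTL (expiry tracked as a timestamp); in production the meta-monitor would check key existence over an independent path, and the key name and interval are illustrative.

```python
# Sketch: heartbeat-based meta-monitoring. The monitoring job refreshes an
# expiring key after each successful run; the meta-monitor treats a missing
# or expired key as monitor failure and alerts via an independent channel.
import time

HEARTBEAT_KEY = "monitor:heartbeat"
HEARTBEAT_TTL = 2 * 15 * 60   # 2x a 15-min monitoring interval, per the rule

def record_heartbeat(store, now=None):
    """Call at the end of each successful monitoring run."""
    now = time.time() if now is None else now
    store[HEARTBEAT_KEY] = now + HEARTBEAT_TTL   # expiry timestamp

def monitor_is_alive(store, now=None) -> bool:
    """Meta-monitor check: when False, the monitoring job is dead."""
    now = time.time() if now is None else now
    expiry = store.get(HEARTBEAT_KEY)
    return expiry is not None and now < expiry
```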