DOMAIN:SYSTEM_INTEGRITY — PITFALLS¶
OWNER: ron
ALSO_USED_BY: gerco, thijmen, rutger, annegreet
UPDATED: 2026-03-26
SCOPE: false positive drift, alert fatigue, check ordering, race conditions, monitoring overhead
PITFALLS:FALSE_POSITIVE_DRIFT¶
EXPECTED_CHANGES_VS_REAL_DRIFT¶
PURPOSE: distinguish between intentional state changes and unauthorized drift
PROBLEM: not all differences between declared and actual state are drift
EXAMPLE: HPA scales replicas from 2 to 4 — this is expected, not drift
EXAMPLE: k8s adds default labels/annotations — not drift, just defaults
EXAMPLE: resource requests adjusted by VPA — expected if VPA is enabled
RULE: maintain a list of EXPECTED_VARIANCE fields that are NOT drift
EXPECTED_VARIANCE:
- spec.replicas (HPA manages this)
- metadata.resourceVersion (k8s internal)
- metadata.generation (k8s internal)
- metadata.creationTimestamp (immutable after creation)
- status.* (all status fields are runtime state)
- metadata.annotations["kubectl.kubernetes.io/*"] (kubectl bookkeeping)
- metadata.annotations["deployment.kubernetes.io/*"] (rollout tracking)
- metadata.managedFields (server-side apply tracking)
CHECK: is the detected drift in an EXPECTED_VARIANCE field?
IF: yes THEN: suppress — this is normal k8s behavior
IF: no THEN: real drift — classify by severity and act
ANTI_PATTERN: alerting on every kubectl diff output line
FIX: filter EXPECTED_VARIANCE fields before classifying drift
ANTI_PATTERN: suppressing too many fields, hiding real drift
FIX: review suppression list quarterly — remove entries that no longer apply
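The suppression step above can be sketched as a filter over drifted field paths. This is a minimal illustration, assuming drift is reported as dot-separated field paths; the exact path syntax from your diff tool may differ, and the pattern list mirrors the EXPECTED_VARIANCE entries above.

```python
# Sketch: filter detected drift fields against EXPECTED_VARIANCE before
# classifying. Field-path syntax and patterns are illustrative, not a schema.
from fnmatch import fnmatch

# Glob patterns for fields whose variance is normal k8s behavior (list above).
EXPECTED_VARIANCE = [
    "spec.replicas",                  # HPA manages this
    "metadata.resourceVersion",       # k8s internal
    "metadata.generation",            # k8s internal
    "metadata.creationTimestamp",     # immutable after creation
    "status.*",                       # all runtime state
    "metadata.annotations.kubectl.kubernetes.io/*",      # kubectl bookkeeping
    "metadata.annotations.deployment.kubernetes.io/*",   # rollout tracking
    "metadata.managedFields*",        # server-side apply tracking
]

def is_expected_variance(field_path: str) -> bool:
    """True if the drifted field matches a suppression pattern."""
    return any(fnmatch(field_path, pat) for pat in EXPECTED_VARIANCE)

def classify_drift(drifted_fields):
    """Split raw diff output into real drift and suppressed noise."""
    real = [f for f in drifted_fields if not is_expected_variance(f)]
    suppressed = [f for f in drifted_fields if is_expected_variance(f)]
    return real, suppressed
```

Reviewing the suppression list quarterly then means reviewing one pattern list in one place.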
DEPLOYMENT_WINDOW_FALSE_POSITIVES¶
PROBLEM: during a rolling deployment, old and new state coexist
RULE: drift detection MUST account for in-progress deployments
CHECK: is a rollout currently in progress?
RUN: kubectl rollout status deployment/<name> -n <ns> --watch=false
NOTE: --watch=false returns the current status without blocking; --timeout=0 means "no timeout" and waits indefinitely
IF: rollout in progress THEN: suppress drift alerts for that deployment
IF: rollout complete THEN: resume drift detection
ANTI_PATTERN: drift alerts during every deployment
FIX: pause drift detection for deploying resources, resume after rollout completes
RULE: deployment window suppression has a timeout — if rollout takes > 10 minutes, alert anyway
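The suppression-with-timeout rule can be sketched as follows. This assumes you track when the rollout started (e.g. from the deployment's rollout events); the kubectl invocation and the 10-minute cutoff follow the lines above, and `rollout_in_progress` is an illustrative wrapper, not a fixed API.

```python
# Sketch: suppress drift alerts during a rollout, but only up to a cutoff
# (10 minutes, per the rule above), so a stuck rollout still alerts.
import subprocess
import time

ROLLOUT_SUPPRESSION_TIMEOUT = 10 * 60  # seconds

def rollout_in_progress(name: str, namespace: str) -> bool:
    """Non-blocking check: does kubectl report the rollout as complete?"""
    out = subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{name}",
         "-n", namespace, "--watch=false"],
        capture_output=True, text=True,
    )
    return "successfully rolled out" not in out.stdout

def should_suppress(rollout_started_at, in_progress, now=None):
    """Suppress drift alerts for a deploying resource, unless stuck."""
    if not in_progress:
        return False                   # rollout complete: resume detection
    now = time.time() if now is None else now
    # Stuck rollout: stop suppressing once the timeout elapses.
    return (now - rollout_started_at) < ROLLOUT_SUPPRESSION_TIMEOUT
```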
CONFIG_RELOAD_DELAY¶
PROBLEM: after a ConfigMap update, pods may still use the old config until restart
CHECK: was the pod created before the ConfigMap's last update?
IF: yes THEN: pod may be running stale config — this is expected until restart
IF: restart policy is rolling THEN: pods will pick up new config gradually
RULE: do not flag ConfigMap drift if a rolling restart is scheduled
RULE: flag ConfigMap drift if no restart is scheduled within 1 hour of update
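The decision above can be sketched as a timestamp comparison. A minimal sketch, assuming you have the pod's creation time and the ConfigMap's last-update time (e.g. parsed from `kubectl ... -o json`); the function name and return labels are illustrative.

```python
# Sketch: decide whether post-ConfigMap-update drift should be flagged.
# Follows the rules above: stale config is expected if a restart is scheduled,
# and flagged only if no restart is scheduled within 1 hour of the update.
from datetime import datetime, timedelta

RESTART_GRACE = timedelta(hours=1)  # flag if no restart within 1h of update

def config_drift_status(pod_created: datetime, cm_updated: datetime,
                        restart_scheduled: bool, now: datetime) -> str:
    if pod_created >= cm_updated:
        return "fresh"               # pod started after the update
    if restart_scheduled:
        return "stale_expected"      # rolling restart will pick up new config
    if now - cm_updated > RESTART_GRACE:
        return "flag"                # stale > 1h with no restart planned
    return "stale_expected"          # still within the grace window
```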
PITFALLS:ALERT_FATIGUE¶
CAUSES_OF_ALERT_FATIGUE¶
PROBLEM: too many alerts cause operators to ignore all alerts, including critical ones
COMMON_CAUSES:
1. alerting on EXPECTED_VARIANCE (HPA scaling, rollout state)
2. alerting on LOW severity without aggregation
3. duplicate alerts from overlapping checks
4. alerts that fire and auto-resolve repeatedly (flapping)
5. alerts without severity classification (everything looks the same)
6. alerts on symptoms AND root cause simultaneously
CHECK: how many alerts fired in the last 24 hours?
IF: > 50 THEN: alert fatigue risk — review and deduplicate
IF: > 100 THEN: alert fatigue is active — operators are ignoring alerts
PREVENTION_STRATEGIES¶
RULE: every alert MUST have a severity classification (CRITICAL, HIGH, MEDIUM, LOW)
RULE: LOW severity = log only, NEVER page or notify
RULE: MEDIUM severity = daily digest, not real-time notification
RULE: only CRITICAL and HIGH generate real-time notifications
RULE: aggregate related alerts into a single incident
EXAMPLE: if 5 executor pods fail readiness simultaneously, that is 1 incident, not 5 alerts
RULE: deduplicate alerts — same check failing on consecutive runs = 1 alert, not N
RULE: set cooldown periods — after alert fires, suppress re-fire for 15 minutes minimum
RULE: flapping detection — if alert fires and clears > 3 times in 1 hour, suppress and investigate
ANTI_PATTERN: every check produces its own independent alert
FIX: correlate checks — if Redis is down, suppress all downstream Redis-dependent alerts
ANTI_PATTERN: alerts with no actionable remediation
FIX: if you cannot act on it, it is not an alert — it is a metric
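The deduplication, cooldown, and flapping rules above can be combined into one gate in front of the notifier. A minimal in-memory sketch; a real system would persist this state, and each call counts one fire attempt toward the flap window, which approximates fire/clear cycles. All names are illustrative.

```python
# Sketch: alert gate combining dedup/cooldown (15 min) and flapping
# suppression (> 3 fires per hour, per the rules above). In-memory state.
import time
from collections import defaultdict, deque

COOLDOWN = 15 * 60     # suppress re-fire for 15 min after an alert
FLAP_WINDOW = 60 * 60  # look at the last hour of fire attempts
FLAP_LIMIT = 3         # more than this in the window means flapping

class AlertGate:
    def __init__(self):
        self.last_fired = {}                   # alert key -> last notify time
        self.fires = defaultdict(deque)        # alert key -> fire timestamps

    def should_notify(self, key, now=None):
        now = time.time() if now is None else now
        events = self.fires[key]
        while events and now - events[0] > FLAP_WINDOW:
            events.popleft()                   # drop fires outside the window
        events.append(now)
        if len(events) > FLAP_LIMIT:
            return False                       # flapping: suppress, investigate
        if now - self.last_fired.get(key, float("-inf")) < COOLDOWN:
            return False                       # in cooldown: deduplicate
        self.last_fired[key] = now
        return True
```

Correlation and severity routing (log-only for LOW, digest for MEDIUM) would sit in front of this gate.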
ALERT_REVIEW_CADENCE¶
RULE: review all alerts weekly — remove or recalibrate those that are noise
RULE: track signal-to-noise ratio: actionable_alerts / total_alerts (target: > 80%)
RULE: if an alert has not been actionable in 30 days, demote or remove it
PITFALLS:CHECK_ORDERING_DEPENDENCIES¶
DEPENDENCY_AWARE_CHECK_EXECUTION¶
PROBLEM: some checks depend on other checks passing first
EXAMPLE: checking Redis stream lengths is pointless if Redis is down
EXAMPLE: checking agent registry sync is pointless if PostgreSQL is unreachable
EXAMPLE: checking certificate expiry is pointless if the TLS secret does not exist
RULE: order checks by dependency — infrastructure first, then application, then data
CHECK_ORDER:
PHASE 1 (infrastructure):
- network connectivity (can we reach Redis, PostgreSQL, k8s API?)
- DNS resolution (do service names resolve?)
- disk space (is there room for logs and data?)
PHASE 2 (platform):
- Redis health (PING, memory, clients)
- PostgreSQL health (connection, query, disk)
- k8s API health (can we list resources?)
PHASE 3 (application):
- pod health (running, ready, restart count)
- stream lengths (within MAXLEN bounds)
- config drift (manifests vs live state)
PHASE 4 (data):
- registry sync (JSON vs DB)
- secret rotation status
- certificate expiry
CHECK: did Phase 1 pass?
IF: no THEN: skip Phase 2-4 — results would be misleading
IF: yes THEN: proceed to Phase 2
ANTI_PATTERN: running all checks in parallel without dependency awareness
FIX: gate later phases on earlier ones; if infrastructure is down, skip application checks rather than letting them fail with misleading errors
ANTI_PATTERN: treating infrastructure check failure as an application problem
FIX: root cause attribution — a Redis timeout is infra:network, not app:stream_error
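The phased execution above can be sketched as an ordered runner that stops at the first failing phase. This is a minimal illustration; check functions are assumed to return True on pass, and the phase names mirror the CHECK_ORDER list.

```python
# Sketch: dependency-ordered check execution. Run phases in order; once a
# phase fails, skip the rest so downstream results are not misleading.

def run_phased_checks(phases):
    """phases: ordered list of (phase_name, [check_fn, ...]).
    Returns (results_by_phase, skipped_phase_names)."""
    results, skipped = {}, []
    for i, (name, checks) in enumerate(phases):
        outcomes = [check() for check in checks]  # run every check in phase
        results[name] = all(outcomes)
        if not results[name]:
            # Root cause is in this phase: later phases would be misleading.
            skipped = [n for n, _ in phases[i + 1:]]
            break
    return results, skipped
```

Attribution falls out of the structure: a failure in phase 1 is reported as infrastructure, not as N application errors.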
CIRCULAR_DEPENDENCY_IN_CHECKS¶
PROBLEM: monitoring system depends on the system it monitors
EXAMPLE: health check writes results to PostgreSQL — if PostgreSQL is down, health check fails to report
EXAMPLE: alert system sends via Redis — if Redis is down, alerts cannot be sent
RULE: health checks MUST have a fallback reporting path
RULE: if DB is unreachable, write health results to local file as fallback
RULE: if Redis is unreachable, use direct HTTP notification as fallback
RULE: the monitoring system should have the fewest possible dependencies
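The fallback rules above amount to trying reporting paths in order of preference. A minimal sketch, assuming writer callables for each path (DB insert, HTTP POST, local file append); the names and payload shape are illustrative.

```python
# Sketch: report health results through a chain of fallbacks so the monitor
# can still speak when its primary reporting path (DB, Redis) is down.
import json
import time

def report_health(result: dict, writers) -> str:
    """writers: ordered list of (path_name, write_fn). Try each in order and
    return the name of the path that succeeded. The last writer should be a
    local file append, which almost never fails."""
    payload = json.dumps({"ts": time.time(), **result})
    for name, write in writers:
        try:
            write(payload)
            return name
        except Exception:
            continue              # this path is down: fall through to the next
    return "lost"                 # every path failed
```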
PITFALLS:RACE_CONDITIONS_IN_VALIDATION¶
TIME_OF_CHECK_TIME_OF_USE¶
PROBLEM: state can change between when you check it and when you act on it
EXAMPLE: drift check reads deployment state → rollout starts → drift check reports "no drift" on now-stale data
EXAMPLE: secret rotation check says "compliant" → secret expires 1 minute later
EXAMPLE: stream length check says "within bounds" → burst of XADDs exceeds MAXLEN before next check
RULE: drift detection results have a TTL — stale results are not trustworthy
RULE: TTL = check_interval / 2 (e.g., 15-min check interval → results valid for 7.5 min)
RULE: when acting on a check result, re-verify before remediation
TECHNIQUE: optimistic locking for remediation
1. detect drift (first check)
2. plan remediation
3. re-check drift (second check, immediately before remediation)
4. if drift still present: apply remediation
5. if drift resolved: skip remediation (something else fixed it)
ANTI_PATTERN: remediating based on a 15-minute-old check result
FIX: re-verify immediately before any state-changing remediation
CONCURRENT_CHECK_INTERFERENCE¶
PROBLEM: multiple drift detection instances running simultaneously can interfere
EXAMPLE: two orchestrator replicas both detect same drift and both attempt remediation
EXAMPLE: drift check and deployment overlap — check sees partial state
RULE: use distributed locking for remediation actions
RULE: drift detection can run concurrently (read-only), remediation must be serialized
RULE: use Redis SETNX or similar for remediation locks with TTL
ANTI_PATTERN: multiple replicas all running the same drift remediation
FIX: leader election or distributed lock before any remediation action
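The SETNX-with-TTL rule can be sketched on Redis's atomic `SET key value NX EX ttl`. A minimal sketch: `r` is any client exposing redis-py-style `set()`/`eval()`; the key name and TTL are illustrative, and the Lua compare-and-delete ensures a replica releases only a lock it still owns.

```python
# Sketch: a remediation lock with TTL so only one replica acts and a crashed
# holder cannot block forever. Uses the atomic "SET ... NX EX" form.
import uuid

LOCK_KEY = "drift:remediation:lock"
LOCK_TTL = 120  # seconds; longer than any single remediation should take

RELEASE_SCRIPT = (  # compare-and-delete: release only if we still own it
    "if redis.call('get', KEYS[1]) == ARGV[1] then "
    "return redis.call('del', KEYS[1]) else return 0 end"
)

def try_remediate(r, remediate) -> bool:
    """Acquire the lock; remediate only if we got it. Returns True if we acted."""
    token = str(uuid.uuid4())                    # identifies this holder
    if not r.set(LOCK_KEY, token, nx=True, ex=LOCK_TTL):
        return False                             # another replica holds the lock
    try:
        remediate()
        return True
    finally:
        r.eval(RELEASE_SCRIPT, 1, LOCK_KEY, token)
```

Drift *detection* needs none of this: reads can run concurrently, so the lock wraps only the state-changing remediation.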
PITFALLS:RESOURCE_OVERHEAD_OF_MONITORING¶
MONITORING_COST¶
PROBLEM: monitoring itself consumes CPU, memory, network, and sometimes money
COST_SOURCES:
kubectl commands: k8s API server load, RBAC evaluation
redis-cli commands: Redis CPU, network bandwidth
sha256sum on large files: CPU spikes during checksum computation
DB queries for metrics: PostgreSQL connection pool, query CPU
LLM calls for analysis: direct cost ($) — NEVER in monitoring hot path
RULE: monitoring overhead MUST be < 5% of monitored system resources
RULE: NEVER call LLM APIs from health check or drift detection code
RULE: batch k8s API calls — one kubectl get pods -A is better than 50 individual gets
RULE: cache check results — do not re-query if cache is within TTL
CHECK: monitoring processes using > 5% of any resource?
IF: yes THEN: monitoring is too heavy — reduce frequency or batch operations
ANTI_PATTERN: running kubectl get in a tight loop
FIX: run on schedule with appropriate intervals, cache results between runs
ANTI_PATTERN: health check endpoint that computes everything on every request
FIX: compute health on schedule, serve cached result on request
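The compute-on-schedule, serve-from-cache fix can be sketched as a TTL'd wrapper around the expensive computation. `compute_health` is an injected callable and the TTL is illustrative.

```python
# Sketch: cache health results with a TTL so the endpoint serves a cached
# result instead of recomputing everything on every request.
import time

class CachedHealth:
    def __init__(self, compute_health, ttl=30):
        self.compute = compute_health
        self.ttl = ttl
        self.cached = None
        self.cached_at = float("-inf")   # force a compute on first request

    def get(self, now=None):
        now = time.time() if now is None else now
        if now - self.cached_at >= self.ttl:
            self.cached = self.compute()  # recompute only when cache is stale
            self.cached_at = now
        return self.cached
```

The same pattern covers the "do not re-query if cache is within TTL" rule for kubectl and redis-cli results.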
MONITORING_THE_MONITOR¶
PROBLEM: if the monitoring system fails silently, drift goes undetected
RULE: monitoring jobs MUST emit a heartbeat — absence of heartbeat = monitor failure
RULE: use a dead man's switch — external service that alerts if it does NOT receive a ping
RULE: host cron is more reliable than in-cluster CronJobs for critical monitoring
CHECK: when did the monitoring job last run successfully?
IF: > 2x the expected interval THEN: monitoring job has failed — investigate
TECHNIQUE: heartbeat-based meta-monitoring
1. monitoring job completes: write timestamp to Redis key with TTL
2. meta-monitor checks: does the key exist?
3. if key expired (TTL elapsed without refresh): monitoring job is dead
4. meta-monitor alerts via independent channel (not the same Redis)
ANTI_PATTERN: monitoring system that depends on the same infrastructure it monitors
FIX: meta-monitoring should use an independent path (host cron, external service)
ANTI_PATTERN: assuming CronJobs always run
FIX: CronJobs can be suspended, nodes can be overloaded, pods can fail to schedule — always verify
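The heartbeat technique above can be sketched with an expiring key. To keep the sketch testable without a server, `store` is a dict-like stand-in for a Redis key with TTL (expiry tracked as a timestamp); in production the meta-monitor would check key existence over an independent path, and the key name and interval are illustrative.

```python
# Sketch: heartbeat-based meta-monitoring. The monitoring job refreshes an
# expiring key after each successful run; the meta-monitor treats a missing
# or expired key as monitor failure and alerts via an independent channel.
import time

HEARTBEAT_KEY = "monitor:heartbeat"
HEARTBEAT_TTL = 2 * 15 * 60   # 2x a 15-min monitoring interval, per the rule

def record_heartbeat(store, now=None):
    """Call at the end of each successful monitoring run."""
    now = time.time() if now is None else now
    store[HEARTBEAT_KEY] = now + HEARTBEAT_TTL   # expiry timestamp

def monitor_is_alive(store, now=None) -> bool:
    """Meta-monitor check: when False, the monitoring job is dead."""
    now = time.time() if now is None else now
    expiry = store.get(HEARTBEAT_KEY)
    return expiry is not None and now < expiry
```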