DOMAIN:SYSTEM_INTEGRITY — DRIFT_DETECTION¶
OWNER: ron
ALSO_USED_BY: gerco, thijmen, rutger, annegreet
UPDATED: 2026-03-26
SCOPE: configuration drift detection patterns, desired-state vs actual-state comparison, drift categories, remediation strategies
DRIFT_DETECTION:CORE_PRINCIPLE¶
PURPOSE: detect when actual system state diverges from declared desired state before the divergence causes incidents
RULE: drift = diff(ACTUAL_STATE, DESIRED_STATE) where DESIRED_STATE is always git-committed
RULE: every drift detection must produce a structured diff, not a boolean pass/fail
RULE: drift detection runs on schedule — never only on-demand
CHECK: is desired state defined in version control?
IF: no THEN: you cannot detect drift — define desired state first
IF: yes THEN: compare actual state against the committed version
ANTI_PATTERN: treating "it works" as evidence of no drift
FIX: detect on a schedule: systems can drift silently for weeks before the drift causes a visible failure
ANTI_PATTERN: relying on human memory for what the configuration should be
FIX: desired state MUST be codified in git — manifests, config files, or DB migrations
DRIFT_DETECTION:DRIFT_CATEGORIES¶
RESOURCE_LIMIT_DRIFT¶
PURPOSE: detect when running resource limits differ from declared manifests
TOOL: kubectl
RUN: kubectl get deployments -A -o json | jq '.items[] | . as $d | .spec.template.spec.containers[] | {ns: $d.metadata.namespace, name: $d.metadata.name, container: .name, cpu: .resources.limits.cpu, mem: .resources.limits.memory}'
CHECK: running CPU/memory limits match k8s manifest files in k8s/base/
IF: limits differ THEN: someone applied a manual change — check metadata.managedFields and the API server audit log for the source
IF: HPA maxReplicas > 5 THEN: CRITICAL — cost burn risk, see CLAUDE.md binding rules
DETECTION_METHOD: snapshot-and-diff
1. parse manifests from git: extract resources.limits per container
2. query live state: kubectl get deployment -o json
3. compare field-by-field: cpu, memory, ephemeral-storage
4. classify: MATCH, DRIFT_UP (more than declared), DRIFT_DOWN (less than declared)
5. DRIFT_UP on limits = potential cost overrun
6. DRIFT_DOWN on limits = potential OOM/throttle risk
ANTI_PATTERN: comparing only deployment-level fields, ignoring container-level overrides
FIX: iterate all containers in pod spec, not just containers[0]
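EXAMPLE_SKETCH (bash): a minimal sketch of DETECTION_METHOD steps 2-4 above, snapshotting live limits for every container (not just containers[0]) and diffing against a git-committed expectation; expected-limits.json is a hypothetical artifact you would generate from k8s/base/ in CI.
#!/usr/bin/env bash
set -euo pipefail

# snapshot live limits, one record per container, keyed ns/deploy/container
kubectl get deployments -A -o json | jq '
  [.items[] | . as $d | .spec.template.spec.containers[]
   | {("\($d.metadata.namespace)/\($d.metadata.name)/\(.name)"):
      (.resources.limits // {})}] | add // {}' > live-limits.json

# structured diff: matching keys drop out, drifted keys become records
# (classifying DRIFT_UP vs DRIFT_DOWN further requires parsing quantities like 500m, 1Gi)
jq -n --slurpfile want expected-limits.json --slurpfile got live-limits.json '
  $want[0] as $w | $got[0] as $g |
  [($w + $g) | keys[] | select($w[.] != $g[.])
   | {container: ., declared: $w[.], live: $g[.]}]'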
ENV_VAR_DRIFT¶
PURPOSE: detect when environment variables in running pods differ from manifests
TOOL: kubectl
RUN: kubectl get deployment <name> -n <ns> -o json | jq '.spec.template.spec.containers[] | {container: .name, env: .env}'
CHECK: env vars in running pod match manifest declaration
CHECK: no extra env vars injected that are not in the manifest
CHECK: env vars referencing ConfigMaps/Secrets point to correct versions
IF: extra env var found not in manifest THEN: manual injection — investigate source
IF: env var value differs THEN: ConfigMap or Secret was updated without manifest change
DRIFT_SEVERITY: HIGH if env var controls security behavior (API keys, auth flags)
DRIFT_SEVERITY: MEDIUM if env var controls operational behavior (log levels, feature flags)
DRIFT_SEVERITY: LOW if env var is informational (version labels, build metadata)
ANTI_PATTERN: comparing env vars without resolving ConfigMap/Secret references
FIX: resolve valueFrom references before comparison — the actual value matters
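EXAMPLE_SKETCH (bash): a minimal sketch of valueFrom resolution; single-key refs are assumed, "-" marks an absent field, and secret values are digested, never printed.
#!/usr/bin/env bash
set -euo pipefail
NAME=${1:?deployment} NS=${2:?namespace}

kubectl get deployment "$NAME" -n "$NS" -o json | jq -r '
  .spec.template.spec.containers[] | .name as $c | .env[]? |
  [$c, .name,
   (.valueFrom.configMapKeyRef.name // "-"), (.valueFrom.configMapKeyRef.key // "-"),
   (.valueFrom.secretKeyRef.name // "-"),    (.valueFrom.secretKeyRef.key // "-"),
   (.value // "-")] | @tsv' |
while IFS=$'\t' read -r container var cmname cmkey secname seckey literal; do
  # caveat: jsonpath breaks on keys containing dots; escape them if needed
  if [ "$cmname" != "-" ]; then      # resolve the ConfigMap reference to its value
    val=$(kubectl get configmap "$cmname" -n "$NS" -o jsonpath="{.data.$cmkey}")
  elif [ "$secname" != "-" ]; then   # compare a digest, never the plaintext
    val=$(kubectl get secret "$secname" -n "$NS" -o jsonpath="{.data.$seckey}" | sha256sum | cut -d' ' -f1)
  else
    val=$literal
  fi
  printf '%s\t%s\t%s\n' "$container" "$var" "$val"
done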
SECRET_DRIFT¶
PURPOSE: detect when secrets in cluster differ from expected state
TOOL: kubectl
RUN: kubectl get secrets -A -o json | jq '.items[] | {ns: .metadata.namespace, name: .metadata.name, keys: (.data | keys), age: .metadata.creationTimestamp}'
CHECK: secret key names match expected set (do NOT compare values in plain text)
CHECK: secret creation timestamp is newer than last known rotation
CHECK: no orphaned secrets exist (secrets not referenced by any pod)
CHECK: ge-secrets in ge-agents namespace has expected keys
IF: secret has unexpected keys THEN: investigate — may be unauthorized addition
IF: secret age > rotation policy THEN: overdue for rotation — flag for review
RULE: NEVER log or store secret values during drift detection
RULE: compare key names and metadata only — not the actual secret data
DRIFT_SEVERITY: CRITICAL if secret referenced by pod does not exist (pod will crash)
DRIFT_SEVERITY: HIGH if secret key set changed unexpectedly
DRIFT_SEVERITY: MEDIUM if rotation overdue
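EXAMPLE_SKETCH (bash): a metadata-only sketch of the rotation check; ROTATION_DAYS is illustrative and GNU date is assumed. No secret values are read or logged.
#!/usr/bin/env bash
set -euo pipefail
ROTATION_DAYS=90
now=$(date +%s)

kubectl get secrets -A -o json | jq -r '
  .items[] | [.metadata.namespace, .metadata.name, .metadata.creationTimestamp,
              ((.data // {}) | keys | join(","))] | @tsv' |
while IFS=$'\t' read -r ns name created keys; do
  age_days=$(( (now - $(date -d "$created" +%s)) / 86400 ))
  # caveat: creationTimestamp only moves when the Secret is recreated;
  # in-place rotation needs its own rotation annotation to track
  if [ "$age_days" -gt "$ROTATION_DAYS" ]; then
    echo "MEDIUM: rotation overdue: $ns/$name (${age_days}d old, keys: $keys)"
  fi
done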
CRONJOB_DRIFT¶
PURPOSE: detect when CronJob configurations differ from manifests
TOOL: kubectl
RUN: kubectl get cronjobs -A -o json | jq '.items[] | {ns: .metadata.namespace, name: .metadata.name, schedule: .spec.schedule, suspended: .spec.suspend, image: .spec.jobTemplate.spec.template.spec.containers[0].image}'
CHECK: schedule field matches manifest
CHECK: suspend field matches manifest (false unless explicitly suspended)
CHECK: container image matches expected version
CHECK: no CronJobs exist that are not declared in manifests
IF: CronJob suspended without corresponding manifest change THEN: unauthorized suspension
IF: schedule changed THEN: timing drift — may cause missed or double executions
IF: unknown CronJob found THEN: investigate origin — may be test artifact or attack
ANTI_PATTERN: only checking if CronJobs exist, not their configuration
FIX: full field comparison including schedule, image, env vars, resource limits
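EXAMPLE_SKETCH (bash): a sketch of the "no undeclared CronJobs" check; flat manifests under k8s/base/ are assumed (a kustomize or helm setup would render first), and -o name drops the namespace, so unique CronJob names are assumed too.
#!/usr/bin/env bash
set -euo pipefail

live=$(kubectl get cronjobs -A -o name | sort -u)
declared=$(kubectl apply -f k8s/base/ --dry-run=client -o name | grep '^cronjob' | sort -u || true)

# anything live but undeclared is the "unknown CronJob" case above
comm -23 <(printf '%s\n' "$live") <(printf '%s\n' "$declared") |
  sed 's/^/INVESTIGATE undeclared: /'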
NETWORK_POLICY_DRIFT¶
PURPOSE: detect when network policies differ from declared state
TOOL: kubectl
RUN: kubectl get networkpolicies -A -o json | jq '.items[] | {ns: .metadata.namespace, name: .metadata.name, podSelector: .spec.podSelector, ingress: (.spec.ingress | length), egress: (.spec.egress | length)}'
CHECK: network policies exist for all namespaces that should be isolated
CHECK: ingress/egress rules match declared policies
CHECK: no network policies have been deleted (weakening isolation)
CHECK: orchestrator-to-postgres policy exists in ge-agents namespace
IF: network policy missing THEN: CRITICAL — namespace is wide open
IF: rules added that are not in manifest THEN: investigate — may be emergency fix or breach
IF: rules removed THEN: CRITICAL — isolation weakened
DRIFT_SEVERITY: CRITICAL for any network policy removal or weakening
DRIFT_SEVERITY: HIGH for undeclared rule additions
DRIFT_SEVERITY: MEDIUM for selector changes
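EXAMPLE_SKETCH (bash): a sketch of the presence check; the ISOLATED_NAMESPACES list is an assumption, derive it from your manifests rather than hardcoding.
#!/usr/bin/env bash
set -euo pipefail
ISOLATED_NAMESPACES="ge-agents"   # extend with every namespace that must be isolated

for ns in $ISOLATED_NAMESPACES; do
  count=$(kubectl get networkpolicies -n "$ns" -o json | jq '.items | length')
  if [ "$count" -eq 0 ]; then
    echo "CRITICAL: namespace $ns has no NetworkPolicy (wide open)"
  fi
done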
DRIFT_DETECTION:COMPARISON_TECHNIQUES¶
KUBECTL_DIFF¶
PURPOSE: compare declared manifests against live cluster state
TOOL: kubectl
RUN: kubectl diff -f k8s/base/
CHECK: exit code 0 means no drift
CHECK: exit code 1 means drift detected — review the diff output
CHECK: exit code > 1 means error in comparison — fix before trusting results
RULE: kubectl diff shows what would change if you applied the manifest
RULE: some fields are set by k8s itself (status, metadata.resourceVersion) — ignore these
RULE: run kubectl diff against each manifest directory separately for targeted detection
ANTI_PATTERN: running kubectl diff against the entire k8s/ tree and ignoring errors
FIX: iterate directories, capture exit codes, classify drift per resource
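EXAMPLE_SKETCH (bash): a sketch of the per-directory loop from the FIX above; the k8s/*/ layout is an assumption.
#!/usr/bin/env bash
set -uo pipefail   # no -e: kubectl diff exits 1 on drift by design

for dir in k8s/*/; do
  out=$(kubectl diff -f "$dir" 2>&1); rc=$?
  case $rc in
    0) echo "OK    $dir" ;;
    1) echo "DRIFT $dir"; printf '%s\n' "$out" ;;
    *) echo "ERROR $dir (exit $rc), fix before trusting results" ;;
  esac
done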
SNAPSHOT_AND_DIFF¶
PURPOSE: capture state at known-good time, compare against current state
TECHNIQUE:
1. after successful deployment: capture full state as JSON baseline
2. store baseline in DB (session_learnings table with scope:baseline)
3. on schedule: capture current state as JSON
4. compute structured diff between baseline and current
5. classify each field difference by severity
6. if drift found: log, alert, or remediate based on severity
TOOL: jq for JSON diff computation
RUN: diff <(jq -S . baseline.json) <(jq -S . current.json)
CHECK: baseline exists and is not stale (older than last deployment)
IF: baseline missing THEN: capture new baseline before detecting drift
IF: baseline older than last deployment THEN: baseline is stale — recapture
RULE: baselines MUST be updated after every intentional change
RULE: store baselines in DB, not in memory — agent restarts should not lose baselines
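EXAMPLE_SKETCH (bash): a sketch of baseline capture plus the staleness guard; a local file stands in for the session_learnings store to keep the example short, and the resource list is illustrative.
#!/usr/bin/env bash
set -euo pipefail

capture() {
  # strip fields k8s mutates on its own (see the KUBECTL_DIFF rules above)
  kubectl get deployments,cronjobs,networkpolicies -A -o json |
    jq -S 'del(.items[].status,
               .items[].metadata.resourceVersion,
               .items[].metadata.managedFields,
               .items[].metadata.generation)'
}

case "${1:?capture|check}" in
  capture)   # run after every successful, intentional deployment
    capture > baseline.json ;;
  check)
    [ -f baseline.json ] || { echo "no baseline, capture one first"; exit 2; }
    capture > current.json
    diff baseline.json current.json || echo "DRIFT: see diff above"
    ;;
esac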
GITOPS_RECONCILIATION¶
PURPOSE: ensure cluster state matches git repository state
TECHNIQUE:
1. git pull latest manifests from repo
2. for each manifest: kubectl diff against live state
3. any diff = drift
4. classify drift: intentional (PR-backed) vs unintentional
5. unintentional drift: alert and optionally auto-revert
CHECK: every live resource has a corresponding manifest in git
CHECK: every manifest in git has a corresponding live resource
IF: live resource without manifest THEN: orphaned resource — investigate
IF: manifest without live resource THEN: failed deployment — investigate
ANTI_PATTERN: only checking resources that are in git, missing manually created resources
FIX: enumerate ALL resources in namespace, then cross-reference with git manifests
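EXAMPLE_SKETCH (bash): a sketch of the cross-reference from the FIX above; the kind list, namespace, and k8s/base/ path are assumptions, and single-namespace manifests are assumed since -o name drops the namespace.
#!/usr/bin/env bash
set -euo pipefail
NS=ge-agents

live=$(kubectl get deploy,svc,cm,cronjob,networkpolicy -n "$NS" -o name | sort -u)
declared=$(kubectl apply -f k8s/base/ --dry-run=client -o name | sort -u)

echo "live but not in git (orphaned resource, investigate):"
comm -23 <(printf '%s\n' "$live") <(printf '%s\n' "$declared")
echo "in git but not live (failed deployment, investigate):"
comm -13 <(printf '%s\n' "$live") <(printf '%s\n' "$declared")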
DRIFT_DETECTION:POLICY_AS_CODE¶
OPA_CONFTEST¶
PURPOSE: validate configurations against policies before deployment
TOOL: conftest
RUN: conftest test k8s/base/ --policy policy/
CHECK: all manifests pass policy checks before apply
CHECK: policies cover security baselines (no root, no hostNetwork, resource limits set)
CHECK: policies enforce GE-specific rules (MAXLEN on streams, HPA caps)
RULE: conftest runs in CI/CD pipeline — blocks deployment on policy violation
RULE: policies are version-controlled alongside manifests
RULE: new policies require review — overly strict policies block legitimate changes
EXAMPLE_POLICY (Rego):
package main

import rego.v1

deny contains msg if {
    input.kind == "Deployment"
    some container in input.spec.template.spec.containers
    not container.resources.limits
    msg := sprintf("Container %s must have resource limits", [container.name])
}

deny contains msg if {
    input.kind == "HorizontalPodAutoscaler"
    input.spec.maxReplicas > 5
    msg := sprintf("HPA %s maxReplicas exceeds 5 (cost burn risk)", [input.metadata.name])
}
CUSTOM_VALIDATION_SCRIPTS¶
PURPOSE: GE-specific drift checks that OPA cannot express
TOOL: bash + jq + redis-cli
RUN: bash scripts/verify-executor-safety.sh
CHECK: MAXLEN enforced on all XADD calls (grep source code)
CHECK: cost_gate.py thresholds match CLAUDE.md binding rules
CHECK: hook depth limits are correct
CHECK: chain depth limits are correct
RULE: custom scripts cover GE-specific invariants not expressible in Rego
RULE: custom scripts MUST exit non-zero on violation
RULE: run custom scripts alongside conftest — both must pass
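EXAMPLE_SKETCH (bash): the illustrative shape of such a script, here a rough line-based grep heuristic for the XADD/MAXLEN invariant; this is not the real verify-executor-safety.sh.
#!/usr/bin/env bash
set -euo pipefail
fail=0

# xadd calls that do not mention maxlen on the same line
# (multi-line calls and option variables will slip through a line-based grep)
if grep -rni 'xadd' --include='*.py' ge_agent/ | grep -iv 'maxlen'; then
  echo "VIOLATION: XADD without MAXLEN (unbounded stream growth)"
  fail=1
fi

exit "$fail"   # RULE above: exit non-zero on violation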
DRIFT_DETECTION:REMEDIATION_STRATEGY¶
AUTO_REMEDIATE_VS_ALERT_ONLY¶
RULE: auto-remediation is ONLY safe for LOW severity drift
RULE: MEDIUM and above MUST alert a human or escalation agent before remediation
RULE: CRITICAL drift MUST halt the component and escalate to mira
DECISION_MATRIX:
SEVERITY   ACTION                EXAMPLE
CRITICAL   halt + escalate       network policy removed, secret exposed
HIGH       alert + investigate   HPA exceeded, CronJob suspended
MEDIUM     alert + schedule      resource limits differ, schedule changed
LOW        log + auto-fix        label mismatch, annotation drift
CHECK: is the drift caused by an intentional change not yet committed?
IF: yes THEN: do not remediate — remind human to commit the change
IF: no THEN: follow decision matrix based on severity
ANTI_PATTERN: auto-remediating everything to reduce alert volume
FIX: keep auto-fix scoped to LOW severity; auto-remediation without human review hides security incidents
ANTI_PATTERN: alerting on everything without severity classification
FIX: unclassified alerts cause fatigue — always classify before routing
REMEDIATION_TECHNIQUES¶
TECHNIQUE: kubectl apply (revert to declared state)
1. verify the manifest in git is the correct desired state
2. kubectl apply -f <manifest> --dry-run=server (preview)
3. review the dry-run output
4. if safe: kubectl apply -f <manifest>
5. verify drift resolved: kubectl diff -f <manifest> exits 0
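EXAMPLE_SKETCH (bash): a sketch of the kubectl-apply technique above; <manifest> stays a placeholder, and the human confirmation in step 3 is deliberately not automated away.
#!/usr/bin/env bash
set -euo pipefail
MANIFEST=${1:?path to manifest}

kubectl apply -f "$MANIFEST" --dry-run=server         # step 2: server-side validation
kubectl diff -f "$MANIFEST" || true                   # step 3: review what would change
read -rp "apply and revert the drift? [y/N] " ok
[ "$ok" = y ] || exit 1
kubectl apply -f "$MANIFEST"                          # step 4
kubectl diff -f "$MANIFEST" && echo "drift resolved"  # step 5: exit 0 = clean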
TECHNIQUE: git revert (undo committed change)
1. identify the commit that introduced the drift
2. git revert <commit> on a feature branch
3. PR review → merge → deploy
4. verify drift resolved
TECHNIQUE: escalation (human decision required)
1. capture full state snapshot
2. log to session_learnings with severity
3. create discussion via admin-ui API
4. wait for human decision
5. execute human-approved remediation
RULE: always verify remediation resolved the drift — re-run detection after fix
RULE: log every remediation action to session_learnings for audit trail
DRIFT_DETECTION:SCHEDULING¶
DETECTION_INTERVALS¶
RULE: match detection frequency to severity of what could drift
SCHEDULE:
EVERY 5 MINUTES: network policies, RBAC roles, secret existence
EVERY 15 MINUTES: resource limits, HPA config, stream lengths
EVERY 1 HOUR: CronJob config, env vars, image versions
EVERY 6 HOURS: full manifest reconciliation (kubectl diff)
EVERY 24 HOURS: orphaned resource scan, baseline refresh
CHECK: detection jobs are themselves monitored (meta-monitoring)
IF: detection job fails silently THEN: drift goes undetected — monitor the monitor
IF: detection job takes longer than its interval THEN: overlap risk — add mutex
ANTI_PATTERN: running all checks at the same frequency
FIX: high-severity items need high-frequency checks, low-severity can be daily
ANTI_PATTERN: running drift detection only during business hours
FIX: drift happens 24/7, especially from automated processes — run continuously
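EXAMPLE_SCHEDULE (crontab): an illustrative mapping of the interval table to cron entries; the script names are hypothetical, and flock supplies the mutex the overlap rule above asks for.
*/5 *  * * *  flock -n /tmp/drift-5m.lock   scripts/drift-netpol-rbac-secrets.sh
*/15 * * * *  flock -n /tmp/drift-15m.lock  scripts/drift-limits-hpa-streams.sh
0 *    * * *  flock -n /tmp/drift-1h.lock   scripts/drift-cron-env-images.sh
0 */6  * * *  flock -n /tmp/drift-6h.lock   kubectl diff -f k8s/base/
0 3    * * *  flock -n /tmp/drift-24h.lock  scripts/drift-orphans-and-baseline.sh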
DRIFT_DETECTION:GE_SPECIFIC_CHECKS¶
AGENT_REGISTRY_DRIFT¶
PURPOSE: detect when AGENT-REGISTRY.json and PostgreSQL agents table diverge
TOOL: psql, jq
RUN: compare agent count, status, and provider fields between JSON and DB (see the sketch below)
CHECK: every agent in AGENT-REGISTRY.json exists in DB
CHECK: status field matches (active/unavailable/maintenance)
CHECK: provider and providerModel fields match
CHECK: no phantom agents in DB without registry entry
IF: agent status differs THEN: HIGH — agent may silently not receive work
IF: provider differs THEN: MEDIUM — wrong LLM will be used for execution
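EXAMPLE_SKETCH (bash): a sketch of the registry-vs-DB comparison; the JSON shape (.agents[]), table name, and column names are assumptions, adjust to the real schema.
#!/usr/bin/env bash
set -euo pipefail

jq -r '.agents[] | [.name, .status, .provider] | @tsv' AGENT-REGISTRY.json |
  LC_ALL=C sort > registry.tsv

psql "$DATABASE_URL" -At -F $'\t' \
  -c "SELECT name, status, provider FROM agents" | LC_ALL=C sort > db.tsv

diff registry.tsv db.tsv && echo "registry and DB in sync" ||
  echo "DRIFT: registry and DB diverged, see diff above"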
REDIS_STREAM_DRIFT¶
PURPOSE: detect when Redis stream state diverges from expected bounds
TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: XLEN triggers.{agent} for each active agent
RUN: XLEN ge:work:incoming
CHECK: per-agent stream length <= 100 (MAXLEN bound)
CHECK: system stream length <= 1000 (MAXLEN bound)
CHECK: no consumer groups with pending entries > 50
IF: stream exceeds MAXLEN THEN: CRITICAL — unbounded growth, find XADD without MAXLEN
IF: consumer lag > 50 THEN: HIGH — executor falling behind
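EXAMPLE_SKETCH (bash): a sketch of the stream-bound checks; driving the agent list from AGENT-REGISTRY.json is an assumption.
#!/usr/bin/env bash
set -euo pipefail
r() { redis-cli -p 6381 -a "$REDIS_PASSWORD" --no-auth-warning "$@"; }

for agent in $(jq -r '.agents[].name' AGENT-REGISTRY.json); do
  len=$(r XLEN "triggers.$agent")   # XLEN returns 0 for a missing stream
  [ "$len" -le 100 ] || echo "CRITICAL: triggers.$agent length $len > 100, find the XADD without MAXLEN"
done

len=$(r XLEN ge:work:incoming)
[ "$len" -le 1000 ] || echo "CRITICAL: ge:work:incoming length $len > 1000"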
COST_GATE_DRIFT¶
PURPOSE: verify cost_gate.py thresholds match CLAUDE.md binding rules
TOOL: grep
RUN: grep -n 'SESSION_LIMIT\|AGENT_HOUR_LIMIT\|DAILY_LIMIT' ge_agent/execution/cost_gate.py
CHECK: per-session limit = $5
CHECK: per-agent-hour limit = $10
CHECK: daily system limit = $100
IF: any threshold differs from CLAUDE.md THEN: CRITICAL — cost controls compromised
RULE: these values are binding — changes require human approval
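EXAMPLE_SKETCH (bash): a sketch of the cross-check; the exact constant names and literal formats inside cost_gate.py are assumptions, mirror whatever the grep above actually finds.
#!/usr/bin/env bash
set -euo pipefail
src=ge_agent/execution/cost_gate.py
fail=0

check() {  # check <CONSTANT> <expected value>
  grep -Eq "^[[:space:]]*$1[[:space:]]*=[[:space:]]*$2\b" "$src" ||
    { echo "CRITICAL: $1 != $2 in $src (cost controls compromised)"; fail=1; }
}

check SESSION_LIMIT 5        # per-session limit: $5
check AGENT_HOUR_LIMIT 10    # per-agent-hour limit: $10
check DAILY_LIMIT 100        # daily system limit: $100
exit "$fail"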