DOMAIN:SYSTEM_INTEGRITY — DRIFT_DETECTION¶
OWNER: ron
ALSO_USED_BY: gerco, thijmen, rutger, annegreet
UPDATED: 2026-03-26
SCOPE: configuration drift detection patterns, desired-state vs actual-state comparison, drift categories, remediation strategies
DRIFT_DETECTION:CORE_PRINCIPLE¶
PURPOSE: detect when actual system state diverges from declared desired state before the divergence causes incidents
RULE: drift = diff(ACTUAL_STATE, DESIRED_STATE) where DESIRED_STATE is always git-committed
RULE: every drift detection must produce a structured diff, not a boolean pass/fail
RULE: drift detection runs on schedule — never only on-demand
CHECK: is desired state defined in version control?
IF: no THEN: you cannot detect drift — define desired state first
IF: yes THEN: compare actual state against the committed version
ANTI_PATTERN: treating "it works" as evidence of no drift
FIX: detect on a schedule: systems can drift silently for weeks before the drift causes a visible failure
ANTI_PATTERN: relying on human memory for what the configuration should be
FIX: desired state MUST be codified in git — manifests, config files, or DB migrations
DRIFT_DETECTION:DRIFT_CATEGORIES¶
RESOURCE_LIMIT_DRIFT¶
PURPOSE: detect when running resource limits differ from declared manifests
TOOL: kubectl
RUN: kubectl get deployments -A -o json | jq '.items[] | . as $d | .spec.template.spec.containers[] | {ns: $d.metadata.namespace, name: $d.metadata.name, container: .name, cpu: .resources.limits.cpu, mem: .resources.limits.memory}'
CHECK: running CPU/memory limits match k8s manifest files in k8s/base/
IF: limits differ THEN: someone applied a manual change — check metadata.managedFields and the API server audit log for the source
IF: HPA maxReplicas > 5 THEN: CRITICAL — cost burn risk, see CLAUDE.md binding rules
DETECTION_METHOD: snapshot-and-diff
1. parse manifests from git: extract resources.limits per container
2. query live state: kubectl get deployment -o json
3. compare field-by-field: cpu, memory, ephemeral-storage
4. classify: MATCH, DRIFT_UP (more than declared), DRIFT_DOWN (less than declared)
5. DRIFT_UP on limits = potential cost overrun
6. DRIFT_DOWN on limits = potential OOM/throttle risk
ANTI_PATTERN: comparing only deployment-level fields, ignoring container-level overrides
FIX: iterate all containers in pod spec, not just containers[0]
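EXAMPLE_SKETCH (bash): a minimal sketch of DETECTION_METHOD steps 2-4 above, snapshotting live limits for every container (not just containers[0]) and diffing against a git-committed expectation; expected-limits.json is a hypothetical artifact you would generate from k8s/base/ in CI.
#!/usr/bin/env bash
set -euo pipefail

# snapshot live limits, one record per container, keyed ns/deploy/container
kubectl get deployments -A -o json | jq '
  [.items[] | . as $d | .spec.template.spec.containers[]
   | {("\($d.metadata.namespace)/\($d.metadata.name)/\(.name)"):
      (.resources.limits // {})}] | add // {}' > live-limits.json

# structured diff: matching keys drop out, drifted keys become records
# (classifying DRIFT_UP vs DRIFT_DOWN further requires parsing quantities like 500m, 1Gi)
jq -n --slurpfile want expected-limits.json --slurpfile got live-limits.json '
  $want[0] as $w | $got[0] as $g |
  [($w + $g) | keys[] | select($w[.] != $g[.])
   | {container: ., declared: $w[.], live: $g[.]}]'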
ENV_VAR_DRIFT¶
PURPOSE: detect when environment variables in running pods differ from manifests
TOOL: kubectl
RUN: kubectl get deployment <name> -n <ns> -o json | jq '.spec.template.spec.containers[] | {container: .name, env: .env}'
CHECK: env vars in running pod match manifest declaration
CHECK: no extra env vars injected that are not in the manifest
CHECK: env vars referencing ConfigMaps/Secrets point to correct versions
IF: extra env var found not in manifest THEN: manual injection — investigate source
IF: env var value differs THEN: ConfigMap or Secret was updated without manifest change
DRIFT_SEVERITY: HIGH if env var controls security behavior (API keys, auth flags)
DRIFT_SEVERITY: MEDIUM if env var controls operational behavior (log levels, feature flags)
DRIFT_SEVERITY: LOW if env var is informational (version labels, build metadata)
ANTI_PATTERN: comparing env vars without resolving ConfigMap/Secret references
FIX: resolve valueFrom references before comparison — the actual value matters
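EXAMPLE_SKETCH (bash): a minimal sketch of valueFrom resolution; single-key refs are assumed, "-" marks an absent field, and secret values are digested, never printed.
#!/usr/bin/env bash
set -euo pipefail
NAME=${1:?deployment} NS=${2:?namespace}

kubectl get deployment "$NAME" -n "$NS" -o json | jq -r '
  .spec.template.spec.containers[] | .name as $c | .env[]? |
  [$c, .name,
   (.valueFrom.configMapKeyRef.name // "-"), (.valueFrom.configMapKeyRef.key // "-"),
   (.valueFrom.secretKeyRef.name // "-"),    (.valueFrom.secretKeyRef.key // "-"),
   (.value // "-")] | @tsv' |
while IFS=$'\t' read -r container var cmname cmkey secname seckey literal; do
  # caveat: jsonpath breaks on keys containing dots; escape them if needed
  if [ "$cmname" != "-" ]; then      # resolve the ConfigMap reference to its value
    val=$(kubectl get configmap "$cmname" -n "$NS" -o jsonpath="{.data.$cmkey}")
  elif [ "$secname" != "-" ]; then   # compare a digest, never the plaintext
    val=$(kubectl get secret "$secname" -n "$NS" -o jsonpath="{.data.$seckey}" | sha256sum | cut -d' ' -f1)
  else
    val=$literal
  fi
  printf '%s\t%s\t%s\n' "$container" "$var" "$val"
done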
SECRET_DRIFT¶
PURPOSE: detect when secrets in cluster differ from expected state
TOOL: kubectl
RUN: kubectl get secrets -A -o json | jq '.items[] | {ns: .metadata.namespace, name: .metadata.name, keys: (.data | keys), age: .metadata.creationTimestamp}'
CHECK: secret key names match expected set (do NOT compare values in plain text)
CHECK: secret creation timestamp is newer than last known rotation
CHECK: no orphaned secrets exist (secrets not referenced by any pod)
CHECK: ge-secrets in ge-agents namespace has expected keys
IF: secret has unexpected keys THEN: investigate — may be unauthorized addition
IF: secret age > rotation policy THEN: overdue for rotation — flag for review
RULE: NEVER log or store secret values during drift detection
RULE: compare key names and metadata only — not the actual secret data
DRIFT_SEVERITY: CRITICAL if secret referenced by pod does not exist (pod will crash)
DRIFT_SEVERITY: HIGH if secret key set changed unexpectedly
DRIFT_SEVERITY: MEDIUM if rotation overdue
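EXAMPLE_SKETCH (bash): a metadata-only sketch of the rotation check; ROTATION_DAYS is illustrative and GNU date is assumed. No secret values are read or logged.
#!/usr/bin/env bash
set -euo pipefail
ROTATION_DAYS=90
now=$(date +%s)

kubectl get secrets -A -o json | jq -r '
  .items[] | [.metadata.namespace, .metadata.name, .metadata.creationTimestamp,
              ((.data // {}) | keys | join(","))] | @tsv' |
while IFS=$'\t' read -r ns name created keys; do
  age_days=$(( (now - $(date -d "$created" +%s)) / 86400 ))
  # caveat: creationTimestamp only moves when the Secret is recreated;
  # in-place rotation needs its own rotation annotation to track
  if [ "$age_days" -gt "$ROTATION_DAYS" ]; then
    echo "MEDIUM: rotation overdue: $ns/$name (${age_days}d old, keys: $keys)"
  fi
done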
CRONJOB_DRIFT¶
PURPOSE: detect when CronJob configurations differ from manifests
TOOL: kubectl
RUN: kubectl get cronjobs -A -o json | jq '.items[] | {ns: .metadata.namespace, name: .metadata.name, schedule: .spec.schedule, suspended: .spec.suspend, image: .spec.jobTemplate.spec.template.spec.containers[0].image}'
CHECK: schedule field matches manifest
CHECK: suspend field matches manifest (false unless explicitly suspended)
CHECK: container image matches expected version
CHECK: no CronJobs exist that are not declared in manifests
IF: CronJob suspended without corresponding manifest change THEN: unauthorized suspension
IF: schedule changed THEN: timing drift — may cause missed or double executions
IF: unknown CronJob found THEN: investigate origin — may be test artifact or attack
ANTI_PATTERN: only checking if CronJobs exist, not their configuration
FIX: full field comparison including schedule, image, env vars, resource limits
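EXAMPLE_SKETCH (bash): a sketch of the "no undeclared CronJobs" check; flat manifests under k8s/base/ are assumed (a kustomize or helm setup would render first), and -o name drops the namespace, so unique CronJob names are assumed too.
#!/usr/bin/env bash
set -euo pipefail

live=$(kubectl get cronjobs -A -o name | sort -u)
declared=$(kubectl apply -f k8s/base/ --dry-run=client -o name | grep '^cronjob' | sort -u || true)

# anything live but undeclared is the "unknown CronJob" case above
comm -23 <(printf '%s\n' "$live") <(printf '%s\n' "$declared") |
  sed 's/^/INVESTIGATE undeclared: /'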
NETWORK_POLICY_DRIFT¶
PURPOSE: detect when network policies differ from declared state
TOOL: kubectl
RUN: kubectl get networkpolicies -A -o json | jq '.items[] | {ns: .metadata.namespace, name: .metadata.name, podSelector: .spec.podSelector, ingress: (.spec.ingress | length), egress: (.spec.egress | length)}'
CHECK: network policies exist for all namespaces that should be isolated
CHECK: ingress/egress rules match declared policies
CHECK: no network policies have been deleted (weakening isolation)
CHECK: orchestrator-to-postgres policy exists in ge-agents namespace
IF: network policy missing THEN: CRITICAL — namespace is wide open
IF: rules added that are not in manifest THEN: investigate — may be emergency fix or breach
IF: rules removed THEN: CRITICAL — isolation weakened
DRIFT_SEVERITY: CRITICAL for any network policy removal or weakening
DRIFT_SEVERITY: HIGH for undeclared rule additions
DRIFT_SEVERITY: MEDIUM for selector changes
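EXAMPLE_SKETCH (bash): a sketch of the presence check; the ISOLATED_NAMESPACES list is an assumption, derive it from your manifests rather than hardcoding.
#!/usr/bin/env bash
set -euo pipefail
ISOLATED_NAMESPACES="ge-agents"   # extend with every namespace that must be isolated

for ns in $ISOLATED_NAMESPACES; do
  count=$(kubectl get networkpolicies -n "$ns" -o json | jq '.items | length')
  if [ "$count" -eq 0 ]; then
    echo "CRITICAL: namespace $ns has no NetworkPolicy (wide open)"
  fi
done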
DRIFT_DETECTION:COMPARISON_TECHNIQUES¶
KUBECTL_DIFF¶
PURPOSE: compare declared manifests against live cluster state
TOOL: kubectl
RUN: kubectl diff -f k8s/base/
CHECK: exit code 0 means no drift
CHECK: exit code 1 means drift detected — review the diff output
CHECK: exit code > 1 means error in comparison — fix before trusting results
RULE: kubectl diff shows what would change if you applied the manifest
RULE: some fields are set by k8s itself (status, metadata.resourceVersion) — ignore these
RULE: run kubectl diff against each manifest directory separately for targeted detection
ANTI_PATTERN: running kubectl diff against the entire k8s/ tree and ignoring errors
FIX: iterate directories, capture exit codes, classify drift per resource
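EXAMPLE_SKETCH (bash): a sketch of the per-directory loop from the FIX above; the k8s/*/ layout is an assumption.
#!/usr/bin/env bash
set -uo pipefail   # no -e: kubectl diff exits 1 on drift by design

for dir in k8s/*/; do
  out=$(kubectl diff -f "$dir" 2>&1); rc=$?
  case $rc in
    0) echo "OK    $dir" ;;
    1) echo "DRIFT $dir"; printf '%s\n' "$out" ;;
    *) echo "ERROR $dir (exit $rc), fix before trusting results" ;;
  esac
done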
SNAPSHOT_AND_DIFF¶
PURPOSE: capture state at known-good time, compare against current state
TECHNIQUE:
1. after successful deployment: capture full state as JSON baseline
2. store baseline in DB (session_learnings table with scope:baseline)
3. on schedule: capture current state as JSON
4. compute structured diff between baseline and current
5. classify each field difference by severity
6. if drift found: log, alert, or remediate based on severity
TOOL: jq for JSON diff computation
RUN: diff <(jq -S . baseline.json) <(jq -S . current.json)
CHECK: baseline exists and is not stale (older than last deployment)
IF: baseline missing THEN: capture new baseline before detecting drift
IF: baseline older than last deployment THEN: baseline is stale — recapture
RULE: baselines MUST be updated after every intentional change
RULE: store baselines in DB, not in memory — agent restarts should not lose baselines
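EXAMPLE_SKETCH (bash): a sketch of baseline capture plus the staleness guard; a local file stands in for the session_learnings store to keep the example short, and the resource list is illustrative.
#!/usr/bin/env bash
set -euo pipefail

capture() {
  # strip fields k8s mutates on its own (see the KUBECTL_DIFF rules above)
  kubectl get deployments,cronjobs,networkpolicies -A -o json |
    jq -S 'del(.items[].status,
               .items[].metadata.resourceVersion,
               .items[].metadata.managedFields,
               .items[].metadata.generation)'
}

case "${1:?capture|check}" in
  capture)   # run after every successful, intentional deployment
    capture > baseline.json ;;
  check)
    [ -f baseline.json ] || { echo "no baseline, capture one first"; exit 2; }
    capture > current.json
    diff baseline.json current.json || echo "DRIFT: see diff above"
    ;;
esac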
GITOPS_RECONCILIATION¶
PURPOSE: ensure cluster state matches git repository state
TECHNIQUE:
1. git pull latest manifests from repo
2. for each manifest: kubectl diff against live state
3. any diff = drift
4. classify drift: intentional (PR-backed) vs unintentional
5. unintentional drift: alert and optionally auto-revert
CHECK: every live resource has a corresponding manifest in git
CHECK: every manifest in git has a corresponding live resource
IF: live resource without manifest THEN: orphaned resource — investigate
IF: manifest without live resource THEN: failed deployment — investigate
ANTI_PATTERN: only checking resources that are in git, missing manually created resources
FIX: enumerate ALL resources in namespace, then cross-reference with git manifests
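EXAMPLE_SKETCH (bash): a sketch of the cross-reference from the FIX above; the kind list, namespace, and k8s/base/ path are assumptions, and single-namespace manifests are assumed since -o name drops the namespace.
#!/usr/bin/env bash
set -euo pipefail
NS=ge-agents

live=$(kubectl get deploy,svc,cm,cronjob,networkpolicy -n "$NS" -o name | sort -u)
declared=$(kubectl apply -f k8s/base/ --dry-run=client -o name | sort -u)

echo "live but not in git (orphaned resource, investigate):"
comm -23 <(printf '%s\n' "$live") <(printf '%s\n' "$declared")
echo "in git but not live (failed deployment, investigate):"
comm -13 <(printf '%s\n' "$live") <(printf '%s\n' "$declared")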
DRIFT_DETECTION:POLICY_AS_CODE¶
OPA_CONFTEST¶
PURPOSE: validate configurations against policies before deployment
TOOL: conftest
RUN: conftest test k8s/base/ --policy policy/
CHECK: all manifests pass policy checks before apply
CHECK: policies cover security baselines (no root, no hostNetwork, resource limits set)
CHECK: policies enforce GE-specific rules (MAXLEN on streams, HPA caps)
RULE: conftest runs in CI/CD pipeline — blocks deployment on policy violation
RULE: policies are version-controlled alongside manifests
RULE: new policies require review — overly strict policies block legitimate changes
EXAMPLE_POLICY (Rego):
package main

import rego.v1

deny contains msg if {
    input.kind == "Deployment"
    some container in input.spec.template.spec.containers
    not container.resources.limits
    msg := sprintf("Container %s must have resource limits", [container.name])
}

deny contains msg if {
    input.kind == "HorizontalPodAutoscaler"
    input.spec.maxReplicas > 5
    msg := sprintf("HPA %s maxReplicas exceeds 5 (cost burn risk)", [input.metadata.name])
}
CUSTOM_VALIDATION_SCRIPTS¶
PURPOSE: GE-specific drift checks that OPA cannot express
TOOL: bash + jq + redis-cli
RUN: bash scripts/verify-executor-safety.sh
CHECK: MAXLEN enforced on all XADD calls (grep source code)
CHECK: cost_gate.py thresholds match CLAUDE.md binding rules
CHECK: hook depth limits are correct
CHECK: chain depth limits are correct
RULE: custom scripts cover GE-specific invariants not expressible in Rego
RULE: custom scripts MUST exit non-zero on violation
RULE: run custom scripts alongside conftest — both must pass
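EXAMPLE_SKETCH (bash): the illustrative shape of such a script, here a rough line-based grep heuristic for the XADD/MAXLEN invariant; this is not the real verify-executor-safety.sh.
#!/usr/bin/env bash
set -euo pipefail
fail=0

# xadd calls that do not mention maxlen on the same line
# (multi-line calls and option variables will slip through a line-based grep)
if grep -rni 'xadd' --include='*.py' ge_agent/ | grep -iv 'maxlen'; then
  echo "VIOLATION: XADD without MAXLEN (unbounded stream growth)"
  fail=1
fi

exit "$fail"   # RULE above: exit non-zero on violation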
DRIFT_DETECTION:REMEDIATION_STRATEGY¶
AUTO_REMEDIATE_VS_ALERT_ONLY¶
RULE: auto-remediation is ONLY safe for LOW severity drift
RULE: MEDIUM and above MUST alert a human or escalation agent before remediation
RULE: CRITICAL drift MUST halt the component and escalate to mira
DECISION_MATRIX:
SEVERITY   ACTION                EXAMPLE
CRITICAL   halt + escalate       network policy removed, secret exposed
HIGH       alert + investigate   HPA exceeded, CronJob suspended
MEDIUM     alert + schedule      resource limits differ, schedule changed
LOW        log + auto-fix        label mismatch, annotation drift
CHECK: is the drift caused by an intentional change not yet committed?
IF: yes THEN: do not remediate — remind human to commit the change
IF: no THEN: follow decision matrix based on severity
ANTI_PATTERN: auto-remediating everything to reduce alert volume
FIX: keep auto-fix scoped to LOW severity; auto-remediation without human review hides security incidents
ANTI_PATTERN: alerting on everything without severity classification
FIX: unclassified alerts cause fatigue — always classify before routing
REMEDIATION_TECHNIQUES¶
TECHNIQUE: kubectl apply (revert to declared state)
1. verify the manifest in git is the correct desired state
2. kubectl apply -f <manifest> --dry-run=server (preview)
3. review the dry-run output
4. if safe: kubectl apply -f <manifest>
5. verify drift resolved: kubectl diff -f <manifest> exits 0
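EXAMPLE_SKETCH (bash): a sketch of the kubectl-apply technique above; <manifest> stays a placeholder, and the human confirmation in step 3 is deliberately not automated away.
#!/usr/bin/env bash
set -euo pipefail
MANIFEST=${1:?path to manifest}

kubectl apply -f "$MANIFEST" --dry-run=server         # step 2: server-side validation
kubectl diff -f "$MANIFEST" || true                   # step 3: review what would change
read -rp "apply and revert the drift? [y/N] " ok
[ "$ok" = y ] || exit 1
kubectl apply -f "$MANIFEST"                          # step 4
kubectl diff -f "$MANIFEST" && echo "drift resolved"  # step 5: exit 0 = clean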
TECHNIQUE: git revert (undo committed change)
1. identify the commit that introduced the drift
2. git revert <commit> on a feature branch
3. PR review → merge → deploy
4. verify drift resolved
TECHNIQUE: escalation (human decision required)
1. capture full state snapshot
2. log to session_learnings with severity
3. create discussion via admin-ui API
4. wait for human decision
5. execute human-approved remediation
RULE: always verify remediation resolved the drift — re-run detection after fix
RULE: log every remediation action to session_learnings for audit trail
DRIFT_DETECTION:SCHEDULING¶
DETECTION_INTERVALS¶
RULE: match detection frequency to severity of what could drift
SCHEDULE:
EVERY 5 MINUTES: network policies, RBAC roles, secret existence
EVERY 15 MINUTES: resource limits, HPA config, stream lengths
EVERY 1 HOUR: CronJob config, env vars, image versions
EVERY 6 HOURS: full manifest reconciliation (kubectl diff)
EVERY 24 HOURS: orphaned resource scan, baseline refresh
CHECK: detection jobs are themselves monitored (meta-monitoring)
IF: detection job fails silently THEN: drift goes undetected — monitor the monitor
IF: detection job takes longer than its interval THEN: overlap risk — add mutex
ANTI_PATTERN: running all checks at the same frequency
FIX: high-severity items need high-frequency checks, low-severity can be daily
ANTI_PATTERN: running drift detection only during business hours
FIX: drift happens 24/7, especially from automated processes — run continuously
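EXAMPLE_SCHEDULE (crontab): an illustrative mapping of the interval table to cron entries; the script names are hypothetical, and flock supplies the mutex the overlap rule above asks for.
*/5 *  * * *  flock -n /tmp/drift-5m.lock   scripts/drift-netpol-rbac-secrets.sh
*/15 * * * *  flock -n /tmp/drift-15m.lock  scripts/drift-limits-hpa-streams.sh
0 *    * * *  flock -n /tmp/drift-1h.lock   scripts/drift-cron-env-images.sh
0 */6  * * *  flock -n /tmp/drift-6h.lock   kubectl diff -f k8s/base/
0 3    * * *  flock -n /tmp/drift-24h.lock  scripts/drift-orphans-and-baseline.sh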
DRIFT_DETECTION:GE_SPECIFIC_CHECKS¶
AGENT_REGISTRY_DRIFT¶
PURPOSE: detect when AGENT-REGISTRY.json and PostgreSQL agents table diverge
TOOL: psql, jq
RUN: compare agent count, status, and provider fields between JSON and DB (see the sketch below)
CHECK: every agent in AGENT-REGISTRY.json exists in DB
CHECK: status field matches (active/unavailable/maintenance)
CHECK: provider and providerModel fields match
CHECK: no phantom agents in DB without registry entry
IF: agent status differs THEN: HIGH — agent may silently not receive work
IF: provider differs THEN: MEDIUM — wrong LLM will be used for execution
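EXAMPLE_SKETCH (bash): a sketch of the registry-vs-DB comparison; the JSON shape (.agents[]), table name, and column names are assumptions, adjust to the real schema.
#!/usr/bin/env bash
set -euo pipefail

jq -r '.agents[] | [.name, .status, .provider] | @tsv' AGENT-REGISTRY.json |
  LC_ALL=C sort > registry.tsv

psql "$DATABASE_URL" -At -F $'\t' \
  -c "SELECT name, status, provider FROM agents" | LC_ALL=C sort > db.tsv

diff registry.tsv db.tsv && echo "registry and DB in sync" ||
  echo "DRIFT: registry and DB diverged, see diff above"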
REDIS_STREAM_DRIFT¶
PURPOSE: detect when Redis stream state diverges from expected bounds
TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: XLEN triggers.{agent} for each active agent
RUN: XLEN ge:work:incoming
CHECK: per-agent stream length <= 100 (MAXLEN bound)
CHECK: system stream length <= 1000 (MAXLEN bound)
CHECK: no consumer groups with pending entries > 50
IF: stream exceeds MAXLEN THEN: CRITICAL — unbounded growth, find XADD without MAXLEN
IF: consumer lag > 50 THEN: HIGH — executor falling behind
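EXAMPLE_SKETCH (bash): a sketch of the stream-bound checks; driving the agent list from AGENT-REGISTRY.json is an assumption.
#!/usr/bin/env bash
set -euo pipefail
r() { redis-cli -p 6381 -a "$REDIS_PASSWORD" --no-auth-warning "$@"; }

for agent in $(jq -r '.agents[].name' AGENT-REGISTRY.json); do
  len=$(r XLEN "triggers.$agent")   # XLEN returns 0 for a missing stream
  [ "$len" -le 100 ] || echo "CRITICAL: triggers.$agent length $len > 100, find the XADD without MAXLEN"
done

len=$(r XLEN ge:work:incoming)
[ "$len" -le 1000 ] || echo "CRITICAL: ge:work:incoming length $len > 1000"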
COST_GATE_DRIFT¶
PURPOSE: verify cost_gate.py thresholds match CLAUDE.md binding rules
TOOL: grep
RUN: grep -n 'SESSION_LIMIT\|AGENT_HOUR_LIMIT\|DAILY_LIMIT' ge_agent/execution/cost_gate.py
CHECK: per-session limit = $5
CHECK: per-agent-hour limit = $10
CHECK: daily system limit = $100
IF: any threshold differs from CLAUDE.md THEN: CRITICAL — cost controls compromised
RULE: these values are binding — changes require human approval
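EXAMPLE_SKETCH (bash): a sketch of the cross-check; the exact constant names and literal formats inside cost_gate.py are assumptions, mirror whatever the grep above actually finds.
#!/usr/bin/env bash
set -euo pipefail
src=ge_agent/execution/cost_gate.py
fail=0

check() {  # check <CONSTANT> <expected value>
  grep -Eq "^[[:space:]]*$1[[:space:]]*=[[:space:]]*$2\b" "$src" ||
    { echo "CRITICAL: $1 != $2 in $src (cost controls compromised)"; fail=1; }
}

check SESSION_LIMIT 5        # per-session limit: $5
check AGENT_HOUR_LIMIT 10    # per-agent-hour limit: $10
check DAILY_LIMIT 100        # daily system limit: $100
exit "$fail"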