DOMAIN:SYSTEM_INTEGRITY

OWNER: ron
UPDATED: 2026-03-24
SCOPE: configuration drift detection, system state verification, compliance monitoring
AGENTS: ron (System Integrity Monitor), mira (Escalation Manager)


SYSTEM_INTEGRITY:CONFIGURATION_DRIFT_DETECTION

PURPOSE: detect unauthorized or accidental changes to system configuration by comparing actual state vs declared state

WHAT_TO_CHECK

RULE: every check compares ACTUAL (runtime state) vs EXPECTED (config file / DB / manifest)

K8S_RESOURCE_LIMITS

TOOL: kubectl
RUN: kubectl get deployments -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, replicas: .spec.replicas, cpu_limit: .spec.template.spec.containers[0].resources.limits.cpu, mem_limit: .spec.template.spec.containers[0].resources.limits.memory}'

CHECK: deployment replicas match expected count
CHECK: resource limits match declared values in manifests under k8s/base/
CHECK: HPA maxReplicas <= 5 for executor (BINDING — see CLAUDE.md)
CHECK: container image tag matches expected (no :latest drift from rebuild)

EXPECTED_SOURCE: k8s manifest files in k8s/base/agents/, k8s/base/system/
DRIFT_SEVERITY: HIGH if HPA maxReplicas > 5 (cost burn risk)
DRIFT_SEVERITY: MEDIUM if resource limits differ from manifest
DRIFT_SEVERITY: LOW if replica count temporarily differs (HPA scaling is normal)
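The ACTUAL-vs-EXPECTED comparison and severity mapping above can be sketched as follows. This is a minimal illustration, assuming both states have already been parsed into dicts shaped like the jq output above; the function name and field names are hypothetical, not existing code.

```python
def classify_deployment_drift(actual, expected, hpa_max_replicas=None):
    """Compare runtime deployment state vs declared manifest; return (severity, detail) findings."""
    findings = []
    # HPA maxReplicas > 5 is the BINDING executor limit (cost burn risk): HIGH
    if hpa_max_replicas is not None and hpa_max_replicas > 5:
        findings.append(("HIGH", f"HPA maxReplicas {hpa_max_replicas} > 5"))
    # resource limits differing from the manifest: MEDIUM
    for field in ("cpu_limit", "mem_limit"):
        if actual.get(field) != expected.get(field):
            findings.append(("MEDIUM", f"{field}: {actual.get(field)} != {expected.get(field)}"))
    # replica count drift is LOW: HPA scaling changes replicas in normal operation
    if actual.get("replicas") != expected.get("replicas"):
        findings.append(("LOW", f"replicas: {actual.get('replicas')} != {expected.get('replicas')}"))
    return findings
```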

REDIS_STREAM_HEALTH

TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: redis-cli -p 6381 -a $REDIS_PASSWORD INFO memory — check used_memory_rss
RUN: redis-cli -p 6381 -a $REDIS_PASSWORD XLEN ge:work:incoming
RUN: for each agent: redis-cli -p 6381 -a $REDIS_PASSWORD XLEN triggers.{agent}

CHECK: stream length exceeds MAXLEN thresholds (100/agent, 1000/system)
IF: exceeded THEN: CRITICAL — XADD without MAXLEN is active, find source immediately
CHECK: memory usage > 80% of allocated
IF: yes THEN: WARNING — streams may need trimming, check for abandoned consumers

EXPECTED_SOURCE: MAXLEN values defined in code (task-service.ts, ge-orchestrator/main.py)
DRIFT_SEVERITY: CRITICAL if MAXLEN violated (unbounded growth = OOM risk)
DRIFT_SEVERITY: HIGH if memory > 80%
DRIFT_SEVERITY: LOW if stream lengths elevated but within bounds
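The MAXLEN threshold check can be sketched as below, assuming XLEN results have been collected into a dict. The constants mirror the 100/agent and 1000/system thresholds above; the function name is illustrative.

```python
AGENT_STREAM_MAXLEN = 100    # per-agent trigger streams (triggers.{agent})
SYSTEM_STREAM_MAXLEN = 1000  # ge:work:incoming

def check_stream_lengths(lengths):
    """lengths: {stream_name: XLEN result}. Over-cap streams mean XADD
    without MAXLEN is active somewhere, so each is flagged CRITICAL."""
    findings = []
    for stream, length in lengths.items():
        cap = SYSTEM_STREAM_MAXLEN if stream == "ge:work:incoming" else AGENT_STREAM_MAXLEN
        if length > cap:
            findings.append(("CRITICAL", stream))
    return findings
```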

CRONJOB_INVENTORY

TOOL: kubectl
RUN: kubectl get cronjobs -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, schedule: .spec.schedule, suspended: .spec.suspend, lastSchedule: .status.lastScheduleTime}'

CHECK: all expected CronJobs exist and are not suspended
CHECK: no unexpected CronJobs have been added
CHECK: schedule matches declared values
CHECK: lastScheduleTime is recent (not silently failing)

EXPECTED_SOURCE: declared CronJob manifests in k8s/base/
DRIFT_SEVERITY: HIGH if CronJob suspended without record (silent failure)
DRIFT_SEVERITY: MEDIUM if schedule changed
DRIFT_SEVERITY: LOW if new CronJob added (may be intentional)
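A sketch of the inventory diff, assuming both sides are dicts keyed by CronJob name (field names follow the jq output above; the function itself is hypothetical):

```python
def diff_cronjobs(actual, expected):
    """actual/expected: {name: {"schedule": str, "suspended": bool}}."""
    findings = []
    for name, spec in expected.items():
        live = actual.get(name)
        if live is None:
            findings.append(("HIGH", f"{name}: missing"))
        elif live.get("suspended") and not spec.get("suspended"):
            # suspended without record = silent failure
            findings.append(("HIGH", f"{name}: suspended without record"))
        elif live.get("schedule") != spec.get("schedule"):
            findings.append(("MEDIUM", f"{name}: schedule changed"))
    for name in actual:
        if name not in expected:
            # may be intentional, so only LOW
            findings.append(("LOW", f"{name}: unexpected CronJob"))
    return findings
```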

FILE_PERMISSIONS

TOOL: stat, find
RUN: stat -c '%a %U %G %n' /home/claude/ge-bootstrap/config/*.yaml
RUN: stat -c '%a %U %G %n' /home/claude/ge-bootstrap/ge-ops/master/AGENT-REGISTRY.json

CHECK: config files are owned by claude:claude
CHECK: config files are not world-writable (others-write bit clear: no mode whose last octal digit is 2, 3, 6, or 7)
CHECK: AGENT-REGISTRY.json permissions haven't changed
CHECK: no setuid/setgid bits on any GE files

EXPECTED_SOURCE: initial deployment permissions (captured at install time)
DRIFT_SEVERITY: CRITICAL if world-writable config files (security breach vector)
DRIFT_SEVERITY: HIGH if ownership changed
DRIFT_SEVERITY: LOW if group permissions differ
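The permission checks above map directly onto stat mode bits. A minimal sketch (function name illustrative; the severity labels follow this section):

```python
import os
import stat

def permission_findings(path, expected_owner_uid=None):
    """Flag world-writable files, setuid/setgid bits, and ownership drift."""
    st = os.stat(path)
    findings = []
    if st.st_mode & stat.S_IWOTH:  # others-write bit set: world-writable
        findings.append("CRITICAL: world-writable")
    if st.st_mode & (stat.S_ISUID | stat.S_ISGID):  # setuid/setgid bits
        findings.append("HIGH: setuid/setgid bit set")
    if expected_owner_uid is not None and st.st_uid != expected_owner_uid:
        findings.append("HIGH: ownership changed")
    return findings
```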

CONTAINER_IMAGE_VERSIONS

TOOL: kubectl, k3s
RUN: kubectl get pods -A -o json | jq '.items[] | {namespace: .metadata.namespace, pod: .metadata.name, containers: [.spec.containers[] | {name: .name, image: .image}]}'
RUN: sudo k3s ctr images ls | grep ge-bootstrap

CHECK: running image digest matches last built image
CHECK: no pods running stale images after a rollout
CHECK: imagePullPolicy is correct (Never for local builds, Always for registry)

EXPECTED_SOURCE: last build-executor.sh run timestamp vs pod creation timestamp
DRIFT_SEVERITY: HIGH if pod running image older than last build (stale deployment)
DRIFT_SEVERITY: MEDIUM if imagePullPolicy mismatch
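The build-timestamp-vs-pod-creation comparison can be sketched as below, assuming pod records shaped like the jq output plus metadata.creationTimestamp; the function name is hypothetical.

```python
from datetime import datetime, timezone

def stale_pods(pods, last_build):
    """pods: list of {"name": str, "created": ISO 8601 str from kubectl}.
    A pod created before the last image build is running a stale image."""
    stale = []
    for pod in pods:
        # kubectl emits Z-suffixed UTC timestamps; normalize for fromisoformat
        created = datetime.fromisoformat(pod["created"].replace("Z", "+00:00"))
        if created < last_build:
            stale.append(pod["name"])
    return stale
```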

REGISTRY_DB_SYNC

TOOL: psql or admin-ui API
RUN: compare AGENT-REGISTRY.json agent count and status vs agents table in PostgreSQL

CHECK: every agent in AGENT-REGISTRY.json exists in DB
CHECK: status field matches between JSON and DB
CHECK: provider/providerModel matches between JSON and DB
CHECK: no phantom agents in DB that don't exist in registry

EXPECTED_SOURCE: AGENT-REGISTRY.json is source of truth for agent definitions
DRIFT_SEVERITY: HIGH if agent status differs (agent may silently not receive work)
DRIFT_SEVERITY: MEDIUM if provider config differs (wrong LLM will be used)
DRIFT_SEVERITY: LOW if metadata fields differ
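A sketch of the registry-vs-DB comparison, assuming both sides have been loaded into dicts keyed by agent name (the registry JSON on one side, the agents table rows on the other; the function and field shapes are illustrative):

```python
def registry_db_diff(registry_agents, db_agents):
    """AGENT-REGISTRY.json is the source of truth for agent definitions."""
    findings = []
    for name, reg in registry_agents.items():
        row = db_agents.get(name)
        if row is None:
            findings.append(("HIGH", f"{name}: missing from DB"))
            continue
        if row.get("status") != reg.get("status"):
            # agent may silently not receive work
            findings.append(("HIGH", f"{name}: status mismatch"))
        if row.get("provider") != reg.get("provider"):
            # wrong LLM will be used
            findings.append(("MEDIUM", f"{name}: provider mismatch"))
    for name in db_agents:
        if name not in registry_agents:
            findings.append(("HIGH", f"{name}: phantom agent in DB"))
    return findings
```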

DRIFT_DETECTION_METHODOLOGY

RULE: always compare ACTUAL vs EXPECTED — never rely on "looks right"
RULE: capture baseline snapshots on every deployment
RULE: run drift detection on schedule (every 15 minutes for critical, hourly for medium, daily for low)

TECHNIQUE: snapshot-and-diff

1. capture current state as JSON snapshot
2. compare against last known-good baseline
3. classify each difference by severity
4. store diff in session_learnings if drift detected
5. auto-remediate LOW severity (log only)
6. alert for MEDIUM severity (notify ron)
7. escalate for HIGH/CRITICAL (notify mira → human)

TOOL: kubectl diff -f k8s/base/ — compare declared manifests vs live state
NOTE: kubectl diff exits 0 if no diff, 1 if diff exists, >1 if error

ANTI_PATTERN: checking drift only when something breaks
FIX: scheduled drift detection catches problems before they cause incidents

ANTI_PATTERN: storing baseline snapshots in memory only
FIX: persist baselines to DB or filesystem — agent restarts should not lose baseline
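The snapshot-and-diff steps plus the persistence fix above can be sketched together. This is a minimal illustration with hypothetical function names; real baselines would go to the DB or a known filesystem path.

```python
import json

def save_baseline(snapshot, path):
    # persist to disk so an agent restart does not lose the baseline
    with open(path, "w") as f:
        json.dump(snapshot, f, sort_keys=True)

def load_baseline(path):
    with open(path) as f:
        return json.load(f)

def diff_snapshots(baseline, current):
    """Return {key: {"expected": ..., "actual": ...}} for every differing key."""
    return {
        key: {"expected": baseline.get(key), "actual": current.get(key)}
        for key in set(baseline) | set(current)
        if baseline.get(key) != current.get(key)
    }
```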

DRIFT_SEVERITY_CLASSIFICATION

CRITICAL (security)

  • file permissions allow unauthorized access
  • secrets exposed in environment variables or logs
  • RBAC roles changed without authorization
  • network policies removed or weakened
  • TLS certificates expired or downgraded
  • container running as root when it shouldn't

HIGH (operational risk)

  • HPA maxReplicas exceeded safe limit (5)
  • MAXLEN not enforced on Redis streams
  • cost_gate bypassed or disabled
  • agent status changed to unavailable without record
  • CronJob suspended without authorization
  • container image stale after deployment

MEDIUM (resource/config)

  • resource limits differ from manifest
  • CronJob schedule changed
  • provider config mismatch between registry and DB
  • Redis memory usage > 80%
  • pod restart count > 10 in last hour

LOW (cosmetic/informational)

  • replica count differs (HPA normal operation)
  • metadata field mismatch
  • log level changed
  • label/annotation differences

SYSTEM_INTEGRITY:COMPLIANCE_MAPPING

ISO_27001_A_8_9 (Configuration Management)

REQUIREMENT: changes to configurations shall be authorized, documented, and verified
MAPPING:
- authorized: git commit history shows who changed what
- documented: AGENT-REGISTRY.json changes tracked in git
- verified: drift detection confirms actual matches declared

CHECK: every config change has a corresponding git commit
IF: config drift detected without git commit THEN: UNAUTHORIZED CHANGE — escalate immediately
DRIFT_SEVERITY: CRITICAL
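The "drift without a commit = unauthorized" check reduces to a set difference once both inputs are collected (drifted paths from drift detection, committed paths from git log --name-only). A sketch with a hypothetical function name:

```python
def unauthorized_changes(drifted_files, committed_files):
    """drifted_files: paths flagged by drift detection.
    committed_files: paths touched by recent git commits.
    Drift with no matching commit is an unauthorized change (CRITICAL)."""
    return sorted(set(drifted_files) - set(committed_files))
```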

SOC_2_CC8_1 (Change Management)

REQUIREMENT: changes to infrastructure and software are authorized and managed
MAPPING:
- ge-orchestrator/main.py enforces work routing — no ad-hoc task dispatch
- cost_gate.py enforces execution budgets — no unlimited spending
- container image rebuild is the only deploy path — no kubectl cp
- AGENT-REGISTRY.json is the agent SSOT — no phantom agents

CHECK: are there running processes not traceable to a managed deployment?
RUN: kubectl get pods -A --field-selector=status.phase=Running -o json | jq '.items[] | {name: .metadata.name, image: .spec.containers[0].image, created: .metadata.creationTimestamp}'
IF: pod running image not matching any known build THEN: unauthorized deployment

SOC_2_CC7_2 (Monitoring)

REQUIREMENT: anomalies detected and evaluated
MAPPING:
- cost_gate.py monitors per-session, per-agent, and daily costs
- hook loop prevention detects recursive trigger patterns
- stream length monitoring detects unbounded growth
- drift detection catches config changes

RULE: all anomalies must be logged to session_learnings table
RULE: CRITICAL and HIGH anomalies must be escalated to mira within 5 minutes
RULE: maintain audit trail of all drift detections and resolutions


SYSTEM_INTEGRITY:UNAUTHORIZED_CHANGE_DETECTION

DETECTION_TECHNIQUES

TECHNIQUE: filesystem integrity monitoring

1. compute sha256 checksums of critical config files
2. store checksums in DB as baseline
3. periodically recompute and compare
4. any mismatch = potential unauthorized change

CRITICAL_FILES to checksum:
- config/ports.yaml
- config/dolly-routing.yaml
- config/agent-execution.yaml
- config/constitution.md
- config/post-completion-hooks.yaml
- ge-ops/master/AGENT-REGISTRY.json

TOOL: sha256sum
RUN: sha256sum /home/claude/ge-bootstrap/config/ports.yaml
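The checksum baseline loop (steps 1-3 above) can be sketched in a few lines; function names are illustrative, and a real baseline would be stored in the DB per step 2.

```python
import hashlib

def sha256_file(path):
    """Stream the file in chunks so large files don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_baseline(baseline):
    """baseline: {path: expected_sha256}. Returns paths whose checksum changed,
    i.e. potential unauthorized changes."""
    return [p for p, digest in baseline.items() if sha256_file(p) != digest]
```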

TECHNIQUE: git-based change detection
RUN: git status --porcelain — any uncommitted changes to config files
RUN: git log --since="1 hour ago" --name-only -- config/ — recent config changes
IF: uncommitted config changes exist THEN: either commit or investigate

TECHNIQUE: k8s audit logging
RUN: kubectl get events -A --sort-by=.lastTimestamp | head -50
CHECK: unexpected DELETE, PATCH, or UPDATE events on critical resources

ANTI_PATTERN: running integrity checks from inside the system being monitored
FIX: use independent baseline (git history) as ground truth, not self-reported state

ANTI_PATTERN: alerting on every change without context
FIX: correlate with git commits and deployment events — authorized changes have a paper trail

RESPONSE_PROTOCOL

ON_CRITICAL_DRIFT:
1. capture full state snapshot (kubectl, redis-cli, checksums)
2. log to session_learnings with scope:system and severity:critical
3. escalate to mira via admin-ui API discussion
4. IF auto-remediation is safe (e.g., restore file from git) THEN: remediate and log
5. ELSE: HALT affected component, wait for human decision

ON_HIGH_DRIFT:
1. log to session_learnings
2. notify ron (self) for investigation
3. IF root cause identified within 15 minutes THEN: remediate
4. ELSE: escalate to mira

ON_MEDIUM_DRIFT:
1. log to session_learnings
2. add to daily drift report
3. remediate in next maintenance window

ON_LOW_DRIFT:
1. log only
2. include in weekly summary
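The four response ladders above amount to a dispatch table keyed by severity. A simplified sketch (step labels are shorthand for the protocol lines above, not real API calls):

```python
def respond_to_drift(severity):
    """Map a drift severity to its ordered response steps (simplified)."""
    protocol = {
        "CRITICAL": ["snapshot", "log:session_learnings", "escalate:mira", "remediate-or-halt"],
        "HIGH":     ["log:session_learnings", "notify:ron", "remediate-or-escalate"],
        "MEDIUM":   ["log:session_learnings", "daily-report", "maintenance-window"],
        "LOW":      ["log", "weekly-summary"],
    }
    return protocol[severity]
```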


SYSTEM_INTEGRITY:TOOLS_AND_COMMANDS

K8S_INSPECTION

TOOL: kubectl
RUN: kubectl diff -f k8s/base/ — declared vs live state
RUN: kubectl get pods -A -o wide — all pods with node info
RUN: kubectl top pods -A — actual resource usage
RUN: kubectl get events -A --sort-by=.lastTimestamp — recent events
RUN: kubectl get networkpolicies -A — network isolation rules

REDIS_INSPECTION

TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: INFO memory — memory usage
RUN: INFO clients — connected clients
RUN: XINFO STREAM ge:work:incoming — system stream info
RUN: DBSIZE — total key count
RUN: SLOWLOG GET 10 — recent slow commands

FILE_INTEGRITY

TOOL: sha256sum, stat, git
RUN: sha256sum config/*.yaml config/*.md — config checksums
RUN: git diff -- config/ — uncommitted config changes
RUN: git log --oneline -10 -- config/ — recent config history

HEALTH_AGGREGATION

TOOL: scripts/k8s-health-dump.sh
PURPOSE: host cron runs this to dump cluster health to public/k8s-health.json
NOTE: used by admin-ui infrastructure page (avoids broken ClusterIP from inside pods)
RUN: bash scripts/k8s-health-dump.sh