DOMAIN:SYSTEM_INTEGRITY¶
OWNER: ron
UPDATED: 2026-03-24
SCOPE: configuration drift detection, system state verification, compliance monitoring
AGENTS: ron (System Integrity Monitor), mira (Escalation Manager)
SYSTEM_INTEGRITY:CONFIGURATION_DRIFT_DETECTION¶
PURPOSE: detect unauthorized or accidental changes to system configuration by comparing actual state vs declared state
WHAT_TO_CHECK¶
RULE: every check compares ACTUAL (runtime state) vs EXPECTED (config file / DB / manifest)
K8S_RESOURCE_LIMITS¶
TOOL: kubectl
RUN: kubectl get deployments -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, replicas: .spec.replicas, cpu_limit: .spec.template.spec.containers[0].resources.limits.cpu, mem_limit: .spec.template.spec.containers[0].resources.limits.memory}'
CHECK: deployment replicas match expected count
CHECK: resource limits match declared values in manifests under k8s/base/
CHECK: HPA maxReplicas <= 5 for executor (BINDING — see CLAUDE.md)
CHECK: container image tag matches expected (no :latest drift from rebuild)
EXPECTED_SOURCE: k8s manifest files in k8s/base/agents/, k8s/base/system/
DRIFT_SEVERITY: HIGH if HPA maxReplicas > 5 (cost burn risk)
DRIFT_SEVERITY: MEDIUM if resource limits differ from manifest
DRIFT_SEVERITY: LOW if replica count temporarily differs (HPA scaling is normal)
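The severity rules above can be sketched as a pure comparison function. This is a minimal illustration, not the real implementation — the field names (`replicas`, `cpu_limit`, `mem_limit`, `hpa_max_replicas`) are assumed flattened keys, not the actual manifest schema.

```python
def classify_deployment_drift(actual: dict, expected: dict) -> list[tuple[str, str]]:
    """Return (severity, description) pairs for each detected drift."""
    findings = []
    # HIGH: HPA maxReplicas above the binding limit of 5 (cost burn risk)
    if actual.get("hpa_max_replicas", 0) > 5:
        findings.append(("HIGH", "HPA maxReplicas exceeds binding limit of 5"))
    # MEDIUM: resource limits differ from the declared manifest values
    for field in ("cpu_limit", "mem_limit"):
        if actual.get(field) != expected.get(field):
            findings.append(("MEDIUM", f"{field} differs from manifest"))
    # LOW: replica count differs (normal HPA scaling)
    if actual.get("replicas") != expected.get("replicas"):
        findings.append(("LOW", "replica count differs (HPA scaling is normal)"))
    return findings
```

The ACTUAL side would be fed from the `kubectl ... | jq` output above; the EXPECTED side from the parsed manifests in k8s/base/.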
REDIS_STREAM_HEALTH¶
TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: redis-cli -p 6381 -a $REDIS_PASSWORD INFO memory — check used_memory_rss
RUN: redis-cli -p 6381 -a $REDIS_PASSWORD XLEN ge:work:incoming
RUN: for each agent: redis-cli -p 6381 -a $REDIS_PASSWORD XLEN triggers.{agent}
CHECK: stream length exceeds MAXLEN thresholds (100/agent, 1000/system)
IF: exceeded THEN: CRITICAL — XADD without MAXLEN is active, find source immediately
CHECK: memory usage > 80% of allocated
IF: yes THEN: WARNING — streams may need trimming, check for abandoned consumers
EXPECTED_SOURCE: MAXLEN values defined in code (task-service.ts, ge-orchestrator/main.py)
DRIFT_SEVERITY: CRITICAL if MAXLEN violated (unbounded growth = OOM risk)
DRIFT_SEVERITY: HIGH if memory > 80%
DRIFT_SEVERITY: LOW if stream lengths elevated but within bounds
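A sketch of the threshold logic, assuming stream lengths and memory figures have already been fetched via the `redis-cli` commands above. Threshold values (100/agent, 1000/system, 80% memory) come from this document; the function shape is illustrative.

```python
def classify_stream_health(lengths: dict[str, int],
                           used_memory: int, max_memory: int) -> str:
    """Return the worst severity among stream-length and memory checks."""
    # CRITICAL: any stream over its MAXLEN bound => XADD without MAXLEN somewhere
    system_over = lengths.get("ge:work:incoming", 0) > 1000
    agent_over = any(length > 100 for name, length in lengths.items()
                     if name.startswith("triggers."))
    if system_over or agent_over:
        return "CRITICAL"
    # HIGH: memory above 80% of allocated
    if max_memory and used_memory / max_memory > 0.80:
        return "HIGH"
    # LOW: elevated but within bounds
    return "LOW"
```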
CRONJOB_INVENTORY¶
TOOL: kubectl
RUN: kubectl get cronjobs -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, schedule: .spec.schedule, suspended: .spec.suspend, lastSchedule: .status.lastScheduleTime}'
CHECK: all expected CronJobs exist and are not suspended
CHECK: no unexpected CronJobs have been added
CHECK: schedule matches declared values
CHECK: lastScheduleTime is recent (not silently failing)
EXPECTED_SOURCE: declared CronJob manifests in k8s/base/
DRIFT_SEVERITY: HIGH if CronJob suspended without record (silent failure)
DRIFT_SEVERITY: MEDIUM if schedule changed
DRIFT_SEVERITY: LOW if new CronJob added (may be intentional)
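The inventory comparison reduces to a two-way set diff keyed on CronJob name. A minimal sketch, assuming both sides are parsed into `{name: {schedule, suspended}}` dicts (the ACTUAL side from the `kubectl ... | jq` output above, the EXPECTED side from the k8s/base/ manifests):

```python
def classify_cronjob_drift(actual: dict[str, dict],
                           expected: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare live CronJobs against declared manifests."""
    findings = []
    for name, spec in expected.items():
        live = actual.get(name)
        # HIGH: declared CronJob missing or silently suspended
        if live is None or live.get("suspended"):
            findings.append(("HIGH", f"{name} missing or suspended"))
        # MEDIUM: schedule drifted from the declared value
        elif live.get("schedule") != spec.get("schedule"):
            findings.append(("MEDIUM", f"{name} schedule changed"))
    # LOW: CronJobs present in the cluster but not declared
    for name in actual.keys() - expected.keys():
        findings.append(("LOW", f"unexpected CronJob {name} (may be intentional)"))
    return findings
```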
FILE_PERMISSIONS¶
TOOL: stat, find
RUN: stat -c '%a %U %G %n' /home/claude/ge-bootstrap/config/*.yaml
RUN: stat -c '%a %U %G %n' /home/claude/ge-bootstrap/ge-ops/master/AGENT-REGISTRY.json
CHECK: config files are owned by claude:claude
CHECK: config files are not world-writable (others-write bit clear — no 666/777-style modes)
CHECK: AGENT-REGISTRY.json permissions haven't changed
CHECK: no setuid/setgid bits on any GE files
EXPECTED_SOURCE: initial deployment permissions (captured at install time)
DRIFT_SEVERITY: CRITICAL if world-writable config files (security breach vector)
DRIFT_SEVERITY: HIGH if ownership changed
DRIFT_SEVERITY: LOW if group permissions differ
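The permission checks above map directly onto mode bits. A sketch over a raw `st_mode`-style integer (as reported by `stat`), using the stdlib `stat` constants; the owner/group argument shape is illustrative:

```python
import stat

def classify_mode_drift(mode: int, owner: str, group: str) -> list[tuple[str, str]]:
    """Classify a file's permission bits per the severity rules above."""
    findings = []
    # CRITICAL: others-write bit set => world-writable config file
    if mode & stat.S_IWOTH:
        findings.append(("CRITICAL", "world-writable config file"))
    # CRITICAL: setuid/setgid bits must never appear on GE files
    if mode & (stat.S_ISUID | stat.S_ISGID):
        findings.append(("CRITICAL", "setuid/setgid bit set"))
    # HIGH: ownership drifted from claude:claude
    if (owner, group) != ("claude", "claude"):
        findings.append(("HIGH", "ownership changed"))
    return findings
```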
CONTAINER_IMAGE_VERSIONS¶
TOOL: kubectl, k3s
RUN: kubectl get pods -A -o json | jq '.items[] | {namespace: .metadata.namespace, pod: .metadata.name, containers: [.spec.containers[] | {name: .name, image: .image}]}'
RUN: sudo k3s ctr images ls | grep ge-bootstrap
CHECK: running image digest matches last built image
CHECK: no pods running stale images after a rollout
CHECK: imagePullPolicy is correct (Never for local builds, Always for registry)
EXPECTED_SOURCE: last build-executor.sh run timestamp vs pod creation timestamp
DRIFT_SEVERITY: HIGH if pod running image older than last build (stale deployment)
DRIFT_SEVERITY: MEDIUM if imagePullPolicy mismatch
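Staleness here is a timestamp comparison: a pod created before the last image build is running the old image. A minimal sketch over ISO-8601 strings (the `Z` suffix as emitted by `.metadata.creationTimestamp` is normalized for `fromisoformat` compatibility on older Pythons):

```python
from datetime import datetime

def pod_is_stale(pod_created: str, last_build: str) -> bool:
    """True if the pod predates the last image build (stale deployment)."""
    def parse(ts: str) -> datetime:
        # Normalize the trailing 'Z' so datetime.fromisoformat accepts it pre-3.11
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return parse(pod_created) < parse(last_build)
```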
REGISTRY_DB_SYNC¶
TOOL: psql or admin-ui API
RUN: compare AGENT-REGISTRY.json agent count and status vs agents table in PostgreSQL
CHECK: every agent in AGENT-REGISTRY.json exists in DB
CHECK: status field matches between JSON and DB
CHECK: provider/providerModel matches between JSON and DB
CHECK: no phantom agents in DB that don't exist in registry
EXPECTED_SOURCE: AGENT-REGISTRY.json is source of truth for agent definitions
DRIFT_SEVERITY: HIGH if agent status differs (agent may silently not receive work)
DRIFT_SEVERITY: MEDIUM if provider config differs (wrong LLM will be used)
DRIFT_SEVERITY: LOW if metadata fields differ
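A sketch of the sync check, treating AGENT-REGISTRY.json as the source of truth and the DB rows as the side under audit. Field names (`status`, `provider`) follow the checks above; the HIGH severity on phantom agents is an assumption, since the document assigns no explicit level to that case.

```python
def diff_registry_vs_db(registry: dict[str, dict],
                        db: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare agent definitions (SSOT) against the PostgreSQL agents table."""
    findings = []
    for agent, entry in registry.items():
        row = db.get(agent)
        if row is None:
            findings.append(("HIGH", f"{agent} missing from DB"))
            continue
        # HIGH: status mismatch => agent may silently not receive work
        if row.get("status") != entry.get("status"):
            findings.append(("HIGH", f"{agent} status mismatch"))
        # MEDIUM: provider mismatch => wrong LLM will be used
        if row.get("provider") != entry.get("provider"):
            findings.append(("MEDIUM", f"{agent} provider mismatch"))
    # Assumed HIGH: DB rows with no registry counterpart (phantom agents)
    for phantom in db.keys() - registry.keys():
        findings.append(("HIGH", f"phantom agent {phantom} in DB"))
    return findings
```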
DRIFT_DETECTION_METHODOLOGY¶
RULE: always compare ACTUAL vs EXPECTED — never rely on "looks right"
RULE: capture baseline snapshots on every deployment
RULE: run drift detection on schedule (every 15 minutes for critical, hourly for medium, daily for low)
TECHNIQUE: snapshot-and-diff
1. capture current state as JSON snapshot
2. compare against last known-good baseline
3. classify each difference by severity
4. store diff in session_learnings if drift detected
5. auto-remediate LOW severity (log only)
6. alert for MEDIUM severity (notify ron)
7. escalate for HIGH/CRITICAL (notify mira → human)
TOOL: kubectl diff -f k8s/base/ — compare declared manifests vs live state
NOTE: kubectl diff exits 0 if no diff, 1 if diff exists, >1 if error
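Steps 1-3 of snapshot-and-diff can be sketched as a generic dict diff plus a persisted baseline (per the rule below: baselines must survive agent restarts). The file-based persistence is illustrative; the real store could equally be the DB.

```python
import json
import pathlib

def save_baseline(snapshot: dict, path: str) -> None:
    """Persist a baseline to disk so a restart does not lose it."""
    pathlib.Path(path).write_text(json.dumps(snapshot, indent=2))

def load_baseline(path: str) -> dict:
    return json.loads(pathlib.Path(path).read_text())

def diff_snapshots(baseline: dict, current: dict) -> dict:
    """Return {key: (baseline_value, current_value)} for every differing key."""
    keys = baseline.keys() | current.keys()
    return {k: (baseline.get(k), current.get(k))
            for k in keys if baseline.get(k) != current.get(k)}
```

Each differing key would then be classified by severity (step 3) and routed per steps 5-7.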
ANTI_PATTERN: checking drift only when something breaks
FIX: scheduled drift detection catches problems before they cause incidents
ANTI_PATTERN: storing baseline snapshots in memory only
FIX: persist baselines to DB or filesystem — agent restarts should not lose baseline
DRIFT_SEVERITY_CLASSIFICATION¶
CRITICAL (security-related)¶
- file permissions allow unauthorized access
- secrets exposed in environment variables or logs
- RBAC roles changed without authorization
- network policies removed or weakened
- TLS certificates expired or downgraded
- container running as root when it shouldn't
HIGH (operational risk)¶
- HPA maxReplicas exceeded safe limit (5)
- MAXLEN not enforced on Redis streams
- cost_gate bypassed or disabled
- agent status changed to unavailable without record
- CronJob suspended without authorization
- container image stale after deployment
MEDIUM (resource/config)¶
- resource limits differ from manifest
- CronJob schedule changed
- provider config mismatch between registry and DB
- Redis memory usage > 80%
- pod restart count > 10 in last hour
LOW (cosmetic/informational)¶
- replica count differs (HPA normal operation)
- metadata field mismatch
- log level changed
- label/annotation differences
SYSTEM_INTEGRITY:COMPLIANCE_MAPPING¶
ISO_27001_A_8_9 (Configuration Management)¶
REQUIREMENT: changes to configurations shall be authorized, documented, and verified
MAPPING:
- authorized: git commit history shows who changed what
- documented: AGENT-REGISTRY.json changes tracked in git
- verified: drift detection confirms actual matches declared
CHECK: every config change has a corresponding git commit
IF: config drift detected without git commit THEN: UNAUTHORIZED CHANGE — escalate immediately
DRIFT_SEVERITY: CRITICAL
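The authorization check reduces to a set difference: drifted config files with no corresponding commit in recent git history are unauthorized. A minimal sketch, assuming both sides arrive as file-path sets (e.g. from drift detection and `git log --name-only`):

```python
def find_unauthorized_changes(drifted: set[str],
                              committed: set[str]) -> list[tuple[str, str]]:
    """Drifted files with no matching git commit => unauthorized change, CRITICAL."""
    return [("CRITICAL", f"unauthorized change: {path}")
            for path in sorted(drifted - committed)]
```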
SOC_2_CC8_1 (Change Management)¶
REQUIREMENT: changes to infrastructure and software are authorized and managed
MAPPING:
- ge-orchestrator/main.py enforces work routing — no ad-hoc task dispatch
- cost_gate.py enforces execution budgets — no unlimited spending
- container image rebuild is the only deploy path — no kubectl cp
- AGENT-REGISTRY.json is the agent SSOT — no phantom agents
CHECK: every running process is traceable to a managed deployment (no ad-hoc workloads)
RUN: kubectl get pods -A --field-selector=status.phase=Running -o json | jq '.items[] | {name: .metadata.name, image: .spec.containers[0].image, created: .metadata.creationTimestamp}'
IF: pod running image not matching any known build THEN: unauthorized deployment
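The traceability check is a membership test of running images against the set of known builds. A sketch, assuming pod-to-image pairs from the `kubectl ... | jq` output above and a known-image set from the build pipeline:

```python
def unauthorized_pods(running: dict[str, str],
                      known_images: set[str]) -> list[str]:
    """Pods whose image matches no known build => unauthorized deployment."""
    return [pod for pod, image in running.items() if image not in known_images]
```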
SOC_2_CC7_2 (Monitoring)¶
REQUIREMENT: anomalies detected and evaluated
MAPPING:
- cost_gate.py monitors per-session, per-agent, and daily costs
- hook loop prevention detects recursive trigger patterns
- stream length monitoring detects unbounded growth
- drift detection catches config changes
RULE: all anomalies must be logged to session_learnings table
RULE: CRITICAL and HIGH anomalies must be escalated to mira within 5 minutes
RULE: maintain audit trail of all drift detections and resolutions
SYSTEM_INTEGRITY:UNAUTHORIZED_CHANGE_DETECTION¶
DETECTION_TECHNIQUES¶
TECHNIQUE: filesystem integrity monitoring
1. compute sha256 checksums of critical config files
2. store checksums in DB as baseline
3. periodically recompute and compare
4. any mismatch = potential unauthorized change
CRITICAL_FILES to checksum:
- config/ports.yaml
- config/dolly-routing.yaml
- config/agent-execution.yaml
- config/constitution.md
- config/post-completion-hooks.yaml
- ge-ops/master/AGENT-REGISTRY.json
TOOL: sha256sum
RUN: sha256sum /home/claude/ge-bootstrap/config/ports.yaml
TECHNIQUE: git-based change detection
RUN: git status --porcelain -- config/ — any uncommitted changes to config files
RUN: git log --since="1 hour ago" --name-only -- config/ — recent config changes
IF: uncommitted config changes exist THEN: either commit or investigate
TECHNIQUE: k8s audit logging
RUN: kubectl get events -A --sort-by=.lastTimestamp | head -50
CHECK: unexpected DELETE, PATCH, or UPDATE events on critical resources
ANTI_PATTERN: running integrity checks from inside the system being monitored
FIX: use independent baseline (git history) as ground truth, not self-reported state
ANTI_PATTERN: alerting on every change without context
FIX: correlate with git commits and deployment events — authorized changes have a paper trail
RESPONSE_PROTOCOL¶
ON_CRITICAL_DRIFT:
1. capture full state snapshot (kubectl, redis-cli, checksums)
2. log to session_learnings with scope:system and severity:critical
3. escalate to mira via admin-ui API discussion
4. IF auto-remediation is safe (e.g., restore file from git) THEN: remediate and log
5. ELSE: HALT affected component, wait for human decision
ON_HIGH_DRIFT:
1. log to session_learnings
2. notify ron (self) for investigation
3. IF root cause identified within 15 minutes THEN: remediate
4. ELSE: escalate to mira
ON_MEDIUM_DRIFT:
1. log to session_learnings
2. add to daily drift report
3. remediate in next maintenance window
ON_LOW_DRIFT:
1. log only
2. include in weekly summary
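The protocol above is a severity-to-action dispatch table. A sketch with illustrative action labels; the fallback of treating an unknown severity as LOW is an assumption, not stated policy.

```python
# Action labels are shorthand for the numbered steps in the protocol above.
ON_DRIFT = {
    "CRITICAL": ["capture_snapshot", "log_session_learnings",
                 "escalate_mira", "remediate_or_halt"],
    "HIGH": ["log_session_learnings", "notify_ron", "remediate_or_escalate"],
    "MEDIUM": ["log_session_learnings", "daily_drift_report", "maintenance_window"],
    "LOW": ["log_only", "weekly_summary"],
}

def actions_for(severity: str) -> list:
    # Assumption: unrecognized severities fall back to LOW (log only)
    return ON_DRIFT.get(severity.upper(), ON_DRIFT["LOW"])
```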
SYSTEM_INTEGRITY:TOOLS_AND_COMMANDS¶
K8S_INSPECTION¶
TOOL: kubectl
RUN: kubectl diff -f k8s/base/ — declared vs live state
RUN: kubectl get pods -A -o wide — all pods with node info
RUN: kubectl top pods -A — actual resource usage
RUN: kubectl get events -A --sort-by=.lastTimestamp — recent events
RUN: kubectl get networkpolicies -A — network isolation rules
REDIS_INSPECTION¶
TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: INFO memory — memory usage
RUN: INFO clients — connected clients
RUN: XINFO STREAM ge:work:incoming — system stream info
RUN: DBSIZE — total key count
RUN: SLOWLOG GET 10 — recent slow commands
FILE_INTEGRITY¶
TOOL: sha256sum, stat, git
RUN: sha256sum config/*.yaml config/*.md — config checksums
RUN: git diff -- config/ — uncommitted config changes
RUN: git log --oneline -10 -- config/ — recent config history
HEALTH_AGGREGATION¶
TOOL: scripts/k8s-health-dump.sh
PURPOSE: host cron runs this to dump cluster health to public/k8s-health.json
NOTE: used by admin-ui infrastructure page (avoids broken ClusterIP from inside pods)
RUN: bash scripts/k8s-health-dump.sh