DOMAIN:SYSTEM_INTEGRITY¶
OWNER: ron
UPDATED: 2026-03-24
SCOPE: configuration drift detection, system state verification, compliance monitoring
AGENTS: ron (System Integrity Monitor), mira (Escalation Manager)
SYSTEM_INTEGRITY:CONFIGURATION_DRIFT_DETECTION¶
PURPOSE: detect unauthorized or accidental changes to system configuration by comparing actual state vs declared state
WHAT_TO_CHECK¶
RULE: every check compares ACTUAL (runtime state) vs EXPECTED (config file / DB / manifest)
K8S_RESOURCE_LIMITS¶
TOOL: kubectl
RUN: kubectl get deployments -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, replicas: .spec.replicas, cpu_limit: .spec.template.spec.containers[0].resources.limits.cpu, mem_limit: .spec.template.spec.containers[0].resources.limits.memory}'
CHECK: deployment replicas match expected count
CHECK: resource limits match declared values in manifests under k8s/base/
CHECK: HPA maxReplicas <= 5 for executor (BINDING — see CLAUDE.md)
CHECK: container image tag matches expected (no :latest drift from rebuild)
EXPECTED_SOURCE: k8s manifest files in k8s/base/agents/, k8s/base/system/
DRIFT_SEVERITY: HIGH if HPA maxReplicas > 5 (cost burn risk)
DRIFT_SEVERITY: MEDIUM if resource limits differ from manifest
DRIFT_SEVERITY: LOW if replica count temporarily differs (HPA scaling is normal)
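The severity rules above can be sketched as a pure comparison function. This is a minimal illustration, not the real implementation — the field names (`replicas`, `cpu_limit`, `mem_limit`, `hpa_max_replicas`) are assumed flattened keys, not the actual manifest schema.

```python
def classify_deployment_drift(actual: dict, expected: dict) -> list[tuple[str, str]]:
    """Return (severity, description) pairs for each detected drift."""
    findings = []
    # HIGH: HPA maxReplicas above the binding limit of 5 (cost burn risk)
    if actual.get("hpa_max_replicas", 0) > 5:
        findings.append(("HIGH", "HPA maxReplicas exceeds binding limit of 5"))
    # MEDIUM: resource limits differ from the declared manifest values
    for field in ("cpu_limit", "mem_limit"):
        if actual.get(field) != expected.get(field):
            findings.append(("MEDIUM", f"{field} differs from manifest"))
    # LOW: replica count differs (normal HPA scaling)
    if actual.get("replicas") != expected.get("replicas"):
        findings.append(("LOW", "replica count differs (HPA scaling is normal)"))
    return findings
```

The ACTUAL side would be fed from the `kubectl ... | jq` output above; the EXPECTED side from the parsed manifests in k8s/base/.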
REDIS_STREAM_HEALTH¶
TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: redis-cli -p 6381 -a $REDIS_PASSWORD INFO memory — check used_memory_rss
RUN: redis-cli -p 6381 -a $REDIS_PASSWORD XLEN ge:work:incoming
RUN: for each agent: redis-cli -p 6381 -a $REDIS_PASSWORD XLEN triggers.{agent}
CHECK: stream length exceeds MAXLEN thresholds (100/agent, 1000/system)
IF: exceeded THEN: CRITICAL — XADD without MAXLEN is active, find source immediately
CHECK: memory usage > 80% of allocated
IF: yes THEN: WARNING — streams may need trimming, check for abandoned consumers
EXPECTED_SOURCE: MAXLEN values defined in code (task-service.ts, ge-orchestrator/main.py)
DRIFT_SEVERITY: CRITICAL if MAXLEN violated (unbounded growth = OOM risk)
DRIFT_SEVERITY: HIGH if memory > 80%
DRIFT_SEVERITY: LOW if stream lengths elevated but within bounds
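A sketch of the threshold logic, assuming stream lengths and memory figures have already been fetched via the `redis-cli` commands above. Threshold values (100/agent, 1000/system, 80% memory) come from this document; the function shape is illustrative.

```python
def classify_stream_health(lengths: dict[str, int],
                           used_memory: int, max_memory: int) -> str:
    """Return the worst severity among stream-length and memory checks."""
    # CRITICAL: any stream over its MAXLEN bound => XADD without MAXLEN somewhere
    system_over = lengths.get("ge:work:incoming", 0) > 1000
    agent_over = any(length > 100 for name, length in lengths.items()
                     if name.startswith("triggers."))
    if system_over or agent_over:
        return "CRITICAL"
    # HIGH: memory above 80% of allocated
    if max_memory and used_memory / max_memory > 0.80:
        return "HIGH"
    # LOW: elevated but within bounds
    return "LOW"
```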
CRONJOB_INVENTORY¶
TOOL: kubectl
RUN: kubectl get cronjobs -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, schedule: .spec.schedule, suspended: .spec.suspend, lastSchedule: .status.lastScheduleTime}'
CHECK: all expected CronJobs exist and are not suspended
CHECK: no unexpected CronJobs have been added
CHECK: schedule matches declared values
CHECK: lastScheduleTime is recent (not silently failing)
EXPECTED_SOURCE: declared CronJob manifests in k8s/base/
DRIFT_SEVERITY: HIGH if CronJob suspended without record (silent failure)
DRIFT_SEVERITY: MEDIUM if schedule changed
DRIFT_SEVERITY: LOW if new CronJob added (may be intentional)
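The inventory comparison reduces to a two-way set diff keyed on CronJob name. A minimal sketch, assuming both sides are parsed into `{name: {schedule, suspended}}` dicts (the ACTUAL side from the `kubectl ... | jq` output above, the EXPECTED side from the k8s/base/ manifests):

```python
def classify_cronjob_drift(actual: dict[str, dict],
                           expected: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare live CronJobs against declared manifests."""
    findings = []
    for name, spec in expected.items():
        live = actual.get(name)
        # HIGH: declared CronJob missing or silently suspended
        if live is None or live.get("suspended"):
            findings.append(("HIGH", f"{name} missing or suspended"))
        # MEDIUM: schedule drifted from the declared value
        elif live.get("schedule") != spec.get("schedule"):
            findings.append(("MEDIUM", f"{name} schedule changed"))
    # LOW: CronJobs present in the cluster but not declared
    for name in actual.keys() - expected.keys():
        findings.append(("LOW", f"unexpected CronJob {name} (may be intentional)"))
    return findings
```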
FILE_PERMISSIONS¶
TOOL: stat, find
RUN: stat -c '%a %U %G %n' /home/claude/ge-bootstrap/config/*.yaml
RUN: stat -c '%a %U %G %n' /home/claude/ge-bootstrap/ge-ops/master/AGENT-REGISTRY.json
CHECK: config files are owned by claude:claude
CHECK: config files are not world-writable (others-write bit clear — no 666/777-style modes)
CHECK: AGENT-REGISTRY.json permissions haven't changed
CHECK: no setuid/setgid bits on any GE files
EXPECTED_SOURCE: initial deployment permissions (captured at install time)
DRIFT_SEVERITY: CRITICAL if world-writable config files (security breach vector)
DRIFT_SEVERITY: HIGH if ownership changed
DRIFT_SEVERITY: LOW if group permissions differ
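The permission checks above map directly onto mode bits. A sketch over a raw `st_mode`-style integer (as reported by `stat`), using the stdlib `stat` constants; the owner/group argument shape is illustrative:

```python
import stat

def classify_mode_drift(mode: int, owner: str, group: str) -> list[tuple[str, str]]:
    """Classify a file's permission bits per the severity rules above."""
    findings = []
    # CRITICAL: others-write bit set => world-writable config file
    if mode & stat.S_IWOTH:
        findings.append(("CRITICAL", "world-writable config file"))
    # CRITICAL: setuid/setgid bits must never appear on GE files
    if mode & (stat.S_ISUID | stat.S_ISGID):
        findings.append(("CRITICAL", "setuid/setgid bit set"))
    # HIGH: ownership drifted from claude:claude
    if (owner, group) != ("claude", "claude"):
        findings.append(("HIGH", "ownership changed"))
    return findings
```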
CONTAINER_IMAGE_VERSIONS¶
TOOL: kubectl, k3s
RUN: kubectl get pods -A -o json | jq '.items[] | {namespace: .metadata.namespace, pod: .metadata.name, containers: [.spec.containers[] | {name: .name, image: .image}]}'
RUN: sudo k3s ctr images ls | grep ge-bootstrap
CHECK: running image digest matches last built image
CHECK: no pods running stale images after a rollout
CHECK: imagePullPolicy is correct (Never for local builds, Always for registry)
EXPECTED_SOURCE: last build-executor.sh run timestamp vs pod creation timestamp
DRIFT_SEVERITY: HIGH if pod running image older than last build (stale deployment)
DRIFT_SEVERITY: MEDIUM if imagePullPolicy mismatch
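Staleness here is a timestamp comparison: a pod created before the last image build is running the old image. A minimal sketch over ISO-8601 strings (the `Z` suffix as emitted by `.metadata.creationTimestamp` is normalized for `fromisoformat` compatibility on older Pythons):

```python
from datetime import datetime

def pod_is_stale(pod_created: str, last_build: str) -> bool:
    """True if the pod predates the last image build (stale deployment)."""
    def parse(ts: str) -> datetime:
        # Normalize the trailing 'Z' so datetime.fromisoformat accepts it pre-3.11
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return parse(pod_created) < parse(last_build)
```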
REGISTRY_DB_SYNC¶
TOOL: psql or admin-ui API
RUN: compare AGENT-REGISTRY.json agent count and status vs agents table in PostgreSQL
CHECK: every agent in AGENT-REGISTRY.json exists in DB
CHECK: status field matches between JSON and DB
CHECK: provider/providerModel matches between JSON and DB
CHECK: no phantom agents in DB that don't exist in registry
EXPECTED_SOURCE: AGENT-REGISTRY.json is source of truth for agent definitions
DRIFT_SEVERITY: HIGH if agent status differs (agent may silently not receive work)
DRIFT_SEVERITY: MEDIUM if provider config differs (wrong LLM will be used)
DRIFT_SEVERITY: LOW if metadata fields differ
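A sketch of the sync check, treating AGENT-REGISTRY.json as the source of truth and the DB rows as the side under audit. Field names (`status`, `provider`) follow the checks above; the HIGH severity on phantom agents is an assumption, since the document assigns no explicit level to that case.

```python
def diff_registry_vs_db(registry: dict[str, dict],
                        db: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare agent definitions (SSOT) against the PostgreSQL agents table."""
    findings = []
    for agent, entry in registry.items():
        row = db.get(agent)
        if row is None:
            findings.append(("HIGH", f"{agent} missing from DB"))
            continue
        # HIGH: status mismatch => agent may silently not receive work
        if row.get("status") != entry.get("status"):
            findings.append(("HIGH", f"{agent} status mismatch"))
        # MEDIUM: provider mismatch => wrong LLM will be used
        if row.get("provider") != entry.get("provider"):
            findings.append(("MEDIUM", f"{agent} provider mismatch"))
    # Assumed HIGH: DB rows with no registry counterpart (phantom agents)
    for phantom in db.keys() - registry.keys():
        findings.append(("HIGH", f"phantom agent {phantom} in DB"))
    return findings
```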
DRIFT_DETECTION_METHODOLOGY¶
RULE: always compare ACTUAL vs EXPECTED — never rely on "looks right"
RULE: capture baseline snapshots on every deployment
RULE: run drift detection on schedule (every 15 minutes for critical, hourly for medium, daily for low)
TECHNIQUE: snapshot-and-diff
1. capture current state as JSON snapshot
2. compare against last known-good baseline
3. classify each difference by severity
4. store diff in session_learnings if drift detected
5. auto-remediate LOW severity (log only)
6. alert for MEDIUM severity (notify ron)
7. escalate for HIGH/CRITICAL (notify mira → human)
TOOL: kubectl diff -f k8s/base/ — compare declared manifests vs live state
NOTE: kubectl diff exits 0 if no diff, 1 if diff exists, >1 if error
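Steps 1-3 of snapshot-and-diff can be sketched as a generic dict diff plus a persisted baseline (per the rule below: baselines must survive agent restarts). The file-based persistence is illustrative; the real store could equally be the DB.

```python
import json
import pathlib

def save_baseline(snapshot: dict, path: str) -> None:
    """Persist a baseline to disk so a restart does not lose it."""
    pathlib.Path(path).write_text(json.dumps(snapshot, indent=2))

def load_baseline(path: str) -> dict:
    return json.loads(pathlib.Path(path).read_text())

def diff_snapshots(baseline: dict, current: dict) -> dict:
    """Return {key: (baseline_value, current_value)} for every differing key."""
    keys = baseline.keys() | current.keys()
    return {k: (baseline.get(k), current.get(k))
            for k in keys if baseline.get(k) != current.get(k)}
```

Each differing key would then be classified by severity (step 3) and routed per steps 5-7.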
ANTI_PATTERN: checking drift only when something breaks
FIX: scheduled drift detection catches problems before they cause incidents
ANTI_PATTERN: storing baseline snapshots in memory only
FIX: persist baselines to DB or filesystem — agent restarts should not lose baseline
DRIFT_SEVERITY_CLASSIFICATION¶
CRITICAL (security-related)¶
- file permissions allow unauthorized access
- secrets exposed in environment variables or logs
- RBAC roles changed without authorization
- network policies removed or weakened
- TLS certificates expired or downgraded
- container running as root when it shouldn't
HIGH (operational risk)¶
- HPA maxReplicas exceeded safe limit (5)
- MAXLEN not enforced on Redis streams
- cost_gate bypassed or disabled
- agent status changed to unavailable without record
- CronJob suspended without authorization
- container image stale after deployment
MEDIUM (resource/config)¶
- resource limits differ from manifest
- CronJob schedule changed
- provider config mismatch between registry and DB
- Redis memory usage > 80%
- pod restart count > 10 in last hour
LOW (cosmetic/informational)¶
- replica count differs (HPA normal operation)
- metadata field mismatch
- log level changed
- label/annotation differences
SYSTEM_INTEGRITY:COMPLIANCE_MAPPING¶
ISO_27001_A_8_9 (Configuration Management)¶
REQUIREMENT: changes to configurations shall be authorized, documented, and verified
MAPPING:
- authorized: git commit history shows who changed what
- documented: AGENT-REGISTRY.json changes tracked in git
- verified: drift detection confirms actual matches declared
CHECK: every config change has a corresponding git commit
IF: config drift detected without git commit THEN: UNAUTHORIZED CHANGE — escalate immediately
DRIFT_SEVERITY: CRITICAL
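The authorization check reduces to a set difference: drifted config files with no corresponding commit in recent git history are unauthorized. A minimal sketch, assuming both sides arrive as file-path sets (e.g. from drift detection and `git log --name-only`):

```python
def find_unauthorized_changes(drifted: set[str],
                              committed: set[str]) -> list[tuple[str, str]]:
    """Drifted files with no matching git commit => unauthorized change, CRITICAL."""
    return [("CRITICAL", f"unauthorized change: {path}")
            for path in sorted(drifted - committed)]
```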
SOC_2_CC8_1 (Change Management)¶
REQUIREMENT: changes to infrastructure and software are authorized and managed
MAPPING:
- ge-orchestrator/main.py enforces work routing — no ad-hoc task dispatch
- cost_gate.py enforces execution budgets — no unlimited spending
- container image rebuild is the only deploy path — no kubectl cp
- AGENT-REGISTRY.json is the agent SSOT — no phantom agents
CHECK: every running process is traceable to a managed deployment (no ad-hoc workloads)
RUN: kubectl get pods -A --field-selector=status.phase=Running -o json | jq '.items[] | {name: .metadata.name, image: .spec.containers[0].image, created: .metadata.creationTimestamp}'
IF: pod running image not matching any known build THEN: unauthorized deployment
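The traceability check is a membership test of running images against the set of known builds. A sketch, assuming pod-to-image pairs from the `kubectl ... | jq` output above and a known-image set from the build pipeline:

```python
def unauthorized_pods(running: dict[str, str],
                      known_images: set[str]) -> list[str]:
    """Pods whose image matches no known build => unauthorized deployment."""
    return [pod for pod, image in running.items() if image not in known_images]
```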
SOC_2_CC7_2 (Monitoring)¶
REQUIREMENT: anomalies detected and evaluated
MAPPING:
- cost_gate.py monitors per-session, per-agent, and daily costs
- hook loop prevention detects recursive trigger patterns
- stream length monitoring detects unbounded growth
- drift detection catches config changes
RULE: all anomalies must be logged to session_learnings table
RULE: CRITICAL and HIGH anomalies must be escalated to mira within 5 minutes
RULE: maintain audit trail of all drift detections and resolutions
SYSTEM_INTEGRITY:UNAUTHORIZED_CHANGE_DETECTION¶
DETECTION_TECHNIQUES¶
TECHNIQUE: filesystem integrity monitoring
1. compute sha256 checksums of critical config files
2. store checksums in DB as baseline
3. periodically recompute and compare
4. any mismatch = potential unauthorized change
CRITICAL_FILES to checksum:
- config/ports.yaml
- config/dolly-routing.yaml
- config/agent-execution.yaml
- config/constitution.md
- config/post-completion-hooks.yaml
- ge-ops/master/AGENT-REGISTRY.json
TOOL: sha256sum
RUN: sha256sum /home/claude/ge-bootstrap/config/ports.yaml
TECHNIQUE: git-based change detection
RUN: git status --porcelain -- config/ — any uncommitted changes to config files
RUN: git log --since="1 hour ago" --name-only -- config/ — recent config changes
IF: uncommitted config changes exist THEN: either commit or investigate
TECHNIQUE: k8s audit logging
RUN: kubectl get events -A --sort-by=.lastTimestamp | head -50
CHECK: unexpected DELETE, PATCH, or UPDATE events on critical resources
ANTI_PATTERN: running integrity checks from inside the system being monitored
FIX: use independent baseline (git history) as ground truth, not self-reported state
ANTI_PATTERN: alerting on every change without context
FIX: correlate with git commits and deployment events — authorized changes have a paper trail
RESPONSE_PROTOCOL¶
ON_CRITICAL_DRIFT:
1. capture full state snapshot (kubectl, redis-cli, checksums)
2. log to session_learnings with scope:system and severity:critical
3. escalate to mira via admin-ui API discussion
4. IF auto-remediation is safe (e.g., restore file from git) THEN: remediate and log
5. ELSE: HALT affected component, wait for human decision
ON_HIGH_DRIFT:
1. log to session_learnings
2. notify ron (self) for investigation
3. IF root cause identified within 15 minutes THEN: remediate
4. ELSE: escalate to mira
ON_MEDIUM_DRIFT:
1. log to session_learnings
2. add to daily drift report
3. remediate in next maintenance window
ON_LOW_DRIFT:
1. log only
2. include in weekly summary
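The protocol above is a severity-to-action dispatch table. A sketch with illustrative action labels; the fallback of treating an unknown severity as LOW is an assumption, not stated policy.

```python
# Action labels are shorthand for the numbered steps in the protocol above.
ON_DRIFT = {
    "CRITICAL": ["capture_snapshot", "log_session_learnings",
                 "escalate_mira", "remediate_or_halt"],
    "HIGH": ["log_session_learnings", "notify_ron", "remediate_or_escalate"],
    "MEDIUM": ["log_session_learnings", "daily_drift_report", "maintenance_window"],
    "LOW": ["log_only", "weekly_summary"],
}

def actions_for(severity: str) -> list:
    # Assumption: unrecognized severities fall back to LOW (log only)
    return ON_DRIFT.get(severity.upper(), ON_DRIFT["LOW"])
```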
SYSTEM_INTEGRITY:TOOLS_AND_COMMANDS¶
K8S_INSPECTION¶
TOOL: kubectl
RUN: kubectl diff -f k8s/base/ — declared vs live state
RUN: kubectl get pods -A -o wide — all pods with node info
RUN: kubectl top pods -A — actual resource usage
RUN: kubectl get events -A --sort-by=.lastTimestamp — recent events
RUN: kubectl get networkpolicies -A — network isolation rules
REDIS_INSPECTION¶
TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: INFO memory — memory usage
RUN: INFO clients — connected clients
RUN: XINFO STREAM ge:work:incoming — system stream info
RUN: DBSIZE — total key count
RUN: SLOWLOG GET 10 — recent slow commands
FILE_INTEGRITY¶
TOOL: sha256sum, stat, git
RUN: sha256sum config/*.yaml config/*.md — config checksums
RUN: git diff -- config/ — uncommitted config changes
RUN: git log --oneline -10 -- config/ — recent config history
HEALTH_AGGREGATION¶
TOOL: scripts/k8s-health-dump.sh
PURPOSE: host cron runs this to dump cluster health to public/k8s-health.json
NOTE: used by admin-ui infrastructure page (avoids broken ClusterIP from inside pods)
RUN: bash scripts/k8s-health-dump.sh