Session Handoff — 2026-04-21 (Governance marathon — 7 MRs + F-a deployment + probe-before-classify pitfall named)

Landed on branches (awaiting DJ merge)

7 MRs open, stacked 4-deep with 3 sibling branches:

| MR | Target | Title | Commits |
| --- | --- | --- | --- |
| !28 | main | docs(governance): 2026-04-20 follow-ups — cutover + INC-A + F-ARGOCD | 8 |
| !29 | main | docs(evidence): F-2026-0421 — backup bucket name mismatch (MEDIUM, latent) | 2 |
| !30 | main | chore(infra): F-a k3s state.db snapshots — AND-gate-2 unblocker | 1 |
| !31 | main | chore(bridge-removal): remove pg-emergency-dump — INC-20260418-B closure | 2 |
| !32 | !28 branch | docs(evidence): INC-A HIGH→MEDIUM + F-gitlab-bridge-silent-degrade | 2 |
| !33 | !32 branch | docs(governance): resolve 2026-04-21 DJ decisions (A/B/C/D/E/F) | 2 |
| !34 | !33 branch | chore(registry): add generic-executor identity — DEC-20260421-E.2 | 1 |

Stack order for merging: !28 → !32 → !33 → !34 (parent-before-child); !29, !30, !31 independent.
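
The parent-before-child rule can be checked mechanically before merging. The following is a minimal sketch, not part of the GE tooling: it copies the MR-to-target mapping from the table above and uses Python's standard `graphlib` to produce one valid merge order. The names `TARGETS` and `merge_order` are illustrative assumptions.

```python
from graphlib import TopologicalSorter

# MR -> target, copied from the table above. "main" means no parent MR.
TARGETS = {
    "!28": "main",
    "!29": "main",
    "!30": "main",
    "!31": "main",
    "!32": "!28",
    "!33": "!32",
    "!34": "!33",
}


def merge_order(targets: dict[str, str]) -> list[str]:
    """Return the MRs ordered so that every parent MR precedes its children."""
    sorter = TopologicalSorter()
    for mr, target in targets.items():
        if target in targets:
            # The MR targets another MR's branch, so that MR must merge first.
            sorter.add(mr, target)
        else:
            # The MR targets main and has no ordering constraint.
            sorter.add(mr)
    return list(sorter.static_order())


if __name__ == "__main__":
    print(" -> ".join(merge_order(TARGETS)))
```

The position of the independent MRs (!29, !30, !31) in the result is arbitrary; only the relative order of the stacked chain is constrained.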

Executor actions post-merge (α-authorized, awaiting merge)

  • MR !31 Phase 1: kubectl -n ge-gitlab delete cronjob pg-emergency-dump (trivially reversible)
  • MR !31 Phase 2 (+24h): kubectl -n ge-gitlab delete pvc pg-emergency-dump (loses 14-day pg_dumpall history; redundant vs now-working chart-installed backup)
  • MR !32 annotation purge: kubectl -n ge-agents annotate secret gitlab-ci-bridge-token kubectl.kubernetes.io/last-applied-configuration- (closes CLAUDE.md §Vault compliance gap; token value has been inert since 2026-04-02)

All three were routed through the α/β/γ pattern in their respective MR bodies; the reviewer ACK'd α for all three.

Infrastructure deltas live (executed this session)

  • UpCloud Object Storage 2.0 service ge-k3s-state-snapshots (uuid 12ff6556-ee75-41fa-ab7d-9c92c411ef82, region europe-1, endpoint vtilc.upcloudobjects.com) with scoped IAM user k3s-snapshot + policy + bucket ge-k3s-state-snapshots. New external dependency.
  • Vault path secret/ge/k3s-snapshots (v2) — new credentials + metadata entry for F-a cron
  • K8s CronJob ge-system/k3s-state-db-snapshot + NetworkPolicy k3s-state-db-snapshot-egress + Secret k3s-snapshot-upcloud-creds — live; first snapshot (10.3 MB gzipped from 198 MB raw state.db) verified 2026-04-21 08:39:34 UTC
  • Governance PAT rotation: secret/ge/gitlab-pat v3; old PAT id=5 revoked; new PAT id=7 active with 2026-07-20 expiry
  • AGENT-REGISTRY.json gains generic-executor entry (via MR !34, awaiting merge)

Incidents + findings + decisions (this session)

| Type | ID | Status |
| --- | --- | --- |
| INC | INC-20260420-A | Reclassified HIGH→MEDIUM 2026-04-21 (probe-verified inert: token expired 2026-04-02, not live as initially classified) |
| F | F-2026-0421-BACKUP-BUCKET-NAME-MISMATCH | MEDIUM, latent; surfaced during gate-3 cold-read |
| F | F-2026-0421-GITLAB-BRIDGE-SILENT-DEGRADATION | MEDIUM; names the probe-before-classify pitfall |
| F | F-2026-0420-ARGOCD-AUTO-ROTATE | MEDIUM governance; routing-commitment violation |
| EVO | EVO-2026-0420-004 | HIGH operational; primary-worktree drift stub |
| EVO | EVO-2026-0420-005 + amendment | Attribution cutover executed (Executor→Abby split) |
| DEC | DEC-20260420-attribution-cutover | Filed |
| DEC | DEC-20260421-A/B/C/D/E/F | 6 DECs ratified via DJ delegation to reviewer 2026-04-21 |

Pitfalls added today

  • pitfalls/probe-before-classify.md — load-bearing new pattern; 3 instances documented (2 reviewer-side + 1 executor-side) all corrected via primary-source verification
  • pitfalls/dec-filename-source-letter.md — micro-precedent (≥3 DECs from graduated brief → include letter in filename)
  • pitfalls/agent-registry-canonical-diff.md — micro-precedent (jq canonical-diff backstop for registry amendments)
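
The DEC filename micro-precedent above can be expressed as a small helper. This is a sketch under assumptions: the lettered form `DEC-YYYYMMDD-<LETTER>-<slug>.md` comes from the Key Learnings entry, while the unlettered form is inferred from existing IDs such as DEC-20260420-attribution-cutover. The function name, its parameters, and the slug in the usage note are hypothetical.

```python
from datetime import date
from typing import Optional

# ">=3 DECs from one graduated brief" is the threshold named in the pitfall.
LETTER_THRESHOLD = 3


def dec_filename(day: date, slug: str, letter: Optional[str] = None,
                 decs_from_brief: int = 1) -> str:
    """Build a DEC filename, adding the source-bullet letter when required."""
    stamp = day.strftime("%Y%m%d")
    if decs_from_brief >= LETTER_THRESHOLD:
        if not letter:
            raise ValueError("source-bullet letter required for >=3 same-day DECs")
        return f"DEC-{stamp}-{letter.upper()}-{slug}.md"
    # Assumed unlettered form, inferred from existing DEC IDs.
    return f"DEC-{stamp}-{slug}.md"
```

For example, `dec_filename(date(2026, 4, 21), "example-slug", "e", decs_from_brief=6)` yields a lettered name, and omitting the letter in that case raises an error rather than silently producing an untraceable filename.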

Bridge-removal saga — CLOSED (this session)

74-day zero-backup gap (INC-20260418-B, started 2026-02-03) closed end-to-end:

  • Chart-installed backup working (2026-04-18 via EVO-001)
  • Canary-verify V1-V5 PASS (2026-04-20 via MR !27)
  • Gate-3 cold-read of the 2026-04-21 00:04 UTC scheduled cycle: 5.36 GB tar, Complete
  • AND-gate-2 satisfied by F-a deployment (MR !30) after executor pushback on the reviewer's 3-gate cold-read (which omitted AND-gate-2)
  • Bridge removal MR !31 awaiting merge; phased cluster-side delete α-authorized

Top of next session

  1. Confirm DJ merge chain: !28 → !32 → !33 → !34 in order, plus !29, !30, !31 independently
  2. Execute α-authorized cluster actions post each relevant merge — paste kubectl outputs in respective MR threads
  3. First post-merge health check — k3s snapshot cron still firing 02:30 UTC daily; second snapshot landing in UpCloud bucket
  4. EVO-2026-0420-002 Track A — scoped ci-bot for ArgoCD (absorbs F-c Vault-CSI-Secret-offload + ge-orchestrator CI-bridge service-restoration)
  5. DEC-20260421-D execution — ARGOCD prior-session spot-check (3 commits: first/middle/last of 8-commit stack on MR !27)
  6. F-b Velero — this sprint per DEC-20260421-F timing

Carry-forward flags (next session pick-list)

  • New decisions-sought bullet (future brief): GitLab user provisioning for DJ (closes F-2026-0420-ARGOCD-AUTO-ROTATE Signal 2 and EVO-2026-0420-005 Parallel-signal gaps) + label taxonomy (MR !28/!29/etc have no labels because no taxonomy exists)
  • fort-knox-dev Vault secondary — operational-state investigation (30-min inspection; is it real, aspirational, or stale?)
  • Bunny Edge Storage — as second off-platform backup destination (EVO candidate: backup-destination-diversification; once F-d HA k3s lands, correlated-failure concern on single-vendor UpCloud becomes tractable)
  • F-b / F-c / F-d execution — per DEC-20260421-F timing; F-d is production-reactivation blocker
  • First ambiguous-role session — audit generic-executor fallback (semantic-soundness check on first real use)
  • Retroactive classification audit — inventory open INCs/F/EVOs with severity labels, probe-verify each current-state premise
  • Health-sweep v2 — probe consumer-side token validity on pods with optional: true Secret refs (CLAUDE.md §HEALTH SWEEP FIRST scope expansion candidate)
  • Readiness-probe pattern for token-bearing consumers — fail readiness on 401 → Pod NotReady → alert (generalizable fix for silent-degradation class)
  • Coverage assertions for backup Jobs — declare expected sub-step outcomes; fail Job on mismatch (generalizable fix for F-2026-0421-BACKUP-BUCKET-NAME-MISMATCH class)
  • Bucket-lifecycle policy on ge-k3s-state-snapshots — 30-day TTL per DEC-20260421-F intent; deferred from F-a critical path
  • Pre-commit hook / CI lint for AGENT-REGISTRY.json canonical-diff verification (per pitfalls/agent-registry-canonical-diff.md §Candidate automation)
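
The "coverage assertions for backup Jobs" flag above could take roughly the following shape. This is a minimal sketch, not the existing backup Job: the step names and the way exit codes are collected are hypothetical. The idea is that the Job declares the exit code it expects from each sub-step and reports any mismatch or missing step, so a non-zero sub-step exit can no longer hide behind a Complete status.

```python
import sys


def coverage_failures(expected: dict[str, int], observed: dict[str, int]) -> list[str]:
    """Compare declared sub-step exit codes with the ones actually recorded."""
    failures = []
    for step, want in expected.items():
        if step not in observed:
            # A step that never reported is treated as a failure, not a pass.
            failures.append(f"{step}: no result recorded")
        elif observed[step] != want:
            failures.append(f"{step}: expected exit {want}, got {observed[step]}")
    return failures


def job_exit_code(observed: dict[str, int]) -> int:
    """Return the exit code the wrapping Job should use."""
    expected = {"dump": 0, "upload": 0}  # hypothetical sub-step names
    failures = coverage_failures(expected, observed)
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0
```

A wrapper script would end with `sys.exit(job_exit_code(results))`, which makes the Kubernetes Job fail whenever a declared sub-step is missing or returns an unexpected code.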

Connection info (unchanged from prior sessions)

| Component | How to connect |
| --- | --- |
| UpCloud K8s API (acceptance) | KUBECONFIG=~/.kube/upcloud-acceptance.yaml kubectl-standalone get nodes |
| UpCloud Vault (acceptance) | kubectl-standalone --kubeconfig=~/.kube/upcloud-acceptance.yaml exec -n vault vault-0 -- vault status |
| UpCloud API | Bearer token in Vault at secret/ge/upcloud |
| UpCloud Object Storage (k3s snapshots) | Endpoint https://vtilc.upcloudobjects.com; creds at Vault secret/ge/k3s-snapshots v2; scoped IAM user k3s-snapshot (PutObject + ListBucket + GetBucketLocation + GetObject only) |
| Bunny API | API key in Vault at secret/ge/bunny |
| TransIP API | Credentials in Vault at secret/ge/transip (registrar + client domain registration) |
| Acceptance Vault creds | Root token + unseal keys in fort-knox-dev Vault at secret/ge/upcloud-acceptance-vault |
| GitLab admin PAT | Vault secret/ge/gitlab-pat v3; PAT id=7, expires 2026-07-20, scopes api, read_api, read_repository, write_repository |
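
For scripted uploads to the UpCloud Object Storage endpoint listed above, the checksum setting recorded under Key Learnings has to be present in the client environment. The sketch below only assembles the aws-cli command and environment; it assumes the access key and secret from Vault are already exported in the calling environment, and the function name and object key are illustrative.

```python
import os

ENDPOINT = "https://vtilc.upcloudobjects.com"  # from the connection table
BUCKET = "ge-k3s-state-snapshots"


def upload_command(local_path: str, key: str) -> tuple[list[str], dict[str, str]]:
    """Build the aws-cli invocation and environment for one snapshot upload."""
    env = dict(os.environ)
    # Only send checksums when the operation requires them, to avoid the
    # XAmzContentSHA256Mismatch rejection described in Key Learnings.
    env["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_required"
    cmd = [
        "aws", "s3", "cp", local_path, f"s3://{BUCKET}/{key}",
        "--endpoint-url", ENDPOINT,
    ]
    return cmd, env
```

The pair can be passed to `subprocess.run(cmd, env=env, check=True)`; keeping command construction separate from execution makes the setting easy to assert in a test without touching the bucket.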

Key Learnings

The Key Learnings section is append-only. Each learning was captured from a real incident or mistake. Overwriting them erases institutional memory. See memory/feedback_handoff_append_only for the rule.

2026-04-18 (preserved from prior handoff)

  1. --reuse-values is an IaC anti-pattern — makes helm permanent-memory of any --set ever applied, decoupling live state from git. INC-20260418-C.
  2. Default-enabled ≠ actually-enabled — chart cronjobs can fail silently for 74 days. INC-20260418-B.
  3. Reviewer-execution loop is load-bearing — per-decision cold-read pushback is a governance primitive. Today caught: Linkerd assumption in EVO-001; Option-B-rejection framing; drift classification (a/b/c); day-one gate-activation trap in EVO-007; missing attribution investigation → EVO-008.
  4. Pre-flight discipline pays — three 74-day-class findings surfaced in one session from running pre-flight checks, not waiting for something to break.
  5. K3s default ships with Kubernetes audit logging disabled — attribution of cluster mutations since 2026-02-03 is impossible for the drift window. EVO-008.

2026-04-21 (appended)

  1. Probe before classify — reviewer decisions on security/governance artifacts must verify current artifact state, not rely on classification at filing time. Three instances this session (2 reviewer + 1 executor), all corrected via primary-source verification. Pattern generalizes across severity / gate-status / consumer-state / credential-validity / vendor-trust. See pitfalls/probe-before-classify.md.
  2. DEC filenames from graduated briefs include source-bullet letter — when ≥3 DECs land same-day from a single decisions-sought brief, include the letter in filename (DEC-YYYYMMDD-<LETTER>-<slug>.md) for chain-of-custody back-reference. See pitfalls/dec-filename-source-letter.md.
  3. AGENT-REGISTRY.json amendments require canonical-diff verification: the pre-vs-post diff of jq -S '.agents | del(."<key>")' must be empty. Stronger than git diff-minimality per CLAUDE.md §AGENT REGISTRY SAFETY ("setting an agent to unavailable silently breaks their ability to receive work"). See pitfalls/agent-registry-canonical-diff.md.
  4. Silent service degradation is still #1 operational risk (CLAUDE.md §HEALTH SWEEP FIRST premise confirmed) — ge-orchestrator CI-bridge silently no-op for 19 days because optional: true Secret ref + ci_bridge.py:139 fallback-skip log masked expired token. Same pattern class as F-2026-0421-BACKUP-BUCKET-NAME-MISMATCH (Job Complete masks exit-12). Generalizable fix: coverage assertions + readiness-probes that validate credentials.
  5. UpCloud Object Storage 2.0 S3 requires AWS_REQUEST_CHECKSUM_CALCULATION=when_required — newer botocore (≥2024) default-enables CRC32 on PUT; UpCloud rejects with XAmzContentSHA256Mismatch. Set env var on the aws-cli client (env-from Secret in CronJob), or use equivalent SDK config.
  6. Cluster-path ArgoCD-management inventory is not comprehensive: k8s/base/gitlab/ and by extension the pg-emergency-dump bridge were NEVER ArgoCD-managed; only k8s/base/{agents,hosting,monitoring,core} have Applications. Assuming "everything in k8s/base is managed" is wrong. Probe kubectl get applications.argoproj.io -n argocd before assuming a path is covered.
  7. Stacked MRs are the GE pattern for dependent docs changes — 4-deep stack this session (!28 → !32 → !33 → !34) works cleanly in GitLab by targeting parent branch. Merge order must follow parent-before-child.
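
The canonical-diff check from learning 3 can also be run without jq. The following is a sketch of an equivalent comparison in Python, assuming the registry has a top-level `agents` object as the jq filter implies; the function names and the `status` field in the usage note are hypothetical.

```python
import json


def canonical_without(registry: dict, key: str) -> str:
    """Serialize .agents with sorted keys after dropping the amended entry."""
    agents = dict(registry.get("agents", {}))
    agents.pop(key, None)  # like jq's del(), a missing key is not an error
    return json.dumps(agents, sort_keys=True, indent=2)


def amendment_is_isolated(pre: dict, post: dict, key: str) -> bool:
    """True when the amendment changed nothing except the entry under key."""
    return canonical_without(pre, key) == canonical_without(post, key)
```

Called with the parsed pre- and post-amendment files and the key `generic-executor`, this returns False if any other agent entry changed, for instance an accidental edit to another agent's `status`, which a minimal-looking git diff might not make obvious.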