Session Handoff: 2026-04-21 (Governance marathon: 7 MRs + F-a deployment + probe-before-classify pitfall named)
Landed on branches (awaiting DJ merge)
7 MRs open, stacked 4-deep with 3 sibling branches:
| MR | Target | Title | Commits |
|---|---|---|---|
| !28 | main | docs(governance): 2026-04-20 follow-ups — cutover + INC-A + F-ARGOCD | 8 |
| !29 | main | docs(evidence): F-2026-0421 — backup bucket name mismatch (MEDIUM, latent) | 2 |
| !30 | main | chore(infra): F-a k3s state.db snapshots — AND-gate-2 unblocker | 1 |
| !31 | main | chore(bridge-removal): remove pg-emergency-dump — INC-20260418-B closure | 2 |
| !32 | !28 branch | docs(evidence): INC-A HIGH→MEDIUM + F-gitlab-bridge-silent-degrade | 2 |
| !33 | !32 branch | docs(governance): resolve 2026-04-21 DJ decisions (A/B/C/D/E/F) | 2 |
| !34 | !33 branch | chore(registry): add generic-executor identity — DEC-20260421-E.2 | 1 |
Stack order for merging: !28 → !32 → !33 → !34 (parent-before-child); !29, !30, !31 independent.
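The parent-before-child rule can be checked mechanically. The following is a minimal Python sketch, not GE tooling: the `TARGETS` mapping is transcribed from the table above, and `merge_order` and `stack_chain` are hypothetical helper names introduced only for illustration.

```python
# MR -> target, transcribed from the table above.
# A target is either "main" or the parent MR whose branch is targeted.
TARGETS = {
    "!28": "main",
    "!29": "main",
    "!30": "main",
    "!31": "main",
    "!32": "!28",
    "!33": "!32",
    "!34": "!33",
}


def merge_order(targets):
    """Return MRs ordered so that every parent is merged before its child."""
    order = []
    placed = {"main"}
    pending = dict(targets)
    while pending:
        # An MR is ready once its target has already been merged.
        ready = sorted(mr for mr, parent in pending.items() if parent in placed)
        if not ready:
            raise ValueError(f"cycle or unknown target among: {sorted(pending)}")
        for mr in ready:
            order.append(mr)
            placed.add(mr)
            del pending[mr]
    return order


def stack_chain(targets, leaf):
    """Walk from a leaf MR up to main and return the chain root-first."""
    chain = []
    current = leaf
    while current != "main":
        chain.append(current)
        current = targets[current]
    return list(reversed(chain))


if __name__ == "__main__":
    print(" -> ".join(stack_chain(TARGETS, "!34")))
    print(merge_order(TARGETS))
```

Any order returned by `merge_order` is safe; the independent MRs (!29, !30, !31) can be merged at any point relative to the stack.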
Executor actions post-merge (α-authorized, awaiting merge)
- MR !31 Phase 1: `kubectl -n ge-gitlab delete cronjob pg-emergency-dump` (trivially reversible)
- MR !31 Phase 2 (+24h): `kubectl -n ge-gitlab delete pvc pg-emergency-dump` (loses 14-day pg_dumpall history; redundant vs now-working chart-installed backup)
- MR !32 annotation purge: `kubectl -n ge-agents annotate secret gitlab-ci-bridge-token kubectl.kubernetes.io/last-applied-configuration-` (closes CLAUDE.md §Vault compliance gap; token value has been inert since 2026-04-02)

All three were routed through the α/β/γ pattern in their respective MR bodies; the reviewer ACK'd α for all three.
Infrastructure deltas live (executed this session)
- UpCloud Object Storage 2.0 service `ge-k3s-state-snapshots` (uuid `12ff6556-ee75-41fa-ab7d-9c92c411ef82`, region `europe-1`, endpoint `vtilc.upcloudobjects.com`) with scoped IAM user `k3s-snapshot` + policy + bucket `ge-k3s-state-snapshots`. New external dependency.
- Vault path `secret/ge/k3s-snapshots` (v2): new credentials + metadata entry for F-a cron
- K8s CronJob `ge-system/k3s-state-db-snapshot` + NetworkPolicy `k3s-state-db-snapshot-egress` + Secret `k3s-snapshot-upcloud-creds`: live; first snapshot (10.3 MB gzipped from 198 MB raw state.db) verified 2026-04-21 08:39:34 UTC
- Governance PAT rotation: `secret/ge/gitlab-pat` v3; old PAT id=5 revoked; new PAT id=7 active with 2026-07-20 expiry
- AGENT-REGISTRY.json gains `generic-executor` entry (via MR !34, awaiting merge)
Incidents + findings + decisions (this session)
| Type | ID | Status |
|---|---|---|
| INC | INC-20260420-A | Reclassified HIGH→MEDIUM 2026-04-21 (probe-verified inert — token expired 2026-04-02, not live as initially classified) |
| F | F-2026-0421-BACKUP-BUCKET-NAME-MISMATCH | MEDIUM, latent — surfaced during gate-3 cold-read |
| F | F-2026-0421-GITLAB-BRIDGE-SILENT-DEGRADATION | MEDIUM + names probe-before-classify pitfall |
| F | F-2026-0420-ARGOCD-AUTO-ROTATE | MEDIUM governance — routing-commitment violation |
| EVO | EVO-2026-0420-004 | HIGH operational — primary-worktree drift stub |
| EVO | EVO-2026-0420-005 + amendment | Attribution cutover executed (Executor→Abby split) |
| DEC | DEC-20260420-attribution-cutover | Filed |
| DEC | DEC-20260421-A/B/C/D/E/F | 6 DECs ratified via DJ delegation to reviewer 2026-04-21 |
Pitfalls added today
- `pitfalls/probe-before-classify.md`: load-bearing new pattern; 3 instances documented (2 reviewer-side + 1 executor-side), all corrected via primary-source verification
- `pitfalls/dec-filename-source-letter.md`: micro-precedent (≥3 DECs from graduated brief → include letter in filename)
- `pitfalls/agent-registry-canonical-diff.md`: micro-precedent (jq canonical-diff backstop for registry amendments)
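The canonical-diff backstop for registry amendments can also be expressed outside of jq. The following is a minimal Python sketch under stated assumptions: the registry has a top-level `agents` object keyed by agent name (as the jq expression `.agents | del(."<key>")` implies), and the `status` field and agent names in the usage note are hypothetical. `canonical_without` and `amendment_is_isolated` are illustrative names, not existing GE helpers.

```python
import json


def canonical_without(registry, key):
    """Serialize .agents with one key removed, keys sorted (like jq -S)."""
    agents = dict(registry["agents"])
    agents.pop(key, None)
    return json.dumps(agents, sort_keys=True, indent=2)


def amendment_is_isolated(pre, post, key):
    """True if the amendment touched nothing in .agents except `key`."""
    return canonical_without(pre, key) == canonical_without(post, key)


if __name__ == "__main__":
    # Hypothetical registry contents, for illustration only.
    pre = {"agents": {"abby": {"status": "available"}}}
    post = {"agents": {"abby": {"status": "available"},
                       "generic-executor": {"status": "available"}}}
    print(amendment_is_isolated(pre, post, "generic-executor"))
```

Comparing sorted serializations rather than raw file diffs means key reordering or whitespace changes do not hide an accidental edit to another agent's entry.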
Bridge-removal saga: CLOSED (this session)
74-day zero-backup gap (INC-20260418-B, started 2026-02-03) closed end-to-end:
- Chart-installed backup working (2026-04-18 via EVO-001)
- Canary-verify V1-V5 PASS (2026-04-20 via MR !27)
- Gate-3 cold-read of 2026-04-21 00:04 UTC scheduled cycle: 5.36 GB tar, Complete
- AND-gate-2 satisfied by F-a deployment (MR !30) after executor pushback on reviewer's 3-gate cold-read (which omitted AND-gate-2)
- Bridge removal MR !31 awaiting merge; phased cluster-side delete α-authorized
Top of next session
- Confirm DJ merge chain: `!28 → !32 → !33 → !34` in order + `!29` + `!30` + `!31` independent
- Execute α-authorized cluster actions post each relevant merge: paste kubectl outputs in respective MR threads
- First post-merge health check — k3s snapshot cron still firing 02:30 UTC daily; second snapshot landing in UpCloud bucket
- EVO-2026-0420-002 Track A — scoped ci-bot for ArgoCD (absorbs F-c Vault-CSI-Secret-offload + ge-orchestrator CI-bridge service-restoration)
- DEC-20260421-D execution — ARGOCD prior-session spot-check (3 commits: first/middle/last of 8-commit stack on MR !27)
- F-b Velero — this sprint per DEC-20260421-F timing
Carry-forward flags (next session pick-list)
- New decisions-sought bullet (future brief): GitLab user provisioning for DJ (closes F-2026-0420-ARGOCD-AUTO-ROTATE Signal 2 and EVO-2026-0420-005 Parallel-signal gaps) + label taxonomy (MR !28/!29/etc have no labels because no taxonomy exists)
- fort-knox-dev Vault secondary — operational-state investigation (30-min inspection; is it real, aspirational, or stale?)
- Bunny Edge Storage — as second off-platform backup destination (EVO candidate: backup-destination-diversification; once F-d HA k3s lands, correlated-failure concern on single-vendor UpCloud becomes tractable)
- F-b / F-c / F-d execution — per DEC-20260421-F timing; F-d is production-reactivation blocker
- First ambiguous-role session — audit generic-executor fallback (semantic-soundness check on first real use)
- Retroactive classification audit — inventory open INCs/F/EVOs with severity labels, probe-verify each current-state premise
- Health-sweep v2: probe consumer-side token validity on pods with `optional: true` Secret refs (CLAUDE.md §HEALTH SWEEP FIRST scope expansion candidate)
- Readiness-probe pattern for token-bearing consumers: fail readiness on 401 → Pod NotReady → alert (generalizable fix for silent-degradation class)
- Coverage assertions for backup Jobs: declare expected sub-step outcomes; fail Job on mismatch (generalizable fix for F-2026-0421-BACKUP-BUCKET-NAME-MISMATCH class)
- Bucket-lifecycle policy on `ge-k3s-state-snapshots`: 30-day TTL per DEC-20260421-F intent; deferred from F-a critical path
- Pre-commit hook / CI lint for AGENT-REGISTRY.json canonical-diff verification (per `pitfalls/agent-registry-canonical-diff.md` §Candidate automation)
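The coverage-assertion idea in the list above could look roughly like the following. This is a minimal Python sketch, assuming the backup Job can record an exit code per sub-step; the step names `pg_dump` and `bucket_upload` in the usage note are hypothetical, and `check_coverage` and `job_exit_code` are illustrative names rather than existing code.

```python
import sys


def check_coverage(expected, actual):
    """Compare declared sub-step exit codes with observed ones.

    Returns a list of human-readable mismatches; an empty list means the
    Job did everything it declared it would do.
    """
    problems = []
    for step, want in expected.items():
        if step not in actual:
            problems.append(f"{step}: did not run (expected exit {want})")
        elif actual[step] != want:
            problems.append(f"{step}: exit {actual[step]}, expected {want}")
    return problems


def job_exit_code(expected, actual):
    """Return a non-zero exit code if any declared sub-step is missing or failed."""
    problems = check_coverage(expected, actual)
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0


if __name__ == "__main__":
    # Hypothetical step names; a real Job would collect these codes itself.
    expected = {"pg_dump": 0, "bucket_upload": 0}
    actual = {"pg_dump": 0, "bucket_upload": 12}
    sys.exit(job_exit_code(expected, actual))
```

The point of the design is that the Job's final status is derived from the declared expectations, so a failed sub-step cannot be hidden behind an overall `Complete`.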
Connection info (unchanged from prior sessions)
| Component | How to connect |
|---|---|
| UpCloud K8s API (acceptance) | KUBECONFIG=~/.kube/upcloud-acceptance.yaml kubectl-standalone get nodes |
| UpCloud Vault (acceptance) | kubectl-standalone --kubeconfig=~/.kube/upcloud-acceptance.yaml exec -n vault vault-0 -- vault status |
| UpCloud API | Bearer token in Vault at secret/ge/upcloud |
| UpCloud Object Storage (k3s snapshots) | Endpoint https://vtilc.upcloudobjects.com; creds at Vault secret/ge/k3s-snapshots v2; scoped IAM user k3s-snapshot (PutObject + ListBucket + GetBucketLocation + GetObject only) |
| Bunny API | API key in Vault at secret/ge/bunny |
| TransIP API | Credentials in Vault at secret/ge/transip (registrar + client domain registration) |
| Acceptance Vault creds | Root token + unseal keys in fort-knox-dev Vault at secret/ge/upcloud-acceptance-vault |
| GitLab admin PAT | Vault secret/ge/gitlab-pat v3; PAT id=7, expires 2026-07-20, scopes api, read_api, read_repository, write_repository |
Key Learnings
The Key Learnings section is append-only. Each learning was captured from a real incident or mistake. Overwriting them erases institutional memory. See memory/feedback_handoff_append_only for the rule.
2026-04-18 (preserved from prior handoff)
- `--reuse-values` is an IaC anti-pattern: makes helm permanent-memory of any `--set` ever applied, decoupling live state from git. INC-20260418-C.
- Default-enabled ≠ actually-enabled: chart cronjobs can fail silently for 74 days. INC-20260418-B.
- Reviewer-execution loop is load-bearing — per-decision cold-read pushback is a governance primitive. Today caught: Linkerd assumption in EVO-001; Option-B-rejection framing; drift classification (a/b/c); day-one gate-activation trap in EVO-007; missing attribution investigation → EVO-008.
- Pre-flight discipline pays — three 74-day-class findings surfaced in one session from running pre-flight checks, not waiting for something to break.
- K3s default ships with Kubernetes audit logging disabled — attribution of cluster mutations since 2026-02-03 is impossible for the drift window. EVO-008.
2026-04-21 (appended)
- Probe before classify: reviewer decisions on security/governance artifacts must verify current artifact state, not rely on classification at filing time. Three instances this session (2 reviewer + 1 executor), all corrected via primary-source verification. Pattern generalizes across severity / gate-status / consumer-state / credential-validity / vendor-trust. See `pitfalls/probe-before-classify.md`.
- DEC filenames from graduated briefs include source-bullet letter: when ≥3 DECs land same-day from a single decisions-sought brief, include the letter in filename (`DEC-YYYYMMDD-<LETTER>-<slug>.md`) for chain-of-custody back-reference. See `pitfalls/dec-filename-source-letter.md`.
- AGENT-REGISTRY.json amendments require canonical-diff verification: the diff of `jq -S '.agents | del(."<key>")'` output pre vs post must be empty. Stronger than `git diff`-minimality per CLAUDE.md §AGENT REGISTRY SAFETY ("setting an agent to `unavailable` silently breaks their ability to receive work"). See `pitfalls/agent-registry-canonical-diff.md`.
- Silent service degradation is still the #1 operational risk (CLAUDE.md §HEALTH SWEEP FIRST premise confirmed): the `ge-orchestrator` CI-bridge was a silent no-op for 19 days because an `optional: true` Secret ref + the `ci_bridge.py:139` fallback-skip log masked the expired token. Same pattern class as F-2026-0421-BACKUP-BUCKET-NAME-MISMATCH (Job `Complete` masks exit-12). Generalizable fix: coverage assertions + readiness-probes that validate credentials.
- UpCloud Object Storage 2.0 S3 requires `AWS_REQUEST_CHECKSUM_CALCULATION=when_required`: newer botocore (≥2024) default-enables CRC32 on PUT; UpCloud rejects with `XAmzContentSHA256Mismatch`. Set the env var on the aws-cli client (env-from Secret in CronJob), or use equivalent SDK config.
- Cluster-path ArgoCD-management inventory is not comprehensive: `k8s/base/gitlab/` and by extension the `pg-emergency-dump` bridge were NEVER ArgoCD-managed; only `k8s/base/{agents,hosting,monitoring,core}` have Applications. Assuming "everything in k8s/base is managed" is wrong. Probe `kubectl get applications.argoproj.io -n argocd` before assuming a path is covered.
- Stacked MRs are the GE pattern for dependent docs changes: the 4-deep stack this session (`!28 → !32 → !33 → !34`) works cleanly in GitLab by targeting the parent branch. Merge order must follow parent-before-child.
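For the UpCloud checksum learning above, the env-var workaround can be illustrated as follows. This is a minimal Python sketch, not the CronJob's actual script: it only builds an `aws s3 cp` invocation and an environment with `AWS_REQUEST_CHECKSUM_CALCULATION=when_required` set. The function name `snapshot_upload_command`, the local path, and the object key are hypothetical; the bucket and endpoint are the ones named in this handoff.

```python
import os


def snapshot_upload_command(local_path, bucket, key, endpoint_url, base_env=None):
    """Build an aws-cli upload command plus an environment for it.

    The environment carries AWS_REQUEST_CHECKSUM_CALCULATION=when_required so
    the client only sends a request checksum when the operation requires one.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_required"
    cmd = [
        "aws", "s3", "cp",
        local_path,
        f"s3://{bucket}/{key}",
        "--endpoint-url", endpoint_url,
    ]
    return cmd, env


if __name__ == "__main__":
    # Hypothetical local path and object key, for illustration only.
    cmd, env = snapshot_upload_command(
        "/tmp/state.db.gz",
        "ge-k3s-state-snapshots",
        "state.db.gz",
        "https://vtilc.upcloudobjects.com",
    )
    print(" ".join(cmd))
    print(env["AWS_REQUEST_CHECKSUM_CALCULATION"])
```

A caller would pass the pair to `subprocess.run(cmd, env=env, check=True)`; credentials are expected to arrive through the environment (for example env-from Secret), so nothing sensitive appears in the command line.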