DOMAIN:INFRASTRUCTURE:BACKUP_DISASTER_RECOVERY

OWNER: otto
UPDATED: 2026-03-24
SCOPE: all backup, restore, and disaster recovery operations across GE
AGENTS: otto (primary), gerco (local storage), arjan (offsite storage), boris/yoanna (database), mira (incident commander)
ISO_27001: A.8.13 (Information backup), A.5.29-30 (Business continuity)


BACKUP:OVERVIEW

PRINCIPLE: 3-2-1 backup rule — always, everywhere, no exceptions
SSOT: PostgreSQL is the source of truth — backups protect the SSOT
ENCRYPTION: AES-256 for all backups at rest, TLS for all backups in transit
RETENTION: daily=30d, weekly=12w, monthly=12m, yearly=7y
SCHEDULE: databases 2am CET, files 3am CET, config continuous (git)
RULE: every backup MUST be restorable — untested backups are not backups


BACKUP:3_2_1_RULE

THREE COPIES

COPY_1: production (live data)
  LOCATION: production database, production filesystem
  OWNER: rutger (Zone 3 production ops)

COPY_2: local backup storage
  LOCATION: fort-knox-dev (Minisforum 790 Pro)
  OWNER: gerco (Zone 1 sysadmin)
  MEDIUM: NVMe SSD

COPY_3: offsite backup storage
  LOCATION: UpCloud Object Storage (different zone from production)
  OWNER: arjan (provisions storage via Terraform)
  MEDIUM: object storage (S3-compatible)

TWO MEDIA TYPES

TYPE_1: disk (SSD/NVMe on fort-knox-dev)
TYPE_2: object storage (UpCloud, different physical infrastructure)

ONE OFFSITE

OFFSITE: UpCloud nl-ams1 (if primary is de-fra1) or vice versa
RULE: EU only — data sovereignty compliance
RULE: offsite must be in a different physical data center than production

WEEKLY_COMPLIANCE_CHECK

TOOL: verification script
RUN: bash scripts/backup-3-2-1-check.sh
CHECK:
  → all 3 copies exist?
  → both media types accessible?
  → offsite location reachable?
  → IF any violation → alert human immediately
  → report to amber for ISO 27001 evidence
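The weekly check above can be sketched as a small bash function. This is a minimal sketch, not the real scripts/backup-3-2-1-check.sh: the directory layout and the rclone remote are assumptions taken from commands elsewhere on this page, the offsite probe is injected as a command string, and the demo runs against a throwaway fixture rather than real storage.

```shell
#!/usr/bin/env bash
# Sketch of a 3-2-1 check. check_321 DIR OFFSITE_CMD prints one line per
# check and returns the number of violations found.
check_321() {
  local dir=$1 offsite_cmd=$2 violations=0
  # COPY_2: at least one encrypted dump present on local backup storage
  if find "$dir" -name '*.gpg' 2>/dev/null | grep -q .; then
    echo "OK: local backup copy present in $dir"
  else
    echo "VIOLATION: no local backup in $dir"
    violations=$((violations + 1))
  fi
  # COPY_3: offsite object storage reachable
  if eval "$offsite_cmd" >/dev/null 2>&1; then
    echo "OK: offsite storage reachable"
  else
    echo "VIOLATION: offsite storage unreachable"
    violations=$((violations + 1))
  fi
  return "$violations"
}

# Demo on a throwaway fixture. A real run would look more like:
#   check_321 /backups "rclone lsd upcloud-s3:ge-backups" || alert the human
demo=$(mktemp -d)
touch "$demo/2026-03-24.dump.gpg"
check_321 "$demo" "true" && echo "3-2-1 compliance: PASS"
rm -rf "$demo"
```

A nonzero return count maps directly to the "alert human immediately" rule: any violation fails the script, which the scheduler can turn into an escalation.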

BACKUP:POSTGRESQL

LOGICAL_BACKUP (pg_dump)

PURPOSE: portable, human-readable, table-level restore capability
SCHEDULE: daily at 2am CET
RETENTION: 30 days daily, 12 months monthly

TOOL: pg_dump
RUN: pg_dump -h {host} -U {user} -d {database} \
  --format=custom \
  --compress=9 \
  --file=/backups/{client}/{project}/pg_dump/{date}.dump

ENCRYPT:
RUN: gpg --batch --symmetric --cipher-algo AES256 \
  --passphrase-file /vault/secrets/backup-key \
  /backups/{client}/{project}/pg_dump/{date}.dump

VERIFY (against the plain .dump — pg_restore cannot read the encrypted .gpg file):
RUN: pg_restore --list /backups/{client}/{project}/pg_dump/{date}.dump
EXPECT: table listing without errors

OFFSITE_COPY:

TOOL: rclone or s3cmd
RUN: rclone copy /backups/{client}/{project}/pg_dump/{date}.dump.gpg \
  upcloud-s3:ge-backups/{client}/{project}/pg_dump/
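The dump, verify, encrypt, and offsite steps above form one pipeline that must stop at the first failure. A minimal sketch of that control flow, with the four steps injected as command strings so the ordering is testable without a real database (in production they would be the pg_dump, pg_restore --list, gpg, and rclone commands shown above):

```shell
#!/usr/bin/env bash
# Sketch of the nightly pipeline ordering; stops at the first failure.
nightly_backup() {
  local dump=$1 verify=$2 encrypt=$3 offsite=$4
  eval "$dump"    || { echo "FAIL: dump";    return 1; }
  # verify BEFORE encrypting: pg_restore cannot read the .gpg file
  eval "$verify"  || { echo "FAIL: verify";  return 1; }
  eval "$encrypt" || { echo "FAIL: encrypt"; return 1; }
  eval "$offsite" || { echo "FAIL: offsite"; return 1; }
  echo "backup pipeline: OK"
}

# Demo with stub commands; substitute the real RUN: lines in production.
nightly_backup "true" "true" "true" "true"
```

Failing fast matters here: an encrypted, offsite copy of a dump that never passed verification would violate the "untested backups are not backups" rule.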

PHYSICAL_BACKUP (pg_basebackup)

PURPOSE: full cluster backup, faster restore for large databases
USE_WHEN: database > 10GB, or RTO requirement < 30 minutes

TOOL: pg_basebackup
RUN: pg_basebackup -h {host} -U replication -D /backups/{client}/base/{date} \
  --checkpoint=fast \
  --wal-method=stream \
  --format=tar \
  --compress=gzip

WAL_ARCHIVING (continuous archival)

PURPOSE: point-in-time recovery (PITR) — restore to any second
USE_WHEN: RPO must be < 1 hour, or data changes rapidly

POSTGRESQL_CONFIG:

archive_mode = on
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
wal_level = replica

POINT_IN_TIME_RECOVERY:

TOOL: tar + PostgreSQL recovery settings (recovery.conf was removed in PostgreSQL 12; use postgresql.conf plus a recovery.signal file)
STEPS:
1. Extract base backup into an empty data directory (pg_restore is for logical dumps, not base backups):
   RUN: tar -xzf /backups/{client}/base/{date}/base.tar.gz -C /var/lib/postgresql/data
2. Configure recovery target in postgresql.conf, then create recovery.signal:
   recovery_target_time = '{YYYY-MM-DD HH:MM:SS+00}'
   restore_command = 'cp /backups/wal/%f %p'
   RUN: touch /var/lib/postgresql/data/recovery.signal
3. Start PostgreSQL — it replays WAL up to the target time
4. VERIFY: check data matches expected state at target time

MANAGED_DATABASE_BACKUP (UpCloud)

FOR_UPCLOUD_MANAGED_DB:
- UpCloud takes automated daily backups (configured via Terraform)
- backup_hour: 2 (2am CET)
- Retention: per UpCloud plan (typically 7-30 days)
- PITR available on Business plans

SUPPLEMENT: otto's pg_dump runs IN ADDITION to UpCloud automated backups
REASON: UpCloud backups cannot be exported — need portable copies for 3-2-1 compliance


BACKUP:K8S_ETCD

ZONE_1 (k3s)

k3s uses SQLite by default (single-node), so the backup is a file copy. Copying the file while k3s is running risks a torn copy; use SQLite's online backup command instead of cp.

TOOL: sqlite3
RUN: sqlite3 /var/lib/rancher/k3s/server/db/state.db \
  ".backup /backups/k3s/etcd/state-{date}.db"
SCHEDULE: daily at 1am CET (before database backups)

ZONE_2/3 (UpCloud MKE)

UpCloud manages etcd for Managed Kubernetes clusters.
BACKUP: managed by UpCloud (included in service)
SUPPLEMENT: all k8s manifests in git (Terraform + YAML) — infrastructure is reproducible

RULE: k8s state (deployments, services, secrets) must be reproducible from git
RULE: secrets backed up separately (Vault backup, not k8s secret backup)


BACKUP:VAULT

PURPOSE: HashiCorp Vault contains all secrets — loss = catastrophic
SCHEDULE: daily at 1:30am CET (before database backups)

TOOL: vault operator raft
RUN: vault operator raft snapshot save /backups/vault/snapshot-{date}.snap
ENCRYPT: gpg encrypt with separate key (not stored in Vault itself!)
OFFSITE: copy to UpCloud Object Storage immediately

RESTORE_PROCEDURE:

TOOL: vault operator raft
RUN: vault operator raft snapshot restore /backups/vault/snapshot-{date}.snap
VERIFY: vault status
VERIFY: vault kv list secret/
WARNING: restoring Vault may invalidate active tokens — all services must re-authenticate

CRITICAL: Vault unseal keys and root token backed up separately, stored offline
CRITICAL: backup encryption key for Vault snapshots stored OUTSIDE Vault (paper + safe)


BACKUP:FILE_BACKUPS

METHOD: borg or restic (deduplicated, encrypted)

TOOL: borg
INIT (once per repository; --encryption is an option of borg init, not borg create):
RUN: borg init --encryption=repokey /backups/borg/{client}
CREATE:
RUN: borg create \
  --compression zstd,6 \
  /backups/borg/{client}::{date} \
  /data/{client}/{project}/

PRUNE (retention enforcement):
RUN: borg prune \
  --keep-daily=30 \
  --keep-weekly=12 \
  --keep-monthly=12 \
  --keep-yearly=7 \
  /backups/borg/{client}

VERIFY:

TOOL: borg
RUN: borg check /backups/borg/{client}
RUN: borg list /backups/borg/{client}
EXPECT: archive listing with dates and sizes, no errors

CONFIG_BACKUPS

METHOD: git — all Terraform, k8s manifests, application configs
SCHEDULE: continuous (every commit)
RETENTION: infinite (git history)
STORAGE: git repository (multiple remotes for redundancy)
RULE: config is code — if it's not in git, it doesn't exist


BACKUP:RETENTION_SCHEDULE

| Period | Keep | Promotion Rule |
|---|---|---|
| Daily | 30 days | every daily backup |
| Weekly | 12 weeks | Sunday backup promoted |
| Monthly | 12 months | 1st of month promoted |
| Yearly | 7 years | January 1st promoted |
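The promotion rule above can be sketched as a tiny bash helper that classifies a backup date, most specific tier first. This is illustrative only and assumes GNU date (the -d flag):

```shell
#!/usr/bin/env bash
# Classify a backup date per the retention table: January 1st -> yearly,
# 1st of any month -> monthly, Sunday -> weekly, everything else -> daily.
promotion_tier() {
  local month_day dow
  month_day=$(date -d "$1" +%m-%d)
  dow=$(date -d "$1" +%u)   # 1 = Monday ... 7 = Sunday (GNU date)
  if   [ "$month_day" = "01-01" ]; then echo yearly
  elif [ "${month_day#*-}" = "01" ]; then echo monthly
  elif [ "$dow" = "7" ]; then echo weekly
  else echo daily
  fi
}

promotion_tier 2026-01-01   # prints: yearly
promotion_tier 2026-03-22   # prints: weekly (a Sunday)
```

Checking the most specific tier first matters: January 1st is also the 1st of a month and may be a Sunday, but it must promote to yearly.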

CLEANUP_SCHEDULE: weekly Sunday 3am
RULES:
- Identify expired backups
- Verify newer backup exists before deleting expired
- NEVER delete last remaining backup
- Log all deletions for audit trail
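A minimal sketch of those cleanup rules, assuming GNU find and a flat directory of dated dumps (the layout is illustrative): an expired dump is deleted only if a newer backup exists, the last remaining backup is never deleted, and every deletion is logged.

```shell
#!/usr/bin/env bash
# prune_daily DIR KEEP_DAYS — delete dumps older than KEEP_DAYS, guarded.
prune_daily() {
  local dir=$1 keep_days=$2 f total
  find "$dir" -name '*.dump.gpg' -mtime +"$keep_days" | while read -r f; do
    total=$(find "$dir" -name '*.dump.gpg' | wc -l)
    if [ "$total" -le 1 ]; then
      echo "SKIP: $f is the last remaining backup"
    elif find "$dir" -name '*.dump.gpg' -newer "$f" | grep -q .; then
      echo "DELETE: $f"   # log line doubles as the audit trail
      rm -- "$f"
    else
      echo "SKIP: no backup newer than $f"
    fi
  done
}

# Demo on a throwaway fixture: the expired dump goes, the fresh one stays.
demo=$(mktemp -d)
touch -d "40 days ago" "$demo/old.dump.gpg"
touch "$demo/new.dump.gpg"
prune_daily "$demo" 30
rm -rf "$demo"
```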


BACKUP:RESTORE_TESTING

SCHEDULE

FREQUENCY: monthly restore test (minimum)
REQUIRED_BY: ISO 27001 A.8.13 — backup testing evidence
OWNER: otto (executes test), amber (audits evidence)

PROCEDURE

1. SELECT backup to test (random selection from last 30 days)
2. CREATE isolated test environment:
   TOOL: kubectl
   RUN: kubectl create namespace restore-test-{date}
3. RESTORE backup to test environment:
   → database: pg_restore to temporary instance
   → files: borg extract to temporary directory
4. VERIFY restoration:
   → schema matches production? (diff pg_dump --schema-only)
   → row counts match expected? (SELECT count(*) for key tables)
   → recent data present? (check last INSERT timestamps)
   → application connects and functions? (basic smoke test)
5. MEASURE metrics:
   → RTO_actual: time from start to verified restoration
   → RPO_actual: gap between backup time and latest data in backup
6. DOCUMENT results (restore test report template)
7. CLEANUP:
   TOOL: kubectl
   RUN: kubectl delete namespace restore-test-{date}
8. REPORT to amber for compliance evidence
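The schema comparison in step 4 can be sketched as a diff of two schema-only dumps. The two .sql inputs would come from pg_dump --schema-only runs against the production and restore-test hosts; the demo below uses fixture files so only the comparison logic is shown.

```shell
#!/usr/bin/env bash
# verify_schema PROD_SQL TEST_SQL — diff two schema-only dumps.
verify_schema() {
  local prod=$1 restored=$2
  if diff -u "$prod" "$restored" > /tmp/schema.diff; then
    echo "schema matches: yes"
  else
    echo "schema matches: no (see /tmp/schema.diff)"
    return 1
  fi
}

# Demo with fixtures standing in for the two pg_dump --schema-only outputs.
prod=$(mktemp) && restored=$(mktemp)
printf 'CREATE TABLE users (id int);\n' > "$prod"
printf 'CREATE TABLE users (id int);\n' > "$restored"
verify_schema "$prod" "$restored"
rm -f "$prod" "$restored"
```

The "schema matches: yes/no" line feeds straight into the VERIFICATION section of the restore test report below.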

RESTORE_TEST_REPORT

# Restore Test — {project}

TESTED_BY: otto
DATE: {ISO timestamp}

## TARGET
database: {name}
backup_date: {date}
backup_size: {size}

## METRICS
RTO_measured: {minutes} (target: <240min)
RPO_measured: {hours} (target: <24h)

## VERIFICATION
- schema matches: {yes/no}
- row count: {expected} / {actual}
- recent data within RPO: {yes/no}
- application connected: {yes/no}

## RESULT: PASS | FAIL
ISSUES: {none | details}

BACKUP:RTO_RPO_DEFINITIONS

RPO (Recovery Point Objective)

DEFINITION: maximum acceptable data loss measured in time
MEANING: "how much data can we afford to lose?"

| Tier | RPO | Backup Method |
|---|---|---|
| Critical | < 1 hour | WAL archiving + streaming replication |
| Standard | < 24 hours | daily pg_dump |
| Low | < 72 hours | daily file backup with weekly retention |

RTO (Recovery Time Objective)

DEFINITION: maximum acceptable time to restore service
MEANING: "how long can the service be down?"

| Tier | RTO | Recovery Method |
|---|---|---|
| Critical | < 1 hour | standby promotion + DNS failover |
| Standard | < 4 hours | restore from backup + redeploy |
| Low | < 24 hours | full rebuild from Terraform + backup restore |

PER_SERVICE_TARGETS

| Service | RPO | RTO | Rationale |
|---|---|---|---|
| Client production DB | 1 hour | 1 hour | revenue-generating, client SLA |
| GE platform (admin-ui, orchestrator) | 24 hours | 4 hours | internal, can tolerate brief outage |
| Wiki brain | 24 hours | 4 hours | knowledge store, git-backed |
| Vault | 24 hours | 1 hour | secrets needed for all other recoveries |
| Redis | N/A (ephemeral) | 15 min | streams rebuilt from source, restart is recovery |

BACKUP:BCP_TEMPLATE

BUSINESS_CONTINUITY_PLAN

# Business Continuity Plan — {project}

VERSION: {X.Y}
REVIEWED_BY: otto
DATE: {ISO timestamp}
NEXT_REVIEW: {date + 3 months}

## 1. SCOPE
project: {project}
client: {client}
systems: {list of systems covered}

## 2. RECOVERY_OBJECTIVES
RPO: {hours}
RTO: {hours}

## 3. BACKUP_INVENTORY
| data | method | frequency | retention | storage |
|---|---|---|---|---|
| PostgreSQL | pg_dump | daily 2am | 30d/12m | local + offsite |
| WAL | archiving | continuous | 7 days | local |
| files | borg | daily 3am | 30d/12m | local + offsite |
| config | git | continuous | infinite | git |
| Vault | raft snapshot | daily 1:30am | 30d | local + offsite |

## 4. RECOVERY_PROCEDURES

### 4a. Database Recovery
1. identify backup
2. decrypt + decompress
3. pg_restore to target
4. verify data integrity
5. reconnect application

### 4b. Full DR Failover
1. HALT — escalate to Dirk-Jan
2. Mira activates DR protocol
3. Otto verifies backups in DR zone (nl-ams1)
4. Arjan activates DR infrastructure
5. Boris promotes standby / restores backup
6. Rutger deploys workloads to DR cluster
7. Stef updates DNS to DR endpoints
8. Karel updates CDN origin
9. Verify + confirm recovery

## 5. CONTACTS
| role | agent | escalation |
|---|---|---|
| incident commander | mira | immediate |
| infrastructure | arjan | immediate |
| DBA | boris | immediate |
| human | dirk-jan | within 15 min |

## 6. TEST_HISTORY
| date | type | result | rto_actual | rpo_actual |
|---|---|---|---|---|
| {date} | restore test | pass | {X}h | {X}h |

BACKUP:DR_DRILL_SCHEDULE

DRILL_TYPES

| Drill Type | Frequency | Duration | Scope |
|---|---|---|---|
| Backup restore test | monthly | 2 hours | single database restore |
| Partial DR failover | quarterly | 4 hours | single client to DR zone |
| Full DR failover | annually | 8 hours | complete platform to DR zone |
| BCP tabletop exercise | quarterly | 2 hours | walk-through without execution |

DRILL_PROCEDURE

PRE_DRILL:
1. Otto schedules drill (minimum 1 week notice)
2. Notify affected agents (mira, arjan, boris, rutger, stef, karel)
3. Prepare test environment (isolated from production)
4. Document expected outcomes

DURING_DRILL:
1. Otto initiates drill scenario
2. Each agent executes their recovery role
3. All actions timestamped (same as real incident)
4. Measure actual RTO and RPO

POST_DRILL:
1. Compare actual vs target RTO/RPO
2. Document issues discovered
3. Update BCP based on findings
4. Report to amber for compliance evidence
5. Schedule follow-up for any failed steps

BACKUP:ISO_27001_MAPPING

| ISO Control | GE Implementation | Evidence |
|---|---|---|
| A.8.13 Information backup | 3-2-1 rule, encrypted, retention schedule | backup logs, 3-2-1 check reports |
| A.5.29 Information security during disruption | BCP per client, DR zone ready | BCP documents, DR drill reports |
| A.5.30 ICT readiness for business continuity | regular DR drills, restore testing | drill reports, restore test reports |
| A.8.14 Redundancy of information processing | multi-zone architecture (de-fra1 + nl-ams1) | Terraform state, zone configuration |

BACKUP:AGENT_WORKFLOW

FOR_OTTO

ON_BACKUP_TASK:
1. READ this page for backup standards
2. CHECK 3-2-1 compliance weekly
3. EXECUTE restore tests monthly
4. MAINTAIN BCP documents per client
5. COORDINATE DR drills per schedule
6. REPORT to amber for compliance evidence
7. NEVER restore directly to production — always to isolated environment first

FOR_GERCO (local storage)

ON_STORAGE_REQUEST:
1. PROVISION local backup storage on fort-knox-dev
2. MONITOR disk space and health
3. ALERT otto if storage approaching capacity
4. MAINTAIN backup directory structure

FOR_ARJAN (offsite storage)

ON_OFFSITE_REQUEST:
1. PROVISION UpCloud Object Storage via Terraform
2. CONFIGURE retention policies
3. VERIFY offsite replication working
4. EU-only zones (data sovereignty)


BACKUP:ANTI_PATTERNS

BEFORE_EVERY_BACKUP_ACTION:
1. Am I storing backup encryption keys in the same place as backups? (NEVER)
2. Am I skipping backup verification? (NEVER — unverified backup = no backup)
3. Am I restoring directly to production? (NEVER — isolated environment first)
4. Am I deleting the last remaining backup? (NEVER)
5. Am I backing up to the same physical location? (NEVER — offsite required)
6. Am I storing backups outside EU? (NEVER — data sovereignty)
7. Am I skipping monthly restore tests? (NEVER — ISO 27001 requires evidence)
8. Am I keeping backups longer than retention policy? (check GDPR — data minimization)


BACKUP:CROSS_REFERENCES

KUBERNETES_OPERATIONS: domains/infrastructure/kubernetes-operations.md — etcd backup, pod recovery
TERRAFORM_PATTERNS: domains/infrastructure/terraform-patterns.md — storage provisioning
DEPLOYMENT_STRATEGIES: domains/infrastructure/deployment-strategies.md — rollback as recovery
INCIDENT_RESPONSE: domains/incident-response/index.md — DR activation during incidents
COMPLIANCE: domains/compliance-frameworks/index.md — ISO 27001 backup controls
DATABASE: domains/database/index.md — PostgreSQL-specific backup patterns