DOMAIN:INFRASTRUCTURE:BACKUP_DISASTER_RECOVERY¶
OWNER: otto
UPDATED: 2026-03-24
SCOPE: all backup, restore, and disaster recovery operations across GE
AGENTS: otto (primary), gerco (local storage), arjan (offsite storage), boris/yoanna (database), mira (incident commander)
ISO_27001: A.8.13 (Information backup), A.5.29-30 (Business continuity)
BACKUP:OVERVIEW¶
PRINCIPLE: 3-2-1 backup rule — always, everywhere, no exceptions
SSOT: PostgreSQL is the source of truth — backups protect the SSOT
ENCRYPTION: AES-256 for all backups at rest, TLS for all backups in transit
RETENTION: daily=30d, weekly=12w, monthly=12m, yearly=7y
SCHEDULE: databases 2am CET, files 3am CET, config continuous (git)
RULE: every backup MUST be restorable — untested backups are not backups
BACKUP:3_2_1_RULE¶
THREE COPIES¶
COPY_1: production (live data)
LOCATION: production database, production filesystem
OWNER: rutger (Zone 3 production ops)
COPY_2: local backup storage
LOCATION: fort-knox-dev (Minisforum 790 Pro)
OWNER: gerco (Zone 1 sysadmin)
MEDIUM: NVMe SSD
COPY_3: offsite backup storage
LOCATION: UpCloud Object Storage (different zone from production)
OWNER: arjan (provisions storage via Terraform)
MEDIUM: object storage (S3-compatible)
TWO MEDIA TYPES¶
TYPE_1: disk (SSD/NVMe on fort-knox-dev)
TYPE_2: object storage (UpCloud, different physical infrastructure)
ONE OFFSITE¶
OFFSITE: UpCloud nl-ams1 (if primary is de-fra1) or vice versa
RULE: EU only — data sovereignty compliance
RULE: offsite must be in a different physical data center than production
WEEKLY_COMPLIANCE_CHECK¶
TOOL: verification script
RUN: bash scripts/backup-3-2-1-check.sh
CHECK:
→ all 3 copies exist?
→ both media types accessible?
→ offsite location reachable?
→ IF any violation → alert human immediately
→ report to amber for ISO 27001 evidence
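A minimal sketch of what the weekly check might verify. The /backups path and the upcloud-s3:ge-backups rclone remote name are taken from RUN lines elsewhere on this page; the real logic lives in scripts/backup-3-2-1-check.sh.

```shell
# Sketch only — not the actual backup-3-2-1-check.sh.

# Copy 2: at least one backup file newer than 48h under the local tree.
local_copy_ok() {
  find "$1" -type f -mmin -2880 2>/dev/null | grep -q .
}

# Copy 3: offsite S3-compatible remote is listable via rclone.
offsite_ok() {
  rclone lsd "$1" >/dev/null 2>&1
}

three_two_one_check() {
  local dir="${1:-/backups}" remote="${2:-upcloud-s3:ge-backups}" fail=0
  if local_copy_ok "$dir"; then
    echo "OK: local backup copy present"
  else
    echo "VIOLATION: no recent local backup under $dir"; fail=1
  fi
  if offsite_ok "$remote"; then
    echo "OK: offsite storage reachable"
  else
    echo "VIOLATION: offsite remote $remote unreachable"; fail=1
  fi
  return "$fail"   # non-zero → alert human, report to amber
}
```

A non-zero return maps directly to the "alert human immediately" rule above.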
BACKUP:POSTGRESQL¶
LOGICAL_BACKUP (pg_dump)¶
PURPOSE: portable, human-readable, table-level restore capability
SCHEDULE: daily at 2am CET
RETENTION: 30 days daily, 12 months monthly
TOOL: pg_dump
RUN: pg_dump -h {host} -U {user} -d {database} \
--format=custom \
--compress=9 \
--file=/backups/{client}/{project}/pg_dump/{date}.dump
VERIFY (before encryption — pg_restore cannot read an encrypted file):
RUN: pg_restore --list /backups/{client}/{project}/pg_dump/{date}.dump
EXPECT: table listing without errors
ENCRYPT:
RUN: gpg --symmetric --cipher-algo AES256 \
--passphrase-file /vault/secrets/backup-key \
/backups/{client}/{project}/pg_dump/{date}.dump
OFFSITE_COPY:
TOOL: rclone or s3cmd
RUN: rclone copy /backups/{client}/{project}/pg_dump/{date}.dump.gpg \
upcloud-s3:ge-backups/{client}/{project}/pg_dump/
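The dump → verify → encrypt → ship sequence above can be sketched as one wrapper. Host, paths, and the rclone remote are the placeholders from the RUN lines; DRY_RUN=1 prints commands instead of executing them.

```shell
# Sketch of the nightly pipeline. Verification runs on the plain .dump,
# before encryption, since pg_restore cannot list a .gpg file.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

nightly_pg_backup() {  # host user database client project date
  local f="/backups/$4/$5/pg_dump/$6.dump"
  run pg_dump -h "$1" -U "$2" -d "$3" --format=custom --compress=9 --file="$f"
  run pg_restore --list "$f"                      # verify BEFORE encrypting
  run gpg --symmetric --cipher-algo AES256 \
        --passphrase-file /vault/secrets/backup-key "$f"
  run rclone copy "$f.gpg" "upcloud-s3:ge-backups/$4/$5/pg_dump/"
}
```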
PHYSICAL_BACKUP (pg_basebackup)¶
PURPOSE: full cluster backup, faster restore for large databases
USE_WHEN: database > 10GB, or RTO requirement < 30 minutes
TOOL: pg_basebackup
RUN: pg_basebackup -h {host} -U replication -D /backups/{client}/base/{date} \
--checkpoint=fast \
--wal-method=stream \
--format=tar \
--compress=gzip
NOTE: --compress requires tar format (--format=tar); plain format cannot be compressed client-side
WAL_ARCHIVING (continuous archival)¶
PURPOSE: point-in-time recovery (PITR) — restore to any second
USE_WHEN: RPO must be < 1 hour, or data changes rapidly
POSTGRESQL_CONFIG:
archive_mode = on
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
wal_level = replica
POINT_IN_TIME_RECOVERY:
TOOL: base backup + WAL replay (recovery.conf was removed in PostgreSQL 12 — use postgresql.conf + recovery.signal)
STEPS:
1. Stop PostgreSQL, then place the base backup in the data directory (pg_restore does not restore base backups — copy the files, or extract the tar archives if the backup used --format=tar):
RUN: cp -a /backups/{client}/base/{date}/. /var/lib/postgresql/data/
2. Configure recovery target in postgresql.conf and create recovery.signal:
recovery_target_time = '{YYYY-MM-DD HH:MM:SS+00}'
restore_command = 'cp /backups/wal/%f %p'
RUN: touch /var/lib/postgresql/data/recovery.signal
3. Start PostgreSQL — it replays WAL to the target time
4. VERIFY: check data matches expected state at target time
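The recovery-target step can be sketched as a helper that appends the settings and drops the recovery.signal marker (PostgreSQL 12+ layout; the data directory path is whatever the target instance uses):

```shell
# Sketch: configure a PITR target on an already-restored data directory.
write_recovery_target() {  # data_dir target_time
  cat >>"$1/postgresql.conf" <<EOF
recovery_target_time = '$2'
restore_command = 'cp /backups/wal/%f %p'
EOF
  touch "$1/recovery.signal"   # tells PostgreSQL to enter recovery on start
}
```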
MANAGED_DATABASE_BACKUP (UpCloud)¶
FOR_UPCLOUD_MANAGED_DB:
- UpCloud takes automated daily backups (configured via Terraform)
- backup_hour: 2 (2am CET)
- Retention: per UpCloud plan (typically 7-30 days)
- PITR available on Business plans
SUPPLEMENT: otto's pg_dump runs IN ADDITION to UpCloud automated backups
REASON: UpCloud backups cannot be exported — need portable copies for 3-2-1 compliance
BACKUP:K8S_ETCD¶
ZONE_1 (k3s)¶
k3s uses SQLite by default (single-node). A plain cp of the live SQLite file can capture an inconsistent state, so use SQLite's online backup command instead.
TOOL: sqlite3
RUN: sqlite3 /var/lib/rancher/k3s/server/db/state.db \
".backup /backups/k3s/etcd/state-{date}.db"
SCHEDULE: daily at 1am CET (before database backups)
ZONE_2/3 (UpCloud MKE)¶
UpCloud manages etcd for Managed Kubernetes clusters.
BACKUP: managed by UpCloud (included in service)
SUPPLEMENT: all k8s manifests in git (Terraform + YAML) — infrastructure is reproducible
RULE: k8s state (deployments, services, secrets) must be reproducible from git
RULE: secrets backed up separately (Vault backup, not k8s secret backup)
BACKUP:VAULT¶
PURPOSE: HashiCorp Vault contains all secrets — loss = catastrophic
SCHEDULE: daily at 1:30am CET (before database backups)
TOOL: vault operator raft
RUN: vault operator raft snapshot save /backups/vault/snapshot-{date}.snap
ENCRYPT: gpg encrypt with separate key (not stored in Vault itself!)
OFFSITE: copy to UpCloud Object Storage immediately
RESTORE_PROCEDURE:
TOOL: vault operator raft
RUN: vault operator raft snapshot restore /backups/vault/snapshot-{date}.snap
VERIFY: vault status
VERIFY: vault kv list secret/
WARNING: restoring Vault may invalidate active tokens — all services must re-authenticate
CRITICAL: Vault unseal keys and root token backed up separately, stored offline CRITICAL: backup encryption key for Vault snapshots stored OUTSIDE Vault (paper + safe)
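The snapshot → encrypt → offsite flow above might look like the following sketch. The gpg passphrase file path is a placeholder (per the CRITICAL note it must live outside Vault), and DRY_RUN=1 prints commands instead of executing them.

```shell
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

vault_backup() {  # date
  local snap="/backups/vault/snapshot-$1.snap"
  run vault operator raft snapshot save "$snap"
  # Encryption key stored OUTSIDE Vault (see CRITICAL above) — path is hypothetical.
  run gpg --symmetric --cipher-algo AES256 \
        --passphrase-file /root/vault-snapshot-key "$snap"
  run rclone copy "$snap.gpg" upcloud-s3:ge-backups/vault/
}
```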
BACKUP:FILE_BACKUPS¶
METHOD: borg or restic (deduplicated, encrypted)¶
TOOL: borg
INIT (once per repo — encryption mode is fixed at init time, not per backup):
RUN: borg init --encryption=repokey /backups/borg/{client}
RUN: borg create \
--compression zstd,6 \
/backups/borg/{client}::{date} \
/data/{client}/{project}/
PRUNE (retention enforcement):
RUN: borg prune \
--keep-daily=30 \
--keep-weekly=12 \
--keep-monthly=12 \
--keep-yearly=7 \
/backups/borg/{client}
VERIFY:
TOOL: borg
RUN: borg check /backups/borg/{client}
RUN: borg list /backups/borg/{client}
EXPECT: archive listing with dates and sizes, no errors
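A combined create/prune/check cycle might be wrapped as below. Note that the encryption mode is fixed when the repo is created (`borg init --encryption=repokey`), not per `borg create`. The repo and source paths are placeholders; DRY_RUN=1 prints commands.

```shell
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

borg_cycle() {  # repo source_dir archive_name
  run borg create --compression zstd,6 "$1::$3" "$2"
  run borg prune --keep-daily=30 --keep-weekly=12 \
        --keep-monthly=12 --keep-yearly=7 "$1"
  run borg check "$1"   # verify repository + archive consistency
}
```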
CONFIG_BACKUPS¶
METHOD: git — all Terraform, k8s manifests, application configs
SCHEDULE: continuous (every commit)
RETENTION: infinite (git history)
STORAGE: git repository (multiple remotes for redundancy)
RULE: config is code — if it's not in git, it doesn't exist
BACKUP:RETENTION_SCHEDULE¶
| Period | Keep | Promotion Rule |
|---|---|---|
| Daily | 30 days | Every daily backup |
| Weekly | 12 weeks | Sunday backup promoted |
| Monthly | 12 months | 1st of month promoted |
| Yearly | 7 years | January 1st promoted |
CLEANUP_SCHEDULE: weekly Sunday 3am
RULES:
- Identify expired backups
- Verify newer backup exists before deleting expired
- NEVER delete last remaining backup
- Log all deletions for audit trail
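The promotion rules in the table reduce to a small classifier (a sketch; GNU `date -d` is assumed, dates in YYYY-MM-DD):

```shell
# Map a backup date to its retention class per the table above.
retention_class() {  # YYYY-MM-DD
  local dow day month
  dow=$(date -d "$1" +%u)     # 1=Mon .. 7=Sun
  day=$(date -d "$1" +%d)
  month=$(date -d "$1" +%m)
  if [ "$month" = "01" ] && [ "$day" = "01" ]; then echo yearly
  elif [ "$day" = "01" ]; then echo monthly
  elif [ "$dow" = "7" ]; then echo weekly
  else echo daily
  fi
}
```

January 1st outranks "1st of month", which outranks "Sunday", matching the promotion order of the table.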
BACKUP:RESTORE_TESTING¶
SCHEDULE¶
FREQUENCY: monthly restore test (minimum)
REQUIRED_BY: ISO 27001 A.8.13 — backup testing evidence
OWNER: otto (executes test), amber (audits evidence)
PROCEDURE¶
1. SELECT backup to test (random selection from last 30 days)
2. CREATE isolated test environment:
TOOL: kubectl
RUN: kubectl create namespace restore-test-{date}
3. RESTORE backup to test environment:
→ database: pg_restore to temporary instance
→ files: borg extract to temporary directory
4. VERIFY restoration:
→ schema matches production? (diff pg_dump --schema-only)
→ row counts match expected? (SELECT count(*) for key tables)
→ recent data present? (check last INSERT timestamps)
→ application connects and functions? (basic smoke test)
5. MEASURE metrics:
→ RTO_actual: time from start to verified restoration
→ RPO_actual: gap between backup time and latest data in backup
6. DOCUMENT results (restore test report template)
7. CLEANUP:
TOOL: kubectl
RUN: kubectl delete namespace restore-test-{date}
8. REPORT to amber for compliance evidence
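The environment setup, restore, cleanup, and RTO measurement from the steps above can be sketched like this; the temporary database name is hypothetical, and DRY_RUN=1 prints commands instead of executing them.

```shell
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

restore_test() {  # date backup_file
  local ns="restore-test-$1" start end
  start=$(date +%s)
  run kubectl create namespace "$ns"
  run pg_restore --dbname=restore_test_db "$2"   # isolated instance, never production
  run kubectl delete namespace "$ns"
  end=$(date +%s)
  echo "RTO_actual_seconds=$((end - start))"     # feeds the report template below
}
```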
RESTORE_TEST_REPORT¶
# Restore Test — {project}
TESTED_BY: otto
DATE: {ISO timestamp}
## TARGET
database: {name}
backup_date: {date}
backup_size: {size}
## METRICS
RTO_measured: {minutes} (target: <240min)
RPO_measured: {hours} (target: <24h)
## VERIFICATION
- schema matches: {yes/no}
- row count: {expected} / {actual}
- recent data within RPO: {yes/no}
- application connected: {yes/no}
## RESULT: PASS | FAIL
ISSUES: {none | details}
BACKUP:RTO_RPO_DEFINITIONS¶
RPO (Recovery Point Objective)¶
DEFINITION: maximum acceptable data loss measured in time
MEANING: "how much data can we afford to lose?"
| Tier | RPO | Backup Method |
|---|---|---|
| Critical | < 1 hour | WAL archiving + streaming replication |
| Standard | < 24 hours | Daily pg_dump |
| Low | < 72 hours | Daily file backup with weekly retention |
RTO (Recovery Time Objective)¶
DEFINITION: maximum acceptable time to restore service
MEANING: "how long can the service be down?"
| Tier | RTO | Recovery Method |
|---|---|---|
| Critical | < 1 hour | Standby promotion + DNS failover |
| Standard | < 4 hours | Restore from backup + redeploy |
| Low | < 24 hours | Full rebuild from Terraform + backup restore |
PER_SERVICE_TARGETS¶
| Service | RPO | RTO | Rationale |
|---|---|---|---|
| Client production DB | 1 hour | 1 hour | Revenue-generating, client SLA |
| GE platform (admin-ui, orchestrator) | 24 hours | 4 hours | Internal, can tolerate brief outage |
| Wiki brain | 24 hours | 4 hours | Knowledge store, git-backed |
| Vault | 24 hours | 1 hour | Secrets needed for all other recoveries |
| Redis | N/A (ephemeral) | 15 min | Streams rebuilt from source, restart is recovery |
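Checking a measured RPO against its tier target reduces to a time-gap comparison (a sketch; epochs in seconds):

```shell
# 0 (true) if the data written since the last backup is within the
# tier's RPO target — i.e. the worst-case loss is acceptable.
rpo_ok() {  # last_backup_epoch now_epoch target_hours
  local gap=$(( $2 - $1 ))
  [ "$gap" -le $(( $3 * 3600 )) ]
}
```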
BACKUP:BCP_TEMPLATE¶
BUSINESS_CONTINUITY_PLAN¶
# Business Continuity Plan — {project}
VERSION: {X.Y}
REVIEWED_BY: otto
DATE: {ISO timestamp}
NEXT_REVIEW: {date + 3 months}
## 1. SCOPE
project: {project}
client: {client}
systems: {list of systems covered}
## 2. RECOVERY_OBJECTIVES
RPO: {hours}
RTO: {hours}
## 3. BACKUP_INVENTORY
| data | method | frequency | retention | storage |
|---|---|---|---|---|
| PostgreSQL | pg_dump | daily 2am | 30d/12m | local + offsite |
| WAL | archiving | continuous | 7 days | local |
| files | borg | daily 3am | 30d/12m | local + offsite |
| config | git | continuous | infinite | git |
| Vault | raft snapshot | daily 1:30am | 30d | local + offsite |
## 4. RECOVERY_PROCEDURES
### 4a. Database Recovery
1. identify backup
2. decrypt + decompress
3. pg_restore to target
4. verify data integrity
5. reconnect application
### 4b. Full DR Failover
1. HALT — escalate to Dirk-Jan
2. Mira activates DR protocol
3. Otto verifies backups in DR zone (nl-ams1)
4. Arjan activates DR infrastructure
5. Boris promotes standby / restores backup
6. Rutger deploys workloads to DR cluster
7. Stef updates DNS to DR endpoints
8. Karel updates CDN origin
9. Verify + confirm recovery
## 5. CONTACTS
| role | agent | escalation |
|---|---|---|
| incident commander | mira | immediate |
| infrastructure | arjan | immediate |
| DBA | boris | immediate |
| human | dirk-jan | within 15 min |
## 6. TEST_HISTORY
| date | type | result | rto_actual | rpo_actual |
|---|---|---|---|---|
| {date} | restore test | pass | {X}h | {X}h |
BACKUP:DR_DRILL_SCHEDULE¶
DRILL_TYPES¶
| Drill Type | Frequency | Duration | Scope |
|---|---|---|---|
| Backup restore test | Monthly | 2 hours | Single database restore |
| Partial DR failover | Quarterly | 4 hours | Single client to DR zone |
| Full DR failover | Annually | 8 hours | Complete platform to DR zone |
| BCP tabletop exercise | Quarterly | 2 hours | Walk-through without execution |
DRILL_PROCEDURE¶
PRE_DRILL:
1. Otto schedules drill (minimum 1 week notice)
2. Notify affected agents (mira, arjan, boris, rutger, stef, karel)
3. Prepare test environment (isolated from production)
4. Document expected outcomes
DURING_DRILL:
1. Otto initiates drill scenario
2. Each agent executes their recovery role
3. All actions timestamped (same as real incident)
4. Measure actual RTO and RPO
POST_DRILL:
1. Compare actual vs target RTO/RPO
2. Document issues discovered
3. Update BCP based on findings
4. Report to amber for compliance evidence
5. Schedule follow-up for any failed steps
BACKUP:ISO_27001_MAPPING¶
| ISO Control | GE Implementation | Evidence |
|---|---|---|
| A.8.13 Information backup | 3-2-1 rule, encrypted, retention schedule | Backup logs, 3-2-1 check reports |
| A.5.29 Information security during disruption | BCP per client, DR zone ready | BCP documents, DR drill reports |
| A.5.30 ICT readiness for business continuity | Regular DR drills, restore testing | Drill reports, restore test reports |
| A.8.14 Redundancy of information processing facilities | Multi-zone architecture (de-fra1 + nl-ams1) | Terraform state, zone configuration |
BACKUP:AGENT_WORKFLOW¶
FOR_OTTO¶
ON_BACKUP_TASK:
1. READ this page for backup standards
2. CHECK 3-2-1 compliance weekly
3. EXECUTE restore tests monthly
4. MAINTAIN BCP documents per client
5. COORDINATE DR drills per schedule
6. REPORT to amber for compliance evidence
7. NEVER restore directly to production — always to isolated environment first
FOR_GERCO (local storage)¶
ON_STORAGE_REQUEST:
1. PROVISION local backup storage on fort-knox-dev
2. MONITOR disk space and health
3. ALERT otto if storage approaching capacity
4. MAINTAIN backup directory structure
FOR_ARJAN (offsite storage)¶
ON_OFFSITE_REQUEST:
1. PROVISION UpCloud Object Storage via Terraform
2. CONFIGURE retention policies
3. VERIFY offsite replication working
4. EU-only zones (data sovereignty)
BACKUP:ANTI_PATTERNS¶
BEFORE_EVERY_BACKUP_ACTION:
1. Am I storing backup encryption keys in the same place as backups? (NEVER)
2. Am I skipping backup verification? (NEVER — unverified backup = no backup)
3. Am I restoring directly to production? (NEVER — isolated environment first)
4. Am I deleting the last remaining backup? (NEVER)
5. Am I backing up to the same physical location? (NEVER — offsite required)
6. Am I storing backups outside EU? (NEVER — data sovereignty)
7. Am I skipping monthly restore tests? (NEVER — ISO 27001 requires evidence)
8. Am I keeping backups longer than retention policy? (check GDPR — data minimization)
BACKUP:CROSS_REFERENCES¶
KUBERNETES_OPERATIONS: domains/infrastructure/kubernetes-operations.md — etcd backup, pod recovery
TERRAFORM_PATTERNS: domains/infrastructure/terraform-patterns.md — storage provisioning
DEPLOYMENT_STRATEGIES: domains/infrastructure/deployment-strategies.md — rollback as recovery
INCIDENT_RESPONSE: domains/incident-response/index.md — DR activation during incidents
COMPLIANCE: domains/compliance-frameworks/index.md — ISO 27001 backup controls
DATABASE: domains/database/index.md — PostgreSQL-specific backup patterns