DOMAIN:INFRASTRUCTURE:BACKUP_DISASTER_RECOVERY

OWNER: otto
UPDATED: 2026-03-24
SCOPE: all backup, restore, and disaster recovery operations across GE
AGENTS: otto (primary), gerco (local storage), arjan (offsite storage), boris/yoanna (database), mira (incident commander)
ISO_27001: A.8.13 (Information backup), A.5.29-30 (Business continuity)


BACKUP:OVERVIEW

PRINCIPLE: 3-2-1 backup rule — always, everywhere, no exceptions
SSOT: PostgreSQL is the source of truth — backups protect the SSOT
ENCRYPTION: AES-256 for all backups at rest, TLS for all backups in transit
RETENTION: daily=30d, weekly=12w, monthly=12m, yearly=7y
SCHEDULE: databases 2am CET, files 3am CET, config continuous (git)
RULE: every backup MUST be restorable — untested backups are not backups


BACKUP:3_2_1_RULE

THREE COPIES

COPY_1: production (live data)
  LOCATION: production database, production filesystem
  OWNER: rutger (Zone 3 production ops)

COPY_2: local backup storage
  LOCATION: fort-knox-dev (Minisforum 790 Pro)
  OWNER: gerco (Zone 1 sysadmin)
  MEDIUM: NVMe SSD

COPY_3: offsite backup storage
  LOCATION: UpCloud Object Storage (different zone from production)
  OWNER: arjan (provisions storage via Terraform)
  MEDIUM: object storage (S3-compatible)

TWO MEDIA TYPES

TYPE_1: disk (SSD/NVMe on fort-knox-dev)
TYPE_2: object storage (UpCloud, different physical infrastructure)

ONE OFFSITE

OFFSITE: UpCloud nl-ams1 (if primary is de-fra1) or vice versa
RULE: EU only — data sovereignty compliance
RULE: offsite must be in a different physical data center than production

WEEKLY_COMPLIANCE_CHECK

TOOL: verification script
RUN: bash scripts/backup-3-2-1-check.sh
CHECK:
  → all 3 copies exist?
  → both media types accessible?
  → offsite location reachable?
  → IF any violation → alert human immediately
  → report to amber for ISO 27001 evidence
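The weekly check above can be sketched as a small bash function. This is a minimal sketch, not the real scripts/backup-3-2-1-check.sh: the directory layout and the rclone remote are assumptions taken from commands elsewhere on this page, the offsite probe is injected as a command string, and the demo runs against a throwaway fixture rather than real storage.

```shell
#!/usr/bin/env bash
# Sketch of a 3-2-1 check. check_321 DIR OFFSITE_CMD prints one line per
# check and returns the number of violations found.
check_321() {
  local dir=$1 offsite_cmd=$2 violations=0
  # COPY_2: at least one encrypted dump present on local backup storage
  if find "$dir" -name '*.gpg' 2>/dev/null | grep -q .; then
    echo "OK: local backup copy present in $dir"
  else
    echo "VIOLATION: no local backup in $dir"
    violations=$((violations + 1))
  fi
  # COPY_3: offsite object storage reachable
  if eval "$offsite_cmd" >/dev/null 2>&1; then
    echo "OK: offsite storage reachable"
  else
    echo "VIOLATION: offsite storage unreachable"
    violations=$((violations + 1))
  fi
  return "$violations"
}

# Demo on a throwaway fixture. A real run would look more like:
#   check_321 /backups "rclone lsd upcloud-s3:ge-backups" || alert the human
demo=$(mktemp -d)
touch "$demo/2026-03-24.dump.gpg"
check_321 "$demo" "true" && echo "3-2-1 compliance: PASS"
rm -rf "$demo"
```

A nonzero return count maps directly to the "alert human immediately" rule: any violation fails the script, which the scheduler can turn into an escalation.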

BACKUP:POSTGRESQL

LOGICAL_BACKUP (pg_dump)

PURPOSE: portable, human-readable, table-level restore capability
SCHEDULE: daily at 2am CET
RETENTION: 30 days daily, 12 months monthly

TOOL: pg_dump
RUN: pg_dump -h {host} -U {user} -d {database} \
  --format=custom \
  --compress=9 \
  --file=/backups/{client}/{project}/pg_dump/{date}.dump

ENCRYPT:
RUN: gpg --batch --symmetric --cipher-algo AES256 \
  --passphrase-file /vault/secrets/backup-key \
  /backups/{client}/{project}/pg_dump/{date}.dump

VERIFY (against the plain .dump — pg_restore cannot read the encrypted .gpg file):
RUN: pg_restore --list /backups/{client}/{project}/pg_dump/{date}.dump
EXPECT: table listing without errors

OFFSITE_COPY:

TOOL: rclone or s3cmd
RUN: rclone copy /backups/{client}/{project}/pg_dump/{date}.dump.gpg \
  upcloud-s3:ge-backups/{client}/{project}/pg_dump/
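The dump, verify, encrypt, and offsite steps above form one pipeline that must stop at the first failure. A minimal sketch of that control flow, with the four steps injected as command strings so the ordering is testable without a real database (in production they would be the pg_dump, pg_restore --list, gpg, and rclone commands shown above):

```shell
#!/usr/bin/env bash
# Sketch of the nightly pipeline ordering; stops at the first failure.
nightly_backup() {
  local dump=$1 verify=$2 encrypt=$3 offsite=$4
  eval "$dump"    || { echo "FAIL: dump";    return 1; }
  # verify BEFORE encrypting: pg_restore cannot read the .gpg file
  eval "$verify"  || { echo "FAIL: verify";  return 1; }
  eval "$encrypt" || { echo "FAIL: encrypt"; return 1; }
  eval "$offsite" || { echo "FAIL: offsite"; return 1; }
  echo "backup pipeline: OK"
}

# Demo with stub commands; substitute the real RUN: lines in production.
nightly_backup "true" "true" "true" "true"
```

Failing fast matters here: an encrypted, offsite copy of a dump that never passed verification would violate the "untested backups are not backups" rule.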

PHYSICAL_BACKUP (pg_basebackup)

PURPOSE: full cluster backup, faster restore for large databases
USE_WHEN: database > 10GB, or RTO requirement < 30 minutes

TOOL: pg_basebackup
RUN: pg_basebackup -h {host} -U replication -D /backups/{client}/base/{date} \
  --checkpoint=fast \
  --wal-method=stream \
  --format=tar \
  --compress=gzip

WAL_ARCHIVING (continuous archival)

PURPOSE: point-in-time recovery (PITR) — restore to any second
USE_WHEN: RPO must be < 1 hour, or data changes rapidly

POSTGRESQL_CONFIG:

archive_mode = on
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
wal_level = replica

POINT_IN_TIME_RECOVERY:

TOOL: tar + PostgreSQL recovery settings (recovery.conf was removed in PostgreSQL 12; use postgresql.conf plus a recovery.signal file)
STEPS:
1. Extract base backup into an empty data directory (pg_restore is for logical dumps, not base backups):
   RUN: tar -xzf /backups/{client}/base/{date}/base.tar.gz -C /var/lib/postgresql/data
2. Configure recovery target in postgresql.conf, then create recovery.signal:
   recovery_target_time = '{YYYY-MM-DD HH:MM:SS+00}'
   restore_command = 'cp /backups/wal/%f %p'
   RUN: touch /var/lib/postgresql/data/recovery.signal
3. Start PostgreSQL — it replays WAL up to the target time
4. VERIFY: check data matches expected state at target time

MANAGED_DATABASE_BACKUP (UpCloud)

FOR_UPCLOUD_MANAGED_DB:
- UpCloud takes automated daily backups (configured via Terraform)
- backup_hour: 2 (2am CET)
- Retention: per UpCloud plan (typically 7-30 days)
- PITR available on Business plans

SUPPLEMENT: otto's pg_dump runs IN ADDITION to UpCloud automated backups
REASON: UpCloud backups cannot be exported — need portable copies for 3-2-1 compliance


BACKUP:K8S_ETCD

ZONE_1 (k3s)

k3s uses SQLite by default (single-node), so the backup is a file copy. Copying the file while k3s is running risks a torn copy; use SQLite's online backup command instead of cp.

TOOL: sqlite3
RUN: sqlite3 /var/lib/rancher/k3s/server/db/state.db \
  ".backup /backups/k3s/etcd/state-{date}.db"
SCHEDULE: daily at 1am CET (before database backups)

ZONE_2/3 (UpCloud MKE)

UpCloud manages etcd for Managed Kubernetes clusters.
BACKUP: managed by UpCloud (included in service)
SUPPLEMENT: all k8s manifests in git (Terraform + YAML) — infrastructure is reproducible

RULE: k8s state (deployments, services, secrets) must be reproducible from git
RULE: secrets backed up separately (Vault backup, not k8s secret backup)


BACKUP:VAULT

PURPOSE: HashiCorp Vault contains all secrets — loss = catastrophic
SCHEDULE: daily at 1:30am CET (before database backups)

TOOL: vault operator raft
RUN: vault operator raft snapshot save /backups/vault/snapshot-{date}.snap
ENCRYPT: gpg encrypt with separate key (not stored in Vault itself!)
OFFSITE: copy to UpCloud Object Storage immediately

RESTORE_PROCEDURE:

TOOL: vault operator raft
RUN: vault operator raft snapshot restore /backups/vault/snapshot-{date}.snap
VERIFY: vault status
VERIFY: vault kv list secret/
WARNING: restoring Vault may invalidate active tokens — all services must re-authenticate

CRITICAL: Vault unseal keys and root token backed up separately, stored offline
CRITICAL: backup encryption key for Vault snapshots stored OUTSIDE Vault (paper + safe)


BACKUP:FILE_BACKUPS

METHOD: borg or restic (deduplicated, encrypted)

TOOL: borg
INIT (once per repository; --encryption is an option of borg init, not borg create):
RUN: borg init --encryption=repokey /backups/borg/{client}
CREATE:
RUN: borg create \
  --compression zstd,6 \
  /backups/borg/{client}::{date} \
  /data/{client}/{project}/

PRUNE (retention enforcement):
RUN: borg prune \
  --keep-daily=30 \
  --keep-weekly=12 \
  --keep-monthly=12 \
  --keep-yearly=7 \
  /backups/borg/{client}

VERIFY:

TOOL: borg
RUN: borg check /backups/borg/{client}
RUN: borg list /backups/borg/{client}
EXPECT: archive listing with dates and sizes, no errors

CONFIG_BACKUPS

METHOD: git — all Terraform, k8s manifests, application configs
SCHEDULE: continuous (every commit)
RETENTION: infinite (git history)
STORAGE: git repository (multiple remotes for redundancy)
RULE: config is code — if it's not in git, it doesn't exist


BACKUP:RETENTION_SCHEDULE

| Period | Keep | Promotion Rule |
|---|---|---|
| Daily | 30 days | every daily backup |
| Weekly | 12 weeks | Sunday backup promoted |
| Monthly | 12 months | 1st of month promoted |
| Yearly | 7 years | January 1st promoted |
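The promotion rule above can be sketched as a tiny bash helper that classifies a backup date, most specific tier first. This is illustrative only and assumes GNU date (the -d flag):

```shell
#!/usr/bin/env bash
# Classify a backup date per the retention table: January 1st -> yearly,
# 1st of any month -> monthly, Sunday -> weekly, everything else -> daily.
promotion_tier() {
  local month_day dow
  month_day=$(date -d "$1" +%m-%d)
  dow=$(date -d "$1" +%u)   # 1 = Monday ... 7 = Sunday (GNU date)
  if   [ "$month_day" = "01-01" ]; then echo yearly
  elif [ "${month_day#*-}" = "01" ]; then echo monthly
  elif [ "$dow" = "7" ]; then echo weekly
  else echo daily
  fi
}

promotion_tier 2026-01-01   # prints: yearly
promotion_tier 2026-03-22   # prints: weekly (a Sunday)
```

Checking the most specific tier first matters: January 1st is also the 1st of a month and may be a Sunday, but it must promote to yearly.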

CLEANUP_SCHEDULE: weekly Sunday 3am
RULES:
- Identify expired backups
- Verify newer backup exists before deleting expired
- NEVER delete last remaining backup
- Log all deletions for audit trail
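A minimal sketch of those cleanup rules, assuming GNU find and a flat directory of dated dumps (the layout is illustrative): an expired dump is deleted only if a newer backup exists, the last remaining backup is never deleted, and every deletion is logged.

```shell
#!/usr/bin/env bash
# prune_daily DIR KEEP_DAYS — delete dumps older than KEEP_DAYS, guarded.
prune_daily() {
  local dir=$1 keep_days=$2 f total
  find "$dir" -name '*.dump.gpg' -mtime +"$keep_days" | while read -r f; do
    total=$(find "$dir" -name '*.dump.gpg' | wc -l)
    if [ "$total" -le 1 ]; then
      echo "SKIP: $f is the last remaining backup"
    elif find "$dir" -name '*.dump.gpg' -newer "$f" | grep -q .; then
      echo "DELETE: $f"   # log line doubles as the audit trail
      rm -- "$f"
    else
      echo "SKIP: no backup newer than $f"
    fi
  done
}

# Demo on a throwaway fixture: the expired dump goes, the fresh one stays.
demo=$(mktemp -d)
touch -d "40 days ago" "$demo/old.dump.gpg"
touch "$demo/new.dump.gpg"
prune_daily "$demo" 30
rm -rf "$demo"
```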


BACKUP:RESTORE_TESTING

SCHEDULE

FREQUENCY: monthly restore test (minimum)
REQUIRED_BY: ISO 27001 A.8.13 — backup testing evidence
OWNER: otto (executes test), amber (audits evidence)

PROCEDURE

1. SELECT backup to test (random selection from last 30 days)
2. CREATE isolated test environment:
   TOOL: kubectl
   RUN: kubectl create namespace restore-test-{date}
3. RESTORE backup to test environment:
   → database: pg_restore to temporary instance
   → files: borg extract to temporary directory
4. VERIFY restoration:
   → schema matches production? (diff pg_dump --schema-only)
   → row counts match expected? (SELECT count(*) for key tables)
   → recent data present? (check last INSERT timestamps)
   → application connects and functions? (basic smoke test)
5. MEASURE metrics:
   → RTO_actual: time from start to verified restoration
   → RPO_actual: gap between backup time and latest data in backup
6. DOCUMENT results (restore test report template)
7. CLEANUP:
   TOOL: kubectl
   RUN: kubectl delete namespace restore-test-{date}
8. REPORT to amber for compliance evidence
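The schema comparison in step 4 can be sketched as a diff of two schema-only dumps. The two .sql inputs would come from pg_dump --schema-only runs against the production and restore-test hosts; the demo below uses fixture files so only the comparison logic is shown.

```shell
#!/usr/bin/env bash
# verify_schema PROD_SQL TEST_SQL — diff two schema-only dumps.
verify_schema() {
  local prod=$1 restored=$2
  if diff -u "$prod" "$restored" > /tmp/schema.diff; then
    echo "schema matches: yes"
  else
    echo "schema matches: no (see /tmp/schema.diff)"
    return 1
  fi
}

# Demo with fixtures standing in for the two pg_dump --schema-only outputs.
prod=$(mktemp) && restored=$(mktemp)
printf 'CREATE TABLE users (id int);\n' > "$prod"
printf 'CREATE TABLE users (id int);\n' > "$restored"
verify_schema "$prod" "$restored"
rm -f "$prod" "$restored"
```

The "schema matches: yes/no" line feeds straight into the VERIFICATION section of the restore test report below.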

RESTORE_TEST_REPORT

# Restore Test — {project}

TESTED_BY: otto
DATE: {ISO timestamp}

## TARGET
database: {name}
backup_date: {date}
backup_size: {size}

## METRICS
RTO_measured: {minutes} (target: <240min)
RPO_measured: {hours} (target: <24h)

## VERIFICATION
- schema matches: {yes/no}
- row count: {expected} / {actual}
- recent data within RPO: {yes/no}
- application connected: {yes/no}

## RESULT: PASS | FAIL
ISSUES: {none | details}

BACKUP:RTO_RPO_DEFINITIONS

RPO (Recovery Point Objective)

DEFINITION: maximum acceptable data loss measured in time
MEANING: "how much data can we afford to lose?"

| Tier | RPO | Backup Method |
|---|---|---|
| Critical | < 1 hour | WAL archiving + streaming replication |
| Standard | < 24 hours | daily pg_dump |
| Low | < 72 hours | daily file backup with weekly retention |

RTO (Recovery Time Objective)

DEFINITION: maximum acceptable time to restore service
MEANING: "how long can the service be down?"

| Tier | RTO | Recovery Method |
|---|---|---|
| Critical | < 1 hour | standby promotion + DNS failover |
| Standard | < 4 hours | restore from backup + redeploy |
| Low | < 24 hours | full rebuild from Terraform + backup restore |

PER_SERVICE_TARGETS

| Service | RPO | RTO | Rationale |
|---|---|---|---|
| Client production DB | 1 hour | 1 hour | revenue-generating, client SLA |
| GE platform (admin-ui, orchestrator) | 24 hours | 4 hours | internal, can tolerate brief outage |
| Wiki brain | 24 hours | 4 hours | knowledge store, git-backed |
| Vault | 24 hours | 1 hour | secrets needed for all other recoveries |
| Redis | N/A (ephemeral) | 15 min | streams rebuilt from source, restart is recovery |

BACKUP:BCP_TEMPLATE

BUSINESS_CONTINUITY_PLAN

# Business Continuity Plan — {project}

VERSION: {X.Y}
REVIEWED_BY: otto
DATE: {ISO timestamp}
NEXT_REVIEW: {date + 3 months}

## 1. SCOPE
project: {project}
client: {client}
systems: {list of systems covered}

## 2. RECOVERY_OBJECTIVES
RPO: {hours}
RTO: {hours}

## 3. BACKUP_INVENTORY
| data | method | frequency | retention | storage |
|---|---|---|---|---|
| PostgreSQL | pg_dump | daily 2am | 30d/12m | local + offsite |
| WAL | archiving | continuous | 7 days | local |
| files | borg | daily 3am | 30d/12m | local + offsite |
| config | git | continuous | infinite | git |
| Vault | raft snapshot | daily 1:30am | 30d | local + offsite |

## 4. RECOVERY_PROCEDURES

### 4a. Database Recovery
1. identify backup
2. decrypt + decompress
3. pg_restore to target
4. verify data integrity
5. reconnect application

### 4b. Full DR Failover
1. HALT — escalate to Dirk-Jan
2. Mira activates DR protocol
3. Otto verifies backups in DR zone (nl-ams1)
4. Arjan activates DR infrastructure
5. Boris promotes standby / restores backup
6. Rutger deploys workloads to DR cluster
7. Stef updates DNS to DR endpoints
8. Karel updates CDN origin
9. Verify + confirm recovery

## 5. CONTACTS
| role | agent | escalation |
|---|---|---|
| incident commander | mira | immediate |
| infrastructure | arjan | immediate |
| DBA | boris | immediate |
| human | dirk-jan | within 15 min |

## 6. TEST_HISTORY
| date | type | result | rto_actual | rpo_actual |
|---|---|---|---|---|
| {date} | restore test | pass | {X}h | {X}h |

BACKUP:DR_DRILL_SCHEDULE

DRILL_TYPES

| Drill Type | Frequency | Duration | Scope |
|---|---|---|---|
| Backup restore test | monthly | 2 hours | single database restore |
| Partial DR failover | quarterly | 4 hours | single client to DR zone |
| Full DR failover | annually | 8 hours | complete platform to DR zone |
| BCP tabletop exercise | quarterly | 2 hours | walk-through without execution |

DRILL_PROCEDURE

PRE_DRILL:
1. Otto schedules drill (minimum 1 week notice)
2. Notify affected agents (mira, arjan, boris, rutger, stef, karel)
3. Prepare test environment (isolated from production)
4. Document expected outcomes

DURING_DRILL:
1. Otto initiates drill scenario
2. Each agent executes their recovery role
3. All actions timestamped (same as real incident)
4. Measure actual RTO and RPO

POST_DRILL:
1. Compare actual vs target RTO/RPO
2. Document issues discovered
3. Update BCP based on findings
4. Report to amber for compliance evidence
5. Schedule follow-up for any failed steps

BACKUP:ISO_27001_MAPPING

| ISO Control | GE Implementation | Evidence |
|---|---|---|
| A.8.13 Information backup | 3-2-1 rule, encrypted, retention schedule | backup logs, 3-2-1 check reports |
| A.5.29 Information security during disruption | BCP per client, DR zone ready | BCP documents, DR drill reports |
| A.5.30 ICT readiness for business continuity | regular DR drills, restore testing | drill reports, restore test reports |
| A.8.14 Redundancy of information processing | multi-zone architecture (de-fra1 + nl-ams1) | Terraform state, zone configuration |

BACKUP:AGENT_WORKFLOW

FOR_OTTO

ON_BACKUP_TASK:
1. READ this page for backup standards
2. CHECK 3-2-1 compliance weekly
3. EXECUTE restore tests monthly
4. MAINTAIN BCP documents per client
5. COORDINATE DR drills per schedule
6. REPORT to amber for compliance evidence
7. NEVER restore directly to production — always to isolated environment first

FOR_GERCO (local storage)

ON_STORAGE_REQUEST:
1. PROVISION local backup storage on fort-knox-dev
2. MONITOR disk space and health
3. ALERT otto if storage approaching capacity
4. MAINTAIN backup directory structure

FOR_ARJAN (offsite storage)

ON_OFFSITE_REQUEST:
1. PROVISION UpCloud Object Storage via Terraform
2. CONFIGURE retention policies
3. VERIFY offsite replication working
4. EU-only zones (data sovereignty)


BACKUP:ANTI_PATTERNS

BEFORE_EVERY_BACKUP_ACTION:
1. Am I storing backup encryption keys in the same place as backups? (NEVER)
2. Am I skipping backup verification? (NEVER — unverified backup = no backup)
3. Am I restoring directly to production? (NEVER — isolated environment first)
4. Am I deleting the last remaining backup? (NEVER)
5. Am I backing up to the same physical location? (NEVER — offsite required)
6. Am I storing backups outside EU? (NEVER — data sovereignty)
7. Am I skipping monthly restore tests? (NEVER — ISO 27001 requires evidence)
8. Am I keeping backups longer than retention policy? (check GDPR — data minimization)


BACKUP:CROSS_REFERENCES

KUBERNETES_OPERATIONS: domains/infrastructure/kubernetes-operations.md — etcd backup, pod recovery
TERRAFORM_PATTERNS: domains/infrastructure/terraform-patterns.md — storage provisioning
DEPLOYMENT_STRATEGIES: domains/infrastructure/deployment-strategies.md — rollback as recovery
INCIDENT_RESPONSE: domains/incident-response/index.md — DR activation during incidents
COMPLIANCE: domains/compliance-frameworks/index.md — ISO 27001 backup controls
DATABASE: domains/database/index.md — PostgreSQL-specific backup patterns