DOMAIN:DEVOPS:DORA_METRICS

OWNER: marta (Team Alfa), iwona (Team Bravo)
UPDATED: 2026-03-24
SCOPE: measuring and improving software delivery performance
AGENTS: marta/iwona (tracking), faye/sytske (PM reporting), eltjo (trending)
SOURCE: DORA (DevOps Research and Assessment), Accelerate (Forsgren, Humble, Kim)


DORA:OVERVIEW

WHAT: four key metrics that predict software delivery performance and organizational outcomes
WHY: GE must prove to clients that agentic development delivers enterprise-grade quality
PRINCIPLE: metrics are for improvement, NEVER for punishment
PRINCIPLE: measure the system, not the individual agent

THE_FOUR_METRICS:
1. Deployment Frequency (DF) — how often we deploy to production
2. Lead Time for Changes (LT) — time from commit to production
3. Change Failure Rate (CFR) — percentage of deployments causing failures
4. Time to Restore Service (TTRS) — time to recover from production failure


DORA:BENCHMARKS

INDUSTRY_CLASSIFICATIONS

SOURCE: DORA State of DevOps Report (2023-2024)

Metric                   Elite                     High              Medium              Low
Deployment Frequency     on-demand (multiple/day)  daily to weekly   weekly to monthly   less than monthly
Lead Time for Changes    < 1 hour                  1 hour to 1 week  1 week to 6 months  > 6 months
Change Failure Rate      0-5%                      5-10%             10-15%              > 15%
Time to Restore Service  < 1 hour                  1 hour to 1 day   1 day to 1 week     > 1 week

GE_TARGETS

TARGET_TIER: Elite
RATIONALE: agentic development eliminates human bottlenecks (code review queues, meeting-driven decisions)
REALISTIC_INITIAL_TARGET: High (first 3 months per client), then Elite

Metric                   GE Target         Measurement
Deployment Frequency     multiple per day  count of production deploys per day
Lead Time for Changes    < 4 hours         median time from first commit to production
Change Failure Rate      < 5%              deploys causing an incident / total deploys
Time to Restore Service  < 30 minutes      median incident duration

DORA:DEPLOYMENT_FREQUENCY

DEFINITION

WHAT: how often the team deploys code to production
COUNTS: intentional releases (not config changes, not rollbacks)
EXCLUDES: staging deploys, feature flag toggles, infrastructure changes

HOW_TO_MEASURE

SOURCE: GitHub Actions workflow runs for deploy-production.yml
QUERY: count of successful production deploy workflow completions per time period

# Deployment frequency for the last 30 days (GNU date)
# NOTE: with --paginate, gh applies the -q filter to each page separately,
# so the per-page counts must be summed afterwards.
SINCE=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)
gh api repos/{owner}/{repo}/actions/workflows/deploy-production.yml/runs \
  --paginate \
  -q "[.workflow_runs[] | select(.conclusion == \"success\" and .created_at > \"$SINCE\")] | length" \
  | awk '{total += $1} END {print total}'

ALTERNATIVE: count of commits to production branch (if using GitOps)
ALTERNATIVE: count of Kubernetes rollout events (if tracking via k8s)

TRACKING_SCHEMA

CREATE TABLE deployment_log (
  id SERIAL PRIMARY KEY,
  project_id INTEGER NOT NULL,
  deployed_at TIMESTAMPTZ NOT NULL,
  version TEXT NOT NULL,
  commit_sha TEXT NOT NULL,
  deployer TEXT NOT NULL,           -- agent name or 'manual'
  environment TEXT NOT NULL,        -- 'production' | 'staging'
  success BOOLEAN NOT NULL,         -- deploy pipeline completed successfully
  is_failure BOOLEAN NOT NULL DEFAULT FALSE,  -- caused a production failure; labeled retroactively (see LABELING_FAILURES)
  rollback_of INTEGER REFERENCES deployment_log(id),
  notes TEXT
);

INTERPRETATION

HIGH_FREQUENCY (multiple/day): fast feedback, small batches, low risk per deploy
LOW_FREQUENCY (monthly+): large batches, high risk, integration pain

IF frequency drops THEN investigate:
- are PRs stuck in review? (merge gate too strict?)
- is CI too slow? (build time > 10 min?)
- is staging broken? (environment flakiness?)
- are agents blocked on discussions? (consensus bottleneck?)
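Given deploy timestamps from deployment_log (or the workflow query above), the frequency calculation itself is trivial; a minimal Python sketch (function name and shape are illustrative, not part of the pipeline):

```python
from datetime import datetime, timedelta, timezone

def deployment_frequency(deploy_times, window_days=30, now=None):
    """Deploys per day over a trailing window.

    deploy_times: datetimes of successful production deploys.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in deploy_times if t > cutoff]
    return len(recent) / window_days
```

A result >= 1.0 puts the project in the elite band for this metric.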


DORA:LEAD_TIME_FOR_CHANGES

DEFINITION

WHAT: time from a developer's first commit to that code running in production
STARTS: first commit on the feature branch
ENDS: production deployment containing that commit
MEASURES: the efficiency of the entire delivery pipeline

HOW_TO_MEASURE

METHOD: for each production deploy, find the oldest undeployed commit

# Lead time for most recent production deploy
# Only consider successful runs — a failed run's head_sha never reached production
RUNS='repos/{owner}/{repo}/actions/workflows/deploy-production.yml/runs'
DEPLOY_SHA=$(gh api "$RUNS" -q '[.workflow_runs[] | select(.conclusion == "success")][0].head_sha')
DEPLOY_TIME=$(gh api "$RUNS" -q '[.workflow_runs[] | select(.conclusion == "success")][0].created_at')

# Previous successful deploy, to find which commits are new in this one
PREV_SHA=$(gh api "$RUNS" -q '[.workflow_runs[] | select(.conclusion == "success")][1].head_sha')

# git log lists newest first, so tail -1 is the oldest commit in this deploy
OLDEST_COMMIT_TIME=$(git log --format='%aI' "${PREV_SHA}..${DEPLOY_SHA}" | tail -1)

echo "Lead time: from $OLDEST_COMMIT_TIME to $DEPLOY_TIME"

BREAKDOWN

Lead time is decomposed into phases for diagnosis:

PHASE: coding_time
DEFINITION: first commit to PR opened
OWNER: development agents
TARGET: < 2 hours

PHASE: review_time
DEFINITION: PR opened to PR approved
OWNER: marta/iwona (merge gate), koen (quality)
TARGET: < 1 hour

PHASE: merge_queue_time
DEFINITION: PR approved to merged
OWNER: GitHub merge queue
TARGET: < 30 minutes

PHASE: deploy_pipeline_time
DEFINITION: merged to running in production
OWNER: alex/tjitte (CI/CD)
TARGET: < 30 minutes

TOTAL_TARGET: < 4 hours end-to-end
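The phase decomposition can be computed directly from PR and deploy event timestamps; a Python sketch, assuming a simple dict of ISO-8601 timestamps per change (field names are illustrative):

```python
from datetime import datetime

# Phase targets in hours, mirroring the breakdown above
PHASE_TARGETS_H = {"coding": 2.0, "review": 1.0, "merge_queue": 0.5, "deploy": 0.5}

def lead_time_phases(events):
    """Decompose lead time into the four phases.

    events: dict with ISO timestamps for first_commit, pr_opened,
    approved, merged, deployed. Returns ({phase: hours}, bottleneck),
    where bottleneck is the phase worst relative to its target.
    """
    order = ["first_commit", "pr_opened", "approved", "merged", "deployed"]
    ts = [datetime.fromisoformat(events[k]) for k in order]
    hours = [(b - a).total_seconds() / 3600 for a, b in zip(ts, ts[1:])]
    phases = dict(zip(PHASE_TARGETS_H, hours))
    bottleneck = max(phases, key=lambda p: phases[p] / PHASE_TARGETS_H[p])
    return phases, bottleneck
```

The bottleneck phase feeds directly into the OPTIMIZATION rules below it is compared against.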

OPTIMIZATION

IF coding_time high THEN specs unclear (escalate to Anna/Aimee)
IF review_time high THEN merge gate bottleneck (add reviewer capacity or automate more)
IF merge_queue_time high THEN too many PRs queued (increase parallelism or split teams)
IF deploy_pipeline_time high THEN CI/CD slow (optimize caching, parallelize jobs)


DORA:CHANGE_FAILURE_RATE

DEFINITION

WHAT: percentage of production deployments that cause a failure
FAILURE: requires hotfix, rollback, or incident response
FORMULA: (failed_deploys / total_deploys) * 100

COUNTS_AS_FAILURE:
- deployment causes user-facing error
- deployment requires rollback
- deployment triggers incident (any severity)
- deployment causes performance degradation > 20%

NOT_A_FAILURE:
- planned maintenance downtime
- config change that is quickly corrected
- cosmetic issue with no functional impact

HOW_TO_MEASURE

METHOD: track which deployments are followed by a revert or incident within 7 days

-- Change Failure Rate for last 30 days
SELECT
  COUNT(*) FILTER (WHERE is_failure) AS failed_deploys,
  COUNT(*) AS total_deploys,
  ROUND(
    COUNT(*) FILTER (WHERE is_failure)::numeric / COUNT(*)::numeric * 100, 1
  ) AS cfr_percent
FROM deployment_log
WHERE environment = 'production'
  AND deployed_at > NOW() - INTERVAL '30 days';

LABELING_FAILURES

METHOD: marta/iwona labels deploys as failures retroactively
TRIGGER: any of:
- rollback deploy created (deployment_log.rollback_of is not null)
- incident created referencing deploy commit
- hotfix PR created within 7 days referencing same area of code
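The three triggers can be applied mechanically once rollbacks, incidents, and hotfix PRs are linked to deploys; a hedged Python sketch (all record shapes here are hypothetical, not a fixed schema):

```python
from datetime import datetime, timedelta

def label_failure(deploy, rollback_targets, incident_commits, hotfix_prs,
                  window_days=7):
    """Return True if any retroactive failure trigger fires for one deploy.

    deploy: {'id', 'commit_sha', 'deployed_at' (datetime)}
    rollback_targets: set of deploy ids that a later rollback points at
    incident_commits: set of commit SHAs referenced by incidents
    hotfix_prs: [{'created_at' (datetime), 'same_area' (bool)}]
    """
    if deploy["id"] in rollback_targets:           # trigger 1: rolled back
        return True
    if deploy["commit_sha"] in incident_commits:   # trigger 2: incident
        return True
    window = timedelta(days=window_days)
    return any(                                    # trigger 3: hotfix within window
        pr["same_area"]
        and timedelta(0) <= pr["created_at"] - deploy["deployed_at"] <= window
        for pr in hotfix_prs
    )
```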

REDUCTION_STRATEGIES

STRATEGY: increase test coverage on changed code
MEASURE: require >= 80% coverage on lines changed in the PR (diff coverage)

STRATEGY: expand e2e test scenarios
MEASURE: Playwright tests cover all critical user paths

STRATEGY: canary deployments
MEASURE: deploy to 10% traffic first, monitor, then full rollout

STRATEGY: feature flags
MEASURE: ship code behind flag, enable incrementally

STRATEGY: pre-production smoke tests
MEASURE: automated health check after staging deploy


DORA:TIME_TO_RESTORE_SERVICE

DEFINITION

WHAT: time from production failure detection to service restoration
STARTS: first alert or user report
ENDS: service confirmed healthy
MEASURES: incident response effectiveness

HOW_TO_MEASURE

SOURCE: incident tracking (GitHub Issues with incident label)

-- TTRS for last 90 days
SELECT
  AVG(resolved_at - detected_at) AS avg_ttrs,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY resolved_at - detected_at) AS median_ttrs,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY resolved_at - detected_at) AS p95_ttrs
FROM incidents
WHERE detected_at > NOW() - INTERVAL '90 days'
  AND resolved_at IS NOT NULL;

INCIDENT_SEVERITY_AND_TTRS

Severity         Definition                TTRS Target
SEV1 (critical)  service completely down   < 15 minutes
SEV2 (major)     major feature broken      < 30 minutes
SEV3 (minor)     degraded but functional   < 4 hours
SEV4 (low)       cosmetic or minor impact  < 24 hours
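Those targets can be checked mechanically against the incident log; a small Python sketch (the tuple shape is illustrative):

```python
# TTRS targets in minutes per severity, from the table above
TTRS_TARGETS_MIN = {"SEV1": 15, "SEV2": 30, "SEV3": 240, "SEV4": 1440}

def ttrs_breaches(incidents):
    """Return (severity, minutes) for incidents that missed their target.

    incidents: iterable of (severity, detected_at, resolved_at) tuples.
    """
    breaches = []
    for sev, detected, resolved in incidents:
        minutes = (resolved - detected).total_seconds() / 60
        if minutes > TTRS_TARGETS_MIN[sev]:
            breaches.append((sev, round(minutes)))
    return breaches
```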

OPTIMIZATION

STRATEGY: automated rollback on health check failure
IMPACT: reduces TTRS for deployment-caused failures to < 5 minutes

STRATEGY: runbooks for common failures
IMPACT: agents follow documented steps instead of investigating from scratch

STRATEGY: observability (structured logging, error tracking)
IMPACT: faster root cause identification

STRATEGY: incident response drills
IMPACT: agents practice recovery procedures before real incidents

SEE: domains/incident-response/ for detailed procedures


DORA:PER_PROJECT_TRACKING

DATA_MODEL

project
  ├── deployment_log[]      (all deploys)
  ├── incidents[]           (all production incidents)
  ├── pull_requests[]       (all merged PRs)
  └── dora_snapshots[]      (weekly aggregated metrics)

WEEKLY_SNAPSHOT

CREATE TABLE dora_snapshots (
  id SERIAL PRIMARY KEY,
  project_id INTEGER NOT NULL,
  week_start DATE NOT NULL,
  deployment_frequency NUMERIC,     -- deploys per day
  lead_time_hours NUMERIC,          -- median hours
  change_failure_rate NUMERIC,      -- percentage
  time_to_restore_minutes NUMERIC,  -- median minutes
  tier TEXT,                        -- 'elite' | 'high' | 'medium' | 'low'
  created_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(project_id, week_start)
);

TIER_CALCULATION

FUNCTION calculate_tier(df, lt, cfr, ttrs):
  scores = []
  scores.append('elite' if df >= 1 else 'high' if df >= 0.25 else 'medium' if df >= 0.033 else 'low')
  scores.append('elite' if lt < 1 else 'high' if lt < 168 else 'medium' if lt < 4320 else 'low')
  scores.append('elite' if cfr <= 5 else 'high' if cfr <= 10 else 'medium' if cfr <= 15 else 'low')
  scores.append('elite' if ttrs < 60 else 'high' if ttrs < 1440 else 'medium' if ttrs < 10080 else 'low')
  RETURN lowest of scores (weakest link determines tier)

RATIONALE: one weak metric pulls the whole tier down
ANALOGY: a chain is only as strong as its weakest link


DORA:TRENDING

VISUALIZATION

CHART: line chart per metric over time (weekly data points)
CHART: tier badge with color (elite=green, high=blue, medium=yellow, low=red)
CHART: metric breakdown by phase (coding, review, queue, deploy)

TREND_ALERTS

ALERT: metric degrades by one tier for 2 consecutive weeks
ACTION: eltjo flags in monitoring report, marta/iwona investigate

ALERT: change failure rate crosses 15%
ACTION: automatic freeze on production deploys until root cause found

ALERT: lead time exceeds 1 week
ACTION: faye/sytske reviews pipeline bottlenecks
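The first alert rule (one tier down for 2 consecutive weeks) can be evaluated straight from the dora_snapshots tier column; a Python sketch (function name is illustrative):

```python
TIER_RANK = {"low": 0, "medium": 1, "high": 2, "elite": 3}

def tier_degradation_alert(weekly_tiers, weeks=2):
    """True if the last `weeks` snapshots are all below the tier
    recorded just before them.

    weekly_tiers: tier strings from dora_snapshots, oldest first.
    """
    if len(weekly_tiers) < weeks + 1:
        return False
    baseline = TIER_RANK[weekly_tiers[-(weeks + 1)]]
    return all(TIER_RANK[t] < baseline for t in weekly_tiers[-weeks:])
```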

REPORTING_CADENCE

DAILY: deployment count, any incidents (automated)
WEEKLY: full DORA snapshot with tier classification (marta/iwona)
MONTHLY: trend report with improvement recommendations (faye/sytske)
QUARTERLY: client-facing DORA summary (margot, part of client reporting)


DORA:IMPROVEMENT_FRAMEWORK

PRINCIPLES

PRINCIPLE: metrics drive improvement, not blame
PRINCIPLE: compare against yourself (trending), not others (ranking)
PRINCIPLE: improve the weakest metric first (bottleneck theory)
PRINCIPLE: small, continuous improvements beat big-bang changes

IMPROVEMENT_CYCLE

MEASURE → DIAGNOSE → HYPOTHESIZE → EXPERIMENT → MEASURE

STEP: measure
ACTION: collect DORA snapshot for current week

STEP: diagnose
ACTION: identify weakest metric, decompose into phases

STEP: hypothesize
ACTION: propose specific change to improve bottleneck phase

STEP: experiment
ACTION: implement change for 2 weeks

STEP: measure
ACTION: compare new DORA snapshot against baseline

IF improved THEN adopt change permanently
IF no change THEN revert experiment, try different hypothesis
IF degraded THEN revert immediately
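The adopt/revert/retry decision at the end of the cycle can be expressed as a small comparison function; a sketch, assuming snapshots are dicts of the four metrics and a hypothetical 5% noise tolerance:

```python
def experiment_verdict(baseline, after, tolerance=0.05):
    """Compare a post-experiment snapshot against the baseline.

    Both dicts use keys df, lt, cfr, ttrs. df is better when higher;
    the other three are better when lower. Returns 'revert' on any
    regression, 'adopt' on improvement, 'retry' when nothing moved.
    """
    def gain(key, higher_is_better):
        change = (after[key] - baseline[key]) / baseline[key]
        return change if higher_is_better else -change

    gains = [gain("df", True), gain("lt", False),
             gain("cfr", False), gain("ttrs", False)]
    if any(g < -tolerance for g in gains):
        return "revert"  # degraded: revert immediately
    if any(g > tolerance for g in gains):
        return "adopt"   # improved with no regression
    return "retry"       # no change: try a different hypothesis
```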

COMMON_IMPROVEMENTS

Metric  Bottleneck              Improvement
DF      manual deploy approval  automate staging, reduce production approval to 1 click
DF      long CI pipeline        parallelize jobs, optimize caching
LT      slow code review        automate style checks, focus human review on logic
LT      large PRs               enforce size limits, coach agents on PR splitting
CFR     insufficient testing    increase e2e coverage, add contract tests
CFR     no canary deployments   implement progressive rollout
TTRS    no runbooks             document common failures and recovery steps
TTRS    slow detection          improve alerting, add health check endpoints

DORA:GE_SPECIFIC_CONSIDERATIONS

AGENTIC_ADVANTAGE

GE's agent-based development should naturally excel at DORA metrics:
- agents work 24/7, no handoff delays (reduces lead time)
- agents follow process consistently (reduces change failure rate)
- automated merge gate catches issues early (reduces CFR)
- agents respond to incidents immediately (reduces TTRS)

AGENTIC_RISK

RISK: agents may optimize for deployment frequency at the expense of quality
MITIGATION: CFR is a counterbalancing metric — fast but broken doesn't count

RISK: metrics gaming (splitting deploys to inflate frequency)
MITIGATION: only count deploys that include user-facing changes

RISK: low TTRS because agents auto-rollback without root-causing
MITIGATION: track both TTRS and "time to root cause" separately

CLIENT_REPORTING

WHAT: quarterly DORA summary included in client reports
WHO: margot prepares, using data from marta/iwona
FORMAT: tier badge + trend chart + key improvements made
GOAL: demonstrate to clients that GE delivers at elite/high performance

SAMPLE:

Q1 2026 — Project: ClientName
  Deployment Frequency: 2.3/day (Elite)
  Lead Time: 3.2 hours (Elite)
  Change Failure Rate: 4.1% (Elite)
  Time to Restore: 22 min (Elite)
  Overall: ELITE tier

  Improvements this quarter:
  - Reduced lead time from 8h to 3.2h by parallelizing CI jobs
  - Reduced CFR from 8% to 4.1% by adding contract tests


DORA:PITFALLS

PITFALL: measuring deployment frequency by counting ALL workflow runs
IMPACT: inflated numbers from CI reruns, staging deploys
FIX: only count successful production deploy workflows

PITFALL: measuring lead time from merge (not first commit)
IMPACT: hides coding and review bottlenecks
FIX: always measure from first commit on feature branch

PITFALL: not labeling failed deployments retroactively
IMPACT: CFR appears artificially low
FIX: marta/iwona must review all incidents and link to causing deploy

PITFALL: using DORA metrics to compare agents against each other
IMPACT: gaming, blame culture, hiding problems
RULE: DORA metrics are always per-project, never per-agent

PITFALL: measuring TTRS from when engineer starts working (not when alert fires)
IMPACT: hides detection latency
FIX: TTRS starts at first alert timestamp, not first human response

PITFALL: optimizing one metric at the expense of others
IMPACT: fast deploys with high failure rate is worse than slower, reliable deploys
RULE: always track all four together, improve weakest first