DOMAIN:DEVOPS:DORA_METRICS¶
OWNER: marta (Team Alfa), iwona (Team Bravo)
UPDATED: 2026-03-24
SCOPE: measuring and improving software delivery performance
AGENTS: marta/iwona (tracking), faye/sytske (PM reporting), eltjo (trending)
SOURCE: DORA (DevOps Research and Assessment), Accelerate (Forsgren, Humble, Kim)
DORA:OVERVIEW¶
WHAT: four key metrics that predict software delivery performance and organizational outcomes
WHY: GE must prove to clients that agentic development delivers enterprise-grade quality
PRINCIPLE: metrics are for improvement, NEVER for punishment
PRINCIPLE: measure the system, not the individual agent
THE_FOUR_METRICS:
1. Deployment Frequency (DF) — how often we deploy to production
2. Lead Time for Changes (LT) — time from commit to production
3. Change Failure Rate (CFR) — percentage of deployments causing failures
4. Time to Restore Service (TTRS) — time to recover from production failure
DORA:BENCHMARKS¶
INDUSTRY_CLASSIFICATIONS¶
SOURCE: DORA State of DevOps Report (2023-2024)
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | on-demand (multiple/day) | weekly to monthly | monthly to every 6 months | fewer than once per 6 months |
| Lead Time for Changes | < 1 hour | 1 day to 1 week | 1 month to 6 months | > 6 months |
| Change Failure Rate | 0-5% | 5-10% | 10-15% | > 15% |
| Time to Restore Service | < 1 hour | < 1 day | 1 day to 1 week | > 1 week |
GE_TARGETS¶
TARGET_TIER: Elite
RATIONALE: agentic development eliminates human bottlenecks (code review queues, meeting-driven decisions)
REALISTIC_INITIAL_TARGET: High (first 3 months per client), then Elite
| Metric | GE Target | Measurement |
|---|---|---|
| Deployment Frequency | multiple per day | count of production deploys per day |
| Lead Time for Changes | < 4 hours | median time from first commit to production |
| Change Failure Rate | < 5% | deploys causing incident / total deploys |
| Time to Restore Service | < 30 minutes | median incident duration |
DORA:DEPLOYMENT_FREQUENCY¶
DEFINITION¶
WHAT: how often the team deploys code to production
COUNTS: intentional releases (not config changes, not rollbacks)
EXCLUDES: staging deploys, feature flag toggles, infrastructure changes
HOW_TO_MEASURE¶
SOURCE: GitHub Actions workflow runs for deploy-production.yml
QUERY: count of successful production deploy workflow completions per time period
# Deployment frequency for last 30 days (cutoff hardcoded: 30 days before 2026-03-24)
# NOTE: with --paginate, gh applies the --jq filter to each page separately, so a
# per-page `length` prints one count per page; emit one line per matching run and sum instead
gh api repos/{owner}/{repo}/actions/workflows/deploy-production.yml/runs \
  --paginate \
  --jq '.workflow_runs[] | select(.conclusion == "success" and .created_at > "2026-02-24") | .id' \
  | wc -l
ALTERNATIVE: count of commits to production branch (if using GitOps)
ALTERNATIVE: count of Kubernetes rollout events (if tracking via k8s)
TRACKING_SCHEMA¶
CREATE TABLE deployment_log (
id SERIAL PRIMARY KEY,
project_id INTEGER NOT NULL,
deployed_at TIMESTAMPTZ NOT NULL,
version TEXT NOT NULL,
commit_sha TEXT NOT NULL,
deployer TEXT NOT NULL, -- agent name or 'manual'
environment TEXT NOT NULL, -- 'production' | 'staging'
success BOOLEAN NOT NULL,
rollback_of INTEGER REFERENCES deployment_log(id),
notes TEXT
);
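The frequency computation over this schema can be sketched in plain Python. The sample rows below are hypothetical stand-ins for deployment_log records; only successful production deploys inside the trailing window count, matching the COUNTS/EXCLUDES rules above.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows mirroring deployment_log
deploys = [
    {"deployed_at": datetime(2026, 3, 20, 9), "environment": "production", "success": True},
    {"deployed_at": datetime(2026, 3, 20, 15), "environment": "production", "success": True},
    {"deployed_at": datetime(2026, 3, 21, 11), "environment": "staging", "success": True},
    {"deployed_at": datetime(2026, 3, 22, 10), "environment": "production", "success": False},
]

def deployment_frequency(rows, days=30, now=datetime(2026, 3, 24)):
    """Successful production deploys per day over the trailing window."""
    cutoff = now - timedelta(days=days)
    count = sum(
        1 for r in rows
        if r["environment"] == "production" and r["success"] and r["deployed_at"] > cutoff
    )
    return count / days

print(deployment_frequency(deploys))  # 2 qualifying deploys / 30 days
```

Staging deploys and failed runs are filtered out, so rollbacks and CI noise never inflate the number.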
INTERPRETATION¶
HIGH_FREQUENCY (multiple/day): fast feedback, small batches, low risk per deploy
LOW_FREQUENCY (monthly+): large batches, high risk, integration pain
IF frequency drops THEN investigate:
- are PRs stuck in review? (merge gate too strict?)
- is CI too slow? (build time > 10 min?)
- is staging broken? (environment flakiness?)
- are agents blocked on discussions? (consensus bottleneck?)
DORA:LEAD_TIME_FOR_CHANGES¶
DEFINITION¶
WHAT: time from a developer's first commit to that code running in production
STARTS: first commit on the feature branch
ENDS: production deployment containing that commit
MEASURES: the efficiency of the entire delivery pipeline
HOW_TO_MEASURE¶
METHOD: for each production deploy, find the oldest commit it ships that was not in the previous deploy
# Lead time for the most recent successful production deploy
RUNS="repos/{owner}/{repo}/actions/workflows/deploy-production.yml/runs?status=success"
DEPLOY_SHA=$(gh api "$RUNS" --jq '.workflow_runs[0].head_sha')
DEPLOY_TIME=$(gh api "$RUNS" --jq '.workflow_runs[0].created_at')
# Boundary: the previous successful deploy's head commit
PREV_SHA=$(gh api "$RUNS" --jq '.workflow_runs[1].head_sha')
# git log lists newest first, so the last line is the oldest commit unique to this deploy
OLDEST_COMMIT_TIME=$(git log --format='%aI' "${PREV_SHA}..${DEPLOY_SHA}" | tail -1)
echo "Lead time: from $OLDEST_COMMIT_TIME to $DEPLOY_TIME"
BREAKDOWN¶
Lead time is decomposed into phases for diagnosis:
PHASE: coding_time
DEFINITION: first commit to PR opened
OWNER: development agents
TARGET: < 2 hours
PHASE: review_time
DEFINITION: PR opened to PR approved
OWNER: marta/iwona (merge gate), koen (quality)
TARGET: < 1 hour
PHASE: merge_queue_time
DEFINITION: PR approved to merged
OWNER: GitHub merge queue
TARGET: < 30 minutes
PHASE: deploy_pipeline_time
DEFINITION: merged to running in production
OWNER: alex/tjitte (CI/CD)
TARGET: < 30 minutes
TOTAL_TARGET: < 4 hours end-to-end
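The phase breakdown above can be checked mechanically: sum the phases for total lead time and flag any phase over its target. The phase keys and sample durations below are hypothetical; the target values come from the breakdown above.

```python
# Per-phase targets in minutes, taken from the breakdown above
TARGETS_MIN = {"coding": 120, "review": 60, "merge_queue": 30, "deploy_pipeline": 30}

def lead_time_report(phases_min):
    """Return (total lead time in hours, phases exceeding their target)."""
    over = [p for p, mins in phases_min.items() if mins > TARGETS_MIN[p]]
    total_hours = sum(phases_min.values()) / 60
    return total_hours, over

total, over = lead_time_report(
    {"coding": 95, "review": 140, "merge_queue": 10, "deploy_pipeline": 25}
)
print(total, over)  # 4.5 hours total; "review" is over its 1-hour target
```

Here the change misses the 4-hour end-to-end target, and the report points straight at the review phase as the bottleneck to diagnose.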
OPTIMIZATION¶
IF coding_time high THEN specs unclear (escalate to Anna/Aimee)
IF review_time high THEN merge gate bottleneck (add reviewer capacity or automate more)
IF merge_queue_time high THEN too many PRs queued (increase parallelism or split teams)
IF deploy_pipeline_time high THEN CI/CD slow (optimize caching, parallelize jobs)
DORA:CHANGE_FAILURE_RATE¶
DEFINITION¶
WHAT: percentage of production deployments that cause a failure
FAILURE: requires hotfix, rollback, or incident response
FORMULA: (failed_deploys / total_deploys) * 100
COUNTS_AS_FAILURE:
- deployment causes user-facing error
- deployment requires rollback
- deployment triggers incident (any severity)
- deployment causes performance degradation > 20%
NOT_A_FAILURE:
- planned maintenance downtime
- config change that is quickly corrected
- cosmetic issue with no functional impact
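The CFR formula is a straight ratio; a minimal sketch with hypothetical counts:

```python
def change_failure_rate(failed_deploys, total_deploys):
    """CFR as a percentage; undefined when there are no deploys in the window."""
    if total_deploys == 0:
        raise ValueError("no deploys in window")
    return failed_deploys / total_deploys * 100

# e.g. 2 failures across 47 production deploys
print(round(change_failure_rate(2, 47), 1))  # 4.3 -> within the Elite band (0-5%)
```

Guarding the zero-deploy case matters: a week with no deploys has no CFR, and reporting it as 0% would flatter the trend line.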
HOW_TO_MEASURE¶
METHOD: track which deployments are followed by a revert or incident within 7 days
-- Change Failure Rate for last 30 days
SELECT
COUNT(*) FILTER (WHERE is_failure) AS failed_deploys,
COUNT(*) AS total_deploys,
ROUND(
COUNT(*) FILTER (WHERE is_failure)::numeric / COUNT(*)::numeric * 100, 1
) AS cfr_percent
FROM deployment_log
WHERE environment = 'production'
AND deployed_at > NOW() - INTERVAL '30 days';
LABELING_FAILURES¶
METHOD: marta/iwona labels deploys as failures retroactively
TRIGGER: any of:
- rollback deploy created (deployment_log.rollback_of is not null)
- incident created referencing deploy commit
- hotfix PR created within 7 days referencing same area of code
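The first trigger can be derived mechanically from deployment_log: any deploy referenced by a later row's rollback_of is a failure. A sketch over hypothetical rows (the incident and hotfix triggers still need marta/iwona's judgment and are not covered here):

```python
def label_rollback_failures(rows):
    """Mark a deploy as a failure if a later deploy rolls it back (rollback_of -> id)."""
    rolled_back = {r["rollback_of"] for r in rows if r.get("rollback_of") is not None}
    return {r["id"]: r["id"] in rolled_back for r in rows}

deploys = [
    {"id": 1, "rollback_of": None},
    {"id": 2, "rollback_of": None},
    {"id": 3, "rollback_of": 2},  # deploy 3 rolls back deploy 2
]
print(label_rollback_failures(deploys))  # {1: False, 2: True, 3: False}
```

Note the rollback deploy itself (id 3) is not counted as a failure; it is the recovery action for deploy 2.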
REDUCTION_STRATEGIES¶
STRATEGY: increase test coverage on changed code
MEASURE: require coverage diff >= 80% on PR
STRATEGY: expand e2e test scenarios
MEASURE: Playwright tests cover all critical user paths
STRATEGY: canary deployments
MEASURE: deploy to 10% traffic first, monitor, then full rollout
STRATEGY: feature flags
MEASURE: ship code behind flag, enable incrementally
STRATEGY: pre-production smoke tests
MEASURE: automated health check after staging deploy
DORA:TIME_TO_RESTORE_SERVICE¶
DEFINITION¶
WHAT: time from production failure detection to service restoration
STARTS: first alert or user report
ENDS: service confirmed healthy
MEASURES: incident response effectiveness
HOW_TO_MEASURE¶
SOURCE: incident tracking (GitHub Issues with incident label)
-- TTRS for last 90 days
SELECT
AVG(resolved_at - detected_at) AS avg_ttrs,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY resolved_at - detected_at) AS median_ttrs,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY resolved_at - detected_at) AS p95_ttrs
FROM incidents
WHERE detected_at > NOW() - INTERVAL '90 days'
AND resolved_at IS NOT NULL;
INCIDENT_SEVERITY_AND_TTRS¶
| Severity | Definition | TTRS Target |
|---|---|---|
| SEV1 (critical) | service completely down | < 15 minutes |
| SEV2 (major) | major feature broken | < 30 minutes |
| SEV3 (minor) | degraded but functional | < 4 hours |
| SEV4 (low) | cosmetic or minor impact | < 24 hours |
OPTIMIZATION¶
STRATEGY: automated rollback on health check failure
IMPACT: reduces TTRS for deployment-caused failures to < 5 minutes
STRATEGY: runbooks for common failures
IMPACT: agents follow documented steps instead of investigating from scratch
STRATEGY: observability (structured logging, error tracking)
IMPACT: faster root cause identification
STRATEGY: incident response drills
IMPACT: agents practice recovery procedures before real incidents
SEE: domains/incident-response/ for detailed procedures
DORA:PER_PROJECT_TRACKING¶
DATA_MODEL¶
project
├── deployment_log[] (all deploys)
├── incidents[] (all production incidents)
├── pull_requests[] (all merged PRs)
└── dora_snapshots[] (weekly aggregated metrics)
WEEKLY_SNAPSHOT¶
CREATE TABLE dora_snapshots (
id SERIAL PRIMARY KEY,
project_id INTEGER NOT NULL,
week_start DATE NOT NULL,
deployment_frequency NUMERIC, -- deploys per day
lead_time_hours NUMERIC, -- median hours
change_failure_rate NUMERIC, -- percentage
time_to_restore_minutes NUMERIC, -- median minutes
tier TEXT, -- 'elite' | 'high' | 'medium' | 'low'
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(project_id, week_start)
);
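The weekly aggregation that fills a dora_snapshots row can be sketched as follows. Input shapes are assumptions: a list of production deploys with an is_failure flag, plus pre-computed per-change lead times (hours) and per-incident restore times (minutes).

```python
from statistics import median

def weekly_snapshot(deploys, lead_times_h, restore_times_min):
    """Aggregate one week's raw data into dora_snapshots metric fields."""
    total = len(deploys)
    failures = sum(1 for d in deploys if d["is_failure"])
    return {
        "deployment_frequency": total / 7,  # deploys per day
        "lead_time_hours": median(lead_times_h) if lead_times_h else None,
        "change_failure_rate": failures / total * 100 if total else None,
        "time_to_restore_minutes": median(restore_times_min) if restore_times_min else None,
    }

snap = weekly_snapshot(
    deploys=[{"is_failure": False}] * 13 + [{"is_failure": True}],
    lead_times_h=[2.5, 3.0, 4.1],
    restore_times_min=[18, 25],
)
print(snap)
```

Medians (not means) match the measurement column in the GE targets table, and weeks without incidents or deploys store NULL rather than a misleading zero.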
TIER_CALCULATION¶
# df = deploys/day, lt = lead time in hours, cfr = percent, ttrs = minutes
# thresholds follow the benchmark table: DF high bottoms out at monthly (~1/30 per day),
# medium at every 6 months (~1/180 per day)
def calculate_tier(df, lt, cfr, ttrs):
    ORDER = ['low', 'medium', 'high', 'elite']
    scores = [
        'elite' if df >= 1 else 'high' if df >= 1/30 else 'medium' if df >= 1/180 else 'low',
        'elite' if lt < 1 else 'high' if lt < 168 else 'medium' if lt < 4320 else 'low',
        'elite' if cfr <= 5 else 'high' if cfr <= 10 else 'medium' if cfr <= 15 else 'low',
        'elite' if ttrs < 60 else 'high' if ttrs < 1440 else 'medium' if ttrs < 10080 else 'low',
    ]
    return min(scores, key=ORDER.index)  # lowest score wins: the weakest link determines the tier
RATIONALE: one weak metric pulls the whole tier down
ANALOGY: a chain is only as strong as its weakest link
DORA:TRENDING¶
VISUALIZATION¶
CHART: line chart per metric over time (weekly data points)
CHART: tier badge with color (elite=green, high=blue, medium=yellow, low=red)
CHART: metric breakdown by phase (coding, review, queue, deploy)
TREND_ALERTS¶
ALERT: metric degrades by one tier for 2 consecutive weeks
ACTION: eltjo flags in monitoring report, marta/iwona investigate
ALERT: change failure rate crosses 15%
ACTION: automatic freeze on production deploys until root cause found
ALERT: lead time exceeds 1 week
ACTION: faye/sytske reviews pipeline bottlenecks
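The first alert (one-tier degradation held for 2 consecutive weeks) can be evaluated from the tier column of dora_snapshots. One reading of the rule, sketched below, compares the two most recent weeks against the week immediately before the drop; that choice of baseline is an assumption.

```python
TIER_RANK = {"low": 0, "medium": 1, "high": 2, "elite": 3}

def tier_degradation_alert(weekly_tiers):
    """True when the last 2 weeks both sit below the tier of the week before them.
    weekly_tiers: tier strings ordered oldest -> newest."""
    if len(weekly_tiers) < 3:
        return False
    baseline, recent = weekly_tiers[-3], weekly_tiers[-2:]
    return all(TIER_RANK[t] < TIER_RANK[baseline] for t in recent)

print(tier_degradation_alert(["elite", "elite", "high", "high"]))   # True: two weeks below elite
print(tier_degradation_alert(["elite", "high", "elite", "high"]))   # False: drop not consecutive
```

A single bad week never fires the alert, which keeps eltjo's monitoring report free of one-off noise.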
REPORTING_CADENCE¶
DAILY: deployment count, any incidents (automated)
WEEKLY: full DORA snapshot with tier classification (marta/iwona)
MONTHLY: trend report with improvement recommendations (faye/sytske)
QUARTERLY: client-facing DORA summary (margot, part of client reporting)
DORA:IMPROVEMENT_FRAMEWORK¶
PRINCIPLES¶
PRINCIPLE: metrics drive improvement, not blame
PRINCIPLE: compare against yourself (trending), not others (ranking)
PRINCIPLE: improve the weakest metric first (bottleneck theory)
PRINCIPLE: small, continuous improvements beat big-bang changes
IMPROVEMENT_CYCLE¶
STEP: measure
ACTION: collect DORA snapshot for current week
STEP: diagnose
ACTION: identify weakest metric, decompose into phases
STEP: hypothesize
ACTION: propose specific change to improve bottleneck phase
STEP: experiment
ACTION: implement change for 2 weeks
STEP: measure
ACTION: compare new DORA snapshot against baseline
IF improved THEN adopt change permanently
IF no change THEN revert experiment, try different hypothesis
IF degraded THEN revert immediately
COMMON_IMPROVEMENTS¶
| Metric | Bottleneck | Improvement |
|---|---|---|
| DF | manual deploy approval | automate staging, reduce production approval to 1 click |
| DF | long CI pipeline | parallelize jobs, optimize caching |
| LT | slow code review | automate style checks, focus human review on logic |
| LT | large PRs | enforce size limits, coach agents on PR splitting |
| CFR | insufficient testing | increase e2e coverage, add contract tests |
| CFR | no canary deployments | implement progressive rollout |
| TTRS | no runbooks | document common failures and recovery steps |
| TTRS | slow detection | improve alerting, add health check endpoints |
DORA:GE_SPECIFIC_CONSIDERATIONS¶
AGENTIC_ADVANTAGE¶
GE's agent-based development should naturally excel at DORA metrics:
- agents work 24/7, no handoff delays (reduces lead time)
- agents follow process consistently (reduces change failure rate)
- automated merge gate catches issues early (reduces CFR)
- agents respond to incidents immediately (reduces TTRS)
AGENTIC_RISK¶
RISK: agents may optimize for deployment frequency at the expense of quality
MITIGATION: CFR is a counterbalancing metric — fast but broken doesn't count
RISK: metrics gaming (splitting deploys to inflate frequency)
MITIGATION: only count deploys that include user-facing changes
RISK: low TTRS because agents auto-rollback without root-causing
MITIGATION: track both TTRS and "time to root cause" separately
CLIENT_REPORTING¶
WHAT: quarterly DORA summary included in client reports
WHO: margot prepares, using data from marta/iwona
FORMAT: tier badge + trend chart + key improvements made
GOAL: demonstrate to clients that GE delivers at elite/high performance
SAMPLE:
Q1 2026 — Project: ClientName
Deployment Frequency: 2.3/day (Elite)
Lead Time: 3.2 hours (Elite)
Change Failure Rate: 4.1% (Elite)
Time to Restore: 22 min (Elite)
Overall: ELITE tier
Improvements this quarter:
- Reduced lead time from 8h to 3.2h by parallelizing CI jobs
- Reduced CFR from 8% to 4.1% by adding contract tests
DORA:PITFALLS¶
PITFALL: measuring deployment frequency by counting ALL workflow runs
IMPACT: inflated numbers from CI reruns, staging deploys
FIX: only count successful production deploy workflows
PITFALL: measuring lead time from merge (not first commit)
IMPACT: hides coding and review bottlenecks
FIX: always measure from first commit on feature branch
PITFALL: not labeling failed deployments retroactively
IMPACT: CFR appears artificially low
FIX: marta/iwona must review all incidents and link to causing deploy
PITFALL: using DORA metrics to compare agents against each other
IMPACT: gaming, blame culture, hiding problems
RULE: DORA metrics are always per-project, never per-agent
PITFALL: measuring TTRS from when engineer starts working (not when alert fires)
IMPACT: hides detection latency
FIX: TTRS starts at first alert timestamp, not first human response
PITFALL: optimizing one metric at the expense of others
IMPACT: fast deploys with high failure rate is worse than slower, reliable deploys
RULE: always track all four together, improve weakest first