
DOMAIN:DEVOPS — RELEASE_READINESS_RUBRIC

OWNER: marta
ALSO_USED_BY: iwona (co-evaluator), koen (lint input), nessa (performance input), jasper (reconciliation input)
UPDATED: 2026-03-24
SCOPE: scoring rubric for release readiness assessment — used by Marta and Iwona to produce a numeric score for every PR and release candidate
PURPOSE: make merge/block decisions objective, reproducible, and auditable


SCORING_OVERVIEW

Every PR receives a score from 0 to 100.
Threshold: >= 70 to merge. < 70 is blocked.
The score is computed from 7 weighted criteria.
Each criterion is scored 0-10 with Pass/Partial/Fail definitions.
SEE: devops/merge-gate-calibration.md for worked examples at different score levels.


CRITERIA_TABLE

#  Criterion           Weight  Source
1  Tests pass          25%     CI pipeline, Marije
2  Spec traceability   20%     Jasper reconciliation report
3  No test weakening   15%     Diff analysis (Marta)
4  Koen clean          15%     Koen lint report
5  Performance budget  10%     Nessa performance report
6  Code churn          10%     Git history analysis
7  PR size              5%     Diff stats

FORMULA: Score = SUM(criterion_score / 10 * weight * 100), where weight is the criterion's fractional weight (e.g., 0.25 for Tests pass).
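
The formula can be sketched as follows. This is a minimal illustration, not the production scoring tool; the type and function names are invented for this example, and the sample scores are the ones from the worked example later in this rubric.

```typescript
// Each criterion contributes (score / 10) * weight * 100 points.
type CriterionResult = { score: number; weight: number }; // score 0-10, weight as a fraction

function computeTotal(results: CriterionResult[]): number {
  return results.reduce((sum, r) => sum + (r.score / 10) * r.weight * 100, 0);
}

// Values from the worked example in SCORE_CALCULATION_EXAMPLE:
const example: CriterionResult[] = [
  { score: 8, weight: 0.25 },  // Tests pass
  { score: 7, weight: 0.20 },  // Spec traceability
  { score: 10, weight: 0.15 }, // No test weakening
  { score: 9, weight: 0.15 },  // Koen clean
  { score: 6, weight: 0.10 },  // Performance budget
  { score: 8, weight: 0.10 },  // Code churn
  { score: 6, weight: 0.05 },  // PR size
];
// computeTotal(example) ≈ 79.5 → PASS (>= 70)
```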


CRITERION_1: TESTS PASS (25%)

Measures: Do all tests in the suite pass on the current build?

Score Definition
10 All tests pass. Zero failures, zero skips (or skips have tracked tickets).
8 All tests pass. 1-2 tests skipped with tracked tickets and documented reason.
6 All tests pass. 3-5 tests skipped. Skipped tests are not in critical paths.
4 1-3 test failures. Failures are in non-critical paths. Developer claims "known issue."
2 4+ test failures or any failure in a critical path (auth, payment, data integrity).
0 Test suite does not run, crashes, or has been disabled.

CRITICAL_PATHS: authentication, authorization, payment, data persistence, API contracts.
A single failure in a critical path caps this criterion at 2.

SKIP_RULES:
- Skipped tests MUST have a tracked ticket number in the skip reason
- Skipped tests MUST NOT be in critical paths
- More than 5 skips in a single suite suggests the suite needs attention, not more skips
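
The scoring table and critical-path cap above can be sketched as a single function. The input shape is illustrative; note the table does not define a band for more than 5 skips, so this sketch conservatively caps that case at 6.

```typescript
type SuiteResult = {
  suiteRan: boolean;            // false if the suite crashed or was disabled
  failures: number;
  skips: number;
  criticalPathFailure: boolean; // any failure in auth, payment, data integrity, etc.
};

function scoreTestsPass(r: SuiteResult): number {
  if (!r.suiteRan) return 0;
  // A single critical-path failure caps this criterion at 2, as do 4+ failures.
  if (r.criticalPathFailure || r.failures >= 4) return 2;
  if (r.failures >= 1) return 4;
  // Skips only from here; >5 skips is undefined in the table, capped at 6 here.
  if (r.skips >= 3) return 6;
  if (r.skips >= 1) return 8;
  return 10;
}
```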


CRITERION_2: SPEC TRACEABILITY (20%)

Measures: Can every spec requirement be traced to at least one test?

Score Definition
10 100% of spec requirements have corresponding tests. Mapping is documented.
8 90-99% coverage. Missing items are cosmetic (tooltips, placeholder text).
6 75-89% coverage. Missing items are non-critical but behavioral.
4 50-74% coverage. Significant spec items untested.
2 25-49% coverage. Most spec items untested.
0 No traceability. Tests exist but don't map to spec.

INPUT: Jasper's reconciliation report provides the coverage matrix.
If Jasper has not run reconciliation, Marta must request it before scoring.

SPEC_ITEM_CLASSIFICATION:
- CRITICAL: auth, payment, data integrity, API contracts → must be tested (mandatory)
- IMPORTANT: core user flows, business logic, error handling → should be tested
- COSMETIC: UI text, tooltips, animation timing → nice to have

Missing CRITICAL items cap this criterion at 3.
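
The coverage bands and the critical-item cap can be expressed as a small function. This is a sketch; the function name and inputs are illustrative, and the coverage percentage is assumed to come from Jasper's coverage matrix.

```typescript
// Map spec-traceability coverage (0-100%) to a 0-10 criterion score.
function scoreTraceability(coveragePct: number, missingCritical: boolean): number {
  let score: number;
  if (coveragePct >= 100) score = 10;
  else if (coveragePct >= 90) score = 8;
  else if (coveragePct >= 75) score = 6;
  else if (coveragePct >= 50) score = 4;
  else if (coveragePct >= 25) score = 2;
  else score = 0;
  // Missing CRITICAL items cap this criterion at 3, regardless of coverage.
  return missingCritical ? Math.min(score, 3) : score;
}
```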


CRITERION_3: NO TEST WEAKENING (15%)

Measures: Did this PR delete, skip, or weaken any previously passing test assertions?

Score Definition
10 No test assertions deleted, skipped, or weakened. Test coverage maintained or improved.
8 Test assertions changed but functionally equivalent (e.g., renamed, restructured).
6 1-2 non-critical assertions removed with documented justification (behavior intentionally changed per spec).
4 3+ assertions removed, OR any critical-path assertion weakened, even with justification.
2 Test file deleted without replacement.
0 Systematic test weakening: multiple files, pattern of removing assertions to make tests pass.

HARD_BLOCK_RULE (TW-1): If any previously passing assertion in a critical path is deleted without spec-backed justification, this criterion scores 0 AND triggers a hard block regardless of total score.

WHAT_COUNTS_AS_WEAKENING:
- Deleting an expect() call
- Changing toBe(specificValue) to toBeDefined()
- Changing toEqual([1, 2, 3]) to expect.arrayContaining([1]) (exact match loosened to partial)
- Adding .skip to a previously passing test
- Wrapping a test in try/catch that swallows the assertion error

WHAT_DOES_NOT_COUNT:
- Updating an expected value because the spec changed (with Anna's spec change documented)
- Moving assertions to a different test file
- Replacing one assertion with a more specific one
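
A rough diff heuristic can surface candidates for the weakening patterns above. This is a sketch only: it cannot distinguish moved assertions from deleted ones, so it flags PRs for human review rather than scoring them. The function name and inputs are invented for illustration.

```typescript
// Flag a diff as possible test weakening if assertions net-decrease
// or a .skip was added to a test. Human judgment still decides the score.
function flagsPossibleWeakening(removedLines: string[], addedLines: string[]): boolean {
  const removedAssertions = removedLines.filter((l) => l.includes("expect(")).length;
  const addedAssertions = addedLines.filter((l) => l.includes("expect(")).length;
  const addedSkips = addedLines.some((l) => /\b(it|test|describe)\.skip\(/.test(l));
  return addedSkips || removedAssertions > addedAssertions;
}
```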


CRITERION_4: KOEN CLEAN (15%)

Measures: Does the code pass lint, format, and static analysis checks?

Score Definition
10 Zero errors, zero warnings. Clean pass.
8 Zero errors, 1-3 warnings. Warnings are cosmetic (unused imports, line length).
6 Zero errors, 4+ warnings. Or 1 error that is a false positive with documented override.
4 1-2 errors. Errors are style-related, not logic-related.
2 3+ errors, or any usage of the TypeScript 'any' type (type safety violation).
0 Lint does not run or has been disabled for this PR.

KOEN_REPORTS: Koen provides the lint report as part of the pipeline. If Koen has not run, Marta must request it.

ERROR_CLASSIFICATION:
- TYPE_SAFETY: any type, missing return types, unchecked nulls → always errors
- STYLE: naming conventions, import order, line length → warnings
- UNUSED: dead code, unused variables → warnings (but accumulated unused code is a smell)


CRITERION_5: PERFORMANCE BUDGET (10%)

Measures: Does the build stay within performance budgets?

Score Definition
10 All metrics within budget. No regressions detected.
8 All metrics within budget. Minor regression (< 10%) on non-critical endpoint.
6 One metric marginally over budget (< 15% over). Non-critical endpoint.
4 One metric significantly over budget (15-50% over). OR critical endpoint marginally over.
2 Multiple metrics over budget. OR critical endpoint significantly over.
0 Performance tests not run. OR any metric > 2x budget.

IF_NO_PERFORMANCE_DATA:
- New feature with no baseline: score N/A, weight redistributed to other criteria
- Performance tests skipped by developer: score 0 (not N/A — skipping is a choice)
- Performance tests flaky: score 5, flag for infrastructure investigation
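
The rubric says an N/A criterion's weight is "redistributed" but does not fix a method; proportional redistribution across the remaining criteria is one reasonable choice, sketched below under that assumption.

```typescript
// Redistribute the weight of N/A criteria proportionally across the rest,
// so the remaining fractional weights again sum to 1.
function redistribute(
  weights: Record<string, number>,
  naCriteria: string[]
): Record<string, number> {
  const active = Object.keys(weights).filter((k) => !naCriteria.includes(k));
  const activeTotal = active.reduce((s, k) => s + weights[k], 0);
  const out: Record<string, number> = {};
  for (const k of active) out[k] = weights[k] / activeTotal;
  return out;
}
```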

SEE: performance/performance-rubric.md for per-metric thresholds.
SEE: performance/calibration-examples.md for pass/fail examples.


CRITERION_6: CODE CHURN (10%)

Measures: Is this code area stable, or is it being repeatedly patched?

Score Definition
10 All changed files have 0-1 changes in the past 30 days. Stable code.
8 Most files stable. 1 file has 2 changes in 14 days (common during active development).
6 2-3 files have 2 changes each in 14 days. Active development area.
4 Any file has 3+ changes in 14 days. Pattern suggests unstable code.
2 Multiple files with 3+ changes in 14 days. Systematic instability.
0 Same logic file changed 5+ times in 14 days. Fundamental design problem.

EXCLUDED_FROM_CHURN:
- Test files (expected to change alongside code)
- Configuration files (expected to change during setup)
- Auto-generated files (migrations, lockfiles)
- Documentation files

CHURN_CONTEXT:
- High churn during initial feature development (first 2 weeks) is normal — don't penalize
- High churn on bug fixes to the same file is a red flag — penalize
- Use git log to distinguish "building" from "patching"
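
The churn bands can be sketched as a function over per-file change counts. Note the table's 10 band is defined over a 30-day window while the lower bands use 14 days; this sketch simplifies to 14-day counts only (an assumption), and assumes excluded file types are already filtered out.

```typescript
// Map per-file change counts (last 14 days, excluded files removed)
// to a 0-10 churn score.
function scoreChurn(changes14d: number[]): number {
  const max = Math.max(0, ...changes14d);
  const filesAt3Plus = changes14d.filter((c) => c >= 3).length;
  const filesAt2 = changes14d.filter((c) => c === 2).length;
  if (max >= 5) return 0;          // same file changed 5+ times: design problem
  if (filesAt3Plus >= 2) return 2; // systematic instability
  if (filesAt3Plus === 1) return 4;
  if (filesAt2 >= 2) return 6;     // active development area
  if (filesAt2 === 1) return 8;
  return 10;                       // stable code
}
```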


CRITERION_7: PR SIZE (5%)

Measures: Is the PR appropriately scoped?

Score Definition
10 1-50 lines changed. Single concern.
8 51-150 lines changed. Clear scope.
6 151-300 lines changed. Reasonable for a feature.
4 301-500 lines changed. Should be reviewed for splitting opportunity.
2 501-1000 lines changed. Likely should be split.
0 1000+ lines changed. Almost certainly needs splitting.

ADJUSTMENTS:
- Auto-generated lines (migrations, lockfiles) are EXCLUDED from line count
- Test lines are counted at 50% weight (more tests are a benefit, not a complexity risk)
- Renamed/moved files: count only the actual content changes, not the full file
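
The size bands and adjustments can be sketched as two small functions. Field names are illustrative; the caller is assumed to have already separated generated, test, and other lines in the diff stats.

```typescript
// Adjusted line count: generated lines excluded, test lines at 50% weight.
function effectiveSize(totalCodeLines: number, testLines: number, generatedLines: number): number {
  return (totalCodeLines - generatedLines) + testLines * 0.5;
}

// Map adjusted line count to a 0-10 PR-size score.
function scorePrSize(lines: number): number {
  if (lines <= 50) return 10;
  if (lines <= 150) return 8;
  if (lines <= 300) return 6;
  if (lines <= 500) return 4;
  if (lines <= 1000) return 2;
  return 0;
}
```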


SCORE_CALCULATION_EXAMPLE

PR: "Add user profile page with avatar upload"

Tests pass:         8/10 * 25% = 20.0
Spec traceability:  7/10 * 20% = 14.0
No test weakening: 10/10 * 15% = 15.0
Koen clean:         9/10 * 15% = 13.5
Performance:        6/10 * 10% =  6.0
Code churn:         8/10 * 10% =  8.0
PR size:            6/10 *  5% =  3.0
                                -----
TOTAL:                           79.5 → PASS (>= 70)


HARD_BLOCK_OVERRIDES

These conditions trigger a BLOCK regardless of total score:

Code    Condition                                                        Triggered By
TW-1    Critical-path test assertion deleted without spec justification  Criterion 3 = 0
TW-2    Test file deleted without replacement                            Criterion 3 <= 2
SEC-1   Security-relevant test coverage decreased                        Criterion 2 + 3 combined
DATA-1  Migration contains DROP/TRUNCATE without approval                Manual flag
FAIL-1  Any test failure in auth, payment, or data persistence           Criterion 1 <= 2
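
The override table can be sketched as a pure check returning the triggered codes; an empty result means no hard block. The input shape is invented for illustration, with the two manual/combined conditions (SEC-1, DATA-1) modeled as pre-computed flags.

```typescript
type BlockInputs = {
  criterion1: number; // 0-10
  criterion3: number; // 0-10
  criticalAssertionDeletedWithoutSpec: boolean; // TW-1, set by reviewer
  securityCoverageDecreased: boolean;           // SEC-1, from criteria 2 + 3 review
  unapprovedDestructiveMigration: boolean;      // DATA-1, manual flag
};

// Return every hard-block code that fires; any non-empty result blocks the PR
// regardless of total score.
function hardBlockCodes(x: BlockInputs): string[] {
  const codes: string[] = [];
  if (x.criticalAssertionDeletedWithoutSpec) codes.push("TW-1");
  if (x.criterion3 <= 2) codes.push("TW-2");
  if (x.securityCoverageDecreased) codes.push("SEC-1");
  if (x.unapprovedDestructiveMigration) codes.push("DATA-1");
  if (x.criterion1 <= 2) codes.push("FAIL-1");
  return codes;
}
```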

BREAK_GLASS_OVERRIDE

Conditions for passing a PR below threshold (production hotfix only):

Requirement         Details
Active incident     Production users are currently affected
Minimal change      <= 20 lines changed
Hotfix label        Applied by authorized team member
Follow-up tracking  48-hour deadline for proper fix with full tests
Score recorded      The below-threshold score is still recorded for audit

Break-glass does NOT apply to: deadline pressure, client demos, "we'll fix it later" without tracked follow-up.


WEIGHT_ADJUSTMENTS_BY_PR_TYPE

PR Type            Tests  Spec  Weakening  Koen  Perf  Churn  Size
Feature (default)    25%   20%        15%   15%   10%    10%    5%
Test-only            30%   30%        15%   10%    0%    10%    5%
Config-only          10%   10%        10%   20%    5%    25%   20%
Hotfix               30%   10%        20%   10%    5%    20%    5%
Migration-only       15%   20%        15%   10%   10%    15%   15%

Marta selects the PR type based on the content of the diff. Mixed PRs use the default weights.
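
Each row in the table above should sum to 100%; a quick sanity check like the sketch below (profile keys are illustrative) catches drift when weights are edited.

```typescript
// Weight profiles from WEIGHT_ADJUSTMENTS_BY_PR_TYPE, in criteria order
// (Tests, Spec, Weakening, Koen, Perf, Churn, Size), as integer percentages.
const WEIGHT_PROFILES: Record<string, number[]> = {
  feature:       [25, 20, 15, 15, 10, 10, 5],
  testOnly:      [30, 30, 15, 10, 0, 10, 5],
  configOnly:    [10, 10, 10, 20, 5, 25, 20],
  hotfix:        [30, 10, 20, 10, 5, 20, 5],
  migrationOnly: [15, 20, 15, 10, 10, 15, 15],
};

// True only if every profile's weights sum to exactly 100%.
function profilesValid(profiles: Record<string, number[]>): boolean {
  return Object.values(profiles).every((w) => w.reduce((a, b) => a + b, 0) === 100);
}
```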