DOMAIN:DEVOPS — RELEASE_READINESS_RUBRIC¶
OWNER: marta
ALSO_USED_BY: iwona (co-evaluator), koen (lint input), nessa (performance input), jasper (reconciliation input)
UPDATED: 2026-03-24
SCOPE: scoring rubric for release readiness assessment — used by Marta and Iwona to produce a numeric score for every PR and release candidate
PURPOSE: make merge/block decisions objective, reproducible, and auditable
SCORING_OVERVIEW¶
Every PR receives a score from 0 to 100.
Threshold: >= 70 to merge. < 70 is blocked.
The score is computed from 7 weighted criteria.
Each criterion is scored 0-10 with Pass/Partial/Fail definitions.
SEE: devops/merge-gate-calibration.md for worked examples at different score levels.
CRITERIA_TABLE¶
| # | Criterion | Weight | Source |
|---|---|---|---|
| 1 | Tests pass | 25% | CI pipeline, Marije |
| 2 | Spec traceability | 20% | Jasper reconciliation report |
| 3 | No test weakening | 15% | Diff analysis (Marta) |
| 4 | Koen clean | 15% | Koen lint report |
| 5 | Performance budget | 10% | Nessa performance report |
| 6 | Code churn | 10% | Git history analysis |
| 7 | PR size | 5% | Diff stats |
FORMULA: Score = SUM((criterion_score / 10) * weight), with weight taken as points out of 100 (e.g. 25% → 25 points), so a perfect PR scores 100.
CRITERION_1: TESTS PASS (25%)¶
Measures: Do all tests in the suite pass on the current build?
| Score | Definition |
|---|---|
| 10 | All tests pass. Zero failures, zero skips. |
| 8 | All tests pass. 1-2 tests skipped with tracked tickets and documented reason. |
| 6 | All tests pass. 3-5 tests skipped. Skipped tests are not in critical paths. |
| 4 | 1-3 test failures. Failures are in non-critical paths. Developer claims "known issue." |
| 2 | 4+ test failures or any failure in a critical path (auth, payment, data integrity). |
| 0 | Test suite does not run, crashes, or has been disabled. |
CRITICAL_PATHS: authentication, authorization, payment, data persistence, API contracts.
A single failure in a critical path caps this criterion at 2.
SKIP_RULES:
- Skipped tests MUST have a tracked ticket number in the skip reason
- Skipped tests MUST NOT be in critical paths
- More than 5 skips in a single suite suggests the suite needs attention, not more skips
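The SKIP_RULES above lend themselves to an automated pre-check. A minimal sketch, assuming a Jira-style ticket ID format (`REL-412`) and that critical-path tests are identifiable by substrings in their file paths — both conventions are assumptions, not requirements of this rubric:

```typescript
// Ticket ID pattern is an assumption — adjust to your tracker's convention.
const TICKET_PATTERN = /\b[A-Z][A-Z0-9]+-\d+\b/;

// Matching CRITICAL_PATHS (Criterion 1) by file-path substring is a heuristic.
const CRITICAL_PATH_HINTS = ["auth", "payment", "persistence", "contract"];

interface SkippedTest {
  file: string;   // path of the test file containing the skip
  reason: string; // the skip reason string
}

/** Returns SKIP_RULES violations for one skipped test (empty = compliant). */
function skipViolations(t: SkippedTest): string[] {
  const violations: string[] = [];
  if (!TICKET_PATTERN.test(t.reason)) {
    violations.push("skip reason has no tracked ticket number");
  }
  if (CRITICAL_PATH_HINTS.some((hint) => t.file.toLowerCase().includes(hint))) {
    violations.push("skipped test is in a critical path");
  }
  return violations;
}
```

A skip annotated `"flaky upload, see REL-412"` in a non-critical file passes; an unannotated skip under `src/auth/` violates both rules.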
CRITERION_2: SPEC TRACEABILITY (20%)¶
Measures: Can every spec requirement be traced to at least one test?
| Score | Definition |
|---|---|
| 10 | 100% of spec requirements have corresponding tests. Mapping is documented. |
| 8 | 90-99% coverage. Missing items are cosmetic (tooltips, placeholder text). |
| 6 | 75-89% coverage. Missing items are non-critical but behavioral. |
| 4 | 50-74% coverage. Significant spec items untested. |
| 2 | 25-49% coverage. Most spec items untested. |
| 0 | No traceability. Tests exist but don't map to spec. |
INPUT: Jasper's reconciliation report provides the coverage matrix.
If Jasper has not run reconciliation, Marta must request it before scoring.
SPEC_ITEM_CLASSIFICATION:
- CRITICAL: auth, payment, data integrity, API contracts → must be tested (mandatory)
- IMPORTANT: core user flows, business logic, error handling → should be tested
- COSMETIC: UI text, tooltips, animation timing → nice to have
Missing CRITICAL items cap this criterion at 3.
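The coverage bands and the CRITICAL cap can be sketched as a lookup. The qualitative conditions in the table (cosmetic vs. behavioral missing items) still need evaluator judgment and are not modeled here:

```typescript
// Criterion 2 sketch: coverage percentage from Jasper's matrix, plus a flag
// for any untested CRITICAL spec item.
function specTraceabilityScore(coveragePct: number, missingCritical: boolean): number {
  let score: number;
  if (coveragePct >= 100) score = 10;
  else if (coveragePct >= 90) score = 8;
  else if (coveragePct >= 75) score = 6;
  else if (coveragePct >= 50) score = 4;
  else if (coveragePct >= 25) score = 2;
  else score = 0;
  // Missing CRITICAL items cap this criterion at 3.
  return missingCritical ? Math.min(score, 3) : score;
}
```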
CRITERION_3: NO TEST WEAKENING (15%)¶
Measures: Did this PR delete, skip, or weaken any previously passing test assertions?
| Score | Definition |
|---|---|
| 10 | No test assertions deleted, skipped, or weakened. Test coverage maintained or improved. |
| 8 | Test assertions changed but functionally equivalent (e.g., renamed, restructured). |
| 6 | 1-2 non-critical assertions removed with documented justification (behavior intentionally changed per spec). |
| 4 | 3+ assertions removed, OR any critical-path assertion weakened, even with justification. |
| 2 | Test file deleted without replacement. |
| 0 | Systematic test weakening: multiple files, pattern of removing assertions to make tests pass. |
HARD_BLOCK_RULE (TW-1): If any previously passing assertion in a critical path is deleted without spec-backed justification, this criterion scores 0 AND triggers a hard block regardless of total score.
WHAT_COUNTS_AS_WEAKENING:
- Deleting an expect() call
- Changing toBe(specificValue) to toBeDefined()
- Loosening toHaveLength(5) to a vaguer check (e.g. asserting only that the array is non-empty)
- Adding .skip to a previously passing test
- Wrapping a test in try/catch that swallows the assertion error
WHAT_DOES_NOT_COUNT:
- Updating an expected value because the spec changed (with Anna's spec change documented)
- Moving assertions to a different test file
- Replacing one assertion with a more specific one
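A crude first-pass for the diff analysis is counting `expect()` calls on removed lines of a unified diff. This is a review heuristic, not a verdict — it cannot distinguish a deleted assertion from one moved to another file (which explicitly does not count as weakening), so flagged hits go to Marta for judgment:

```typescript
// Count expect() calls on removal lines of a unified diff of test files.
function removedExpectCount(unifiedDiff: string): number {
  let count = 0;
  for (const line of unifiedDiff.split("\n")) {
    // "-" marks a removed line; skip the "---" old-file header.
    if (line.startsWith("-") && !line.startsWith("---") && line.includes("expect(")) {
      count++;
    }
  }
  return count;
}
```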
CRITERION_4: KOEN CLEAN (15%)¶
Measures: Does the code pass lint, format, and static analysis checks?
| Score | Definition |
|---|---|
| 10 | Zero errors, zero warnings. Clean pass. |
| 8 | Zero errors, 1-3 warnings. Warnings are cosmetic (unused imports, line length). |
| 6 | Zero errors, 4+ warnings. OR 1 error that is a false positive with a documented override. |
| 4 | 1-2 errors. Errors are style-related, not logic-related. |
| 2 | 3+ errors, OR any use of the `any` type in TypeScript (type safety violation). |
| 0 | Lint does not run or has been disabled for this PR. |
KOEN_REPORTS: Koen provides the lint report as part of the pipeline. If Koen has not run, Marta must request it.
ERROR_CLASSIFICATION:
- TYPE_SAFETY: any type, missing return types, unchecked nulls → always errors
- STYLE: naming conventions, import order, line length → warnings
- UNUSED: dead code, unused variables → warnings (but accumulated unused code is a smell)
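The score bands reduce to simple counts once Koen's report is in. Qualitative calls (false-positive overrides, style-vs-logic errors) are deliberately left out of this sketch and stay with the evaluator:

```typescript
// Criterion 4 sketch from Koen's report, reduced to counts plus an
// `any`-usage flag. `ran` is false when lint did not run or was disabled.
function koenScore(ran: boolean, errors: number, warnings: number, hasAnyType: boolean): number {
  if (!ran) return 0;                   // lint disabled or broken
  if (errors >= 3 || hasAnyType) return 2;
  if (errors >= 1) return 4;            // 1-2 errors
  if (warnings >= 4) return 6;
  if (warnings >= 1) return 8;          // 1-3 cosmetic warnings
  return 10;                            // clean pass
}
```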
CRITERION_5: PERFORMANCE BUDGET (10%)¶
Measures: Does the build stay within performance budgets?
| Score | Definition |
|---|---|
| 10 | All metrics within budget. No regressions detected. |
| 8 | All metrics within budget. Minor regression (< 10%) on non-critical endpoint. |
| 6 | One metric marginally over budget (< 15% over). Non-critical endpoint. |
| 4 | One metric significantly over budget (15-50% over), OR critical endpoint marginally over. |
| 2 | Multiple metrics over budget, OR critical endpoint significantly over. |
| 0 | Performance tests not run, OR any metric > 2x budget. |
IF_NO_PERFORMANCE_DATA:
- New feature with no baseline: score N/A, weight redistributed to other criteria
- Performance tests skipped by developer: score 0 (not N/A — skipping is a choice)
- Performance tests flaky: score 5, flag for infrastructure investigation
SEE: performance/performance-rubric.md for per-metric thresholds.
SEE: performance/calibration-examples.md for pass/fail examples.
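The over-budget bands above can be sketched from Nessa's metrics. Two assumptions: regression handling (the 8 band) and flaky-test handling are left to the evaluator, and the table leaves the 50-100%-over range undefined, which this sketch treats like the 15-50% band:

```typescript
// One entry per budgeted metric; pctOverBudget = 0 means within budget,
// 100 means at 2x budget.
interface PerfMetric { pctOverBudget: number; critical: boolean }

function performanceScore(ran: boolean, metrics: PerfMetric[]): number {
  if (!ran) return 0;                                        // tests not run
  if (metrics.some((m) => m.pctOverBudget > 100)) return 0;  // > 2x budget
  const over = metrics.filter((m) => m.pctOverBudget > 0);
  const criticalOver = over.filter((m) => m.critical);
  if (over.length === 0) return 10;
  if (over.length > 1 || criticalOver.some((m) => m.pctOverBudget >= 15)) return 2;
  if (over.some((m) => m.pctOverBudget >= 15) || criticalOver.length > 0) return 4;
  return 6; // one non-critical metric marginally (< 15%) over
}
```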
CRITERION_6: CODE CHURN (10%)¶
Measures: Is this code area stable, or is it being repeatedly patched?
| Score | Definition |
|---|---|
| 10 | All changed files have 0-1 changes in the past 30 days. Stable code. |
| 8 | Most files stable. 1 file has 2 changes in 14 days (common during active development). |
| 6 | 2-3 files have 2 changes each in 14 days. Active development area. |
| 4 | Any file has 3+ changes in 14 days. Pattern suggests unstable code. |
| 2 | Multiple files with 3+ changes in 14 days. Systematic instability. |
| 0 | Same logic file changed 5+ times in 14 days. Fundamental design problem. |
EXCLUDED_FROM_CHURN:
- Test files (expected to change alongside code)
- Configuration files (expected to change during setup)
- Auto-generated files (migrations, lockfiles)
- Documentation files
CHURN_CONTEXT:
- High churn during initial feature development (first 2 weeks) is normal — don't penalize
- High churn on bug fixes to the same file is a red flag — penalize
- Use git log to distinguish "building" from "patching"
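Once the excluded file types are filtered out of the git history, the bands reduce to per-file change counts. A sketch, with one simplification: the 30-day "stable" condition for a 10 is approximated as "no file changed more than once in the window":

```typescript
// Criterion 6 sketch: changesPerFile holds, for each non-excluded changed
// file, its number of changes in the past 14 days.
function churnScore(changesPerFile: number[]): number {
  const max = Math.max(0, ...changesPerFile);
  const filesWith2 = changesPerFile.filter((n) => n === 2).length;
  const filesWith3Plus = changesPerFile.filter((n) => n >= 3).length;
  if (max >= 5) return 0;        // same file changed 5+ times: design problem
  if (filesWith3Plus >= 2) return 2;
  if (filesWith3Plus === 1) return 4;
  if (filesWith2 >= 2) return 6;
  if (filesWith2 === 1) return 8;
  return 10;                     // all files at 0-1 changes
}
```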
CRITERION_7: PR SIZE (5%)¶
Measures: Is the PR appropriately scoped?
| Score | Definition |
|---|---|
| 10 | 1-50 lines changed. Single concern. |
| 8 | 51-150 lines changed. Clear scope. |
| 6 | 151-300 lines changed. Reasonable for a feature. |
| 4 | 301-500 lines changed. Should be reviewed for splitting opportunity. |
| 2 | 501-1000 lines changed. Likely should be split. |
| 0 | 1000+ lines changed. Almost certainly needs splitting. |
ADJUSTMENTS:
- Auto-generated lines (migrations, lockfiles) are EXCLUDED from line count
- Test lines are counted at 50% weight (more tests is good, not a complexity risk)
- Renamed/moved files: count only the actual content changes, not the full file
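The adjustments and bands combine into one small function. Classifying diff lines as code, test, or generated is assumed to happen upstream (e.g. by path globs), which is an assumption of this sketch rather than part of the rubric:

```typescript
interface DiffStats {
  codeLines: number;      // non-test, non-generated lines changed
  testLines: number;      // counted at 50% weight
  generatedLines: number; // migrations, lockfiles — deliberately ignored
}

// Criterion 7 sketch: apply the ADJUSTMENTS, then the size bands.
function prSizeScore(d: DiffStats): number {
  const adjusted = d.codeLines + d.testLines * 0.5; // generated lines excluded
  if (adjusted <= 50) return 10;
  if (adjusted <= 150) return 8;
  if (adjusted <= 300) return 6;
  if (adjusted <= 500) return 4;
  if (adjusted <= 1000) return 2;
  return 0;
}
```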
SCORE_CALCULATION_EXAMPLE¶
PR: "Add user profile page with avatar upload"
Tests pass: 8/10 * 25% = 20.0
Spec traceability: 7/10 * 20% = 14.0
No test weakening: 10/10 * 15% = 15.0
Koen clean: 9/10 * 15% = 13.5
Performance: 6/10 * 10% = 6.0
Code churn: 8/10 * 10% = 8.0
PR size: 6/10 * 5% = 3.0
-----
TOTAL: 79.5 → PASS (>= 70)
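The calculation above, expressed as code with the default feature weights (rounded to one decimal for audit-log stability):

```typescript
// Default weights in points out of 100 (Criteria 1-7).
const DEFAULT_WEIGHTS = {
  tests: 25, spec: 20, weakening: 15, koen: 15, perf: 10, churn: 10, size: 5,
};

type CriterionScores = Record<keyof typeof DEFAULT_WEIGHTS, number>; // each 0-10

// Score = SUM((criterion_score / 10) * weight_points)
function totalScore(scores: CriterionScores, weights = DEFAULT_WEIGHTS): number {
  const raw = Object.entries(weights).reduce(
    (sum, [key, weight]) => sum + (scores[key as keyof CriterionScores] / 10) * weight,
    0,
  );
  return Math.round(raw * 10) / 10; // avoid float noise in recorded scores
}

// The "Add user profile page" PR from the example above:
const profilePagePR: CriterionScores = {
  tests: 8, spec: 7, weakening: 10, koen: 9, perf: 6, churn: 8, size: 6,
};
// totalScore(profilePagePR) → 79.5 → PASS (>= 70)
```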
HARD_BLOCK_OVERRIDES¶
These conditions trigger a BLOCK regardless of total score:
| Code | Condition | Triggered By |
|---|---|---|
| TW-1 | Critical-path test assertion deleted without spec justification | Criterion 3 = 0 |
| TW-2 | Test file deleted without replacement | Criterion 3 <= 2 |
| SEC-1 | Security-relevant test coverage decreased | Criterion 2 + 3 combined |
| DATA-1 | Migration contains DROP/TRUNCATE without approval | Manual flag |
| FAIL-1 | Any test failure in auth, payment, or data persistence | Criterion 1 <= 2 |
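The override table can be evaluated mechanically once the conditions are established. The boolean flags below are assumed to be set by the evaluators named in the "Triggered By" column; this sketch only collects the codes:

```typescript
interface BlockFlags {
  criticalAssertionDeleted: boolean;       // TW-1: no spec-backed justification
  testFileDeletedNoReplacement: boolean;   // TW-2
  securityCoverageDecreased: boolean;      // SEC-1
  destructiveMigrationUnapproved: boolean; // DATA-1: DROP/TRUNCATE unapproved
  criticalPathTestFailure: boolean;        // FAIL-1: auth/payment/persistence
}

// Any returned code blocks the PR regardless of its total score.
function hardBlockCodes(f: BlockFlags): string[] {
  const codes: string[] = [];
  if (f.criticalAssertionDeleted) codes.push("TW-1");
  if (f.testFileDeletedNoReplacement) codes.push("TW-2");
  if (f.securityCoverageDecreased) codes.push("SEC-1");
  if (f.destructiveMigrationUnapproved) codes.push("DATA-1");
  if (f.criticalPathTestFailure) codes.push("FAIL-1");
  return codes;
}
```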
BREAK_GLASS_OVERRIDE¶
Conditions for passing a PR below threshold (production hotfix only):
| Requirement | Details |
|---|---|
| Active incident | Production users are currently affected |
| Minimal change | <= 20 lines changed |
| Hotfix label | Applied by authorized team member |
| Follow-up tracking | 48-hour deadline for proper fix with full tests |
| Score recorded | The below-threshold score is still recorded for audit |
Break-glass does NOT apply to: deadline pressure, client demos, "we'll fix it later" without tracked follow-up.
WEIGHT_ADJUSTMENTS_BY_PR_TYPE¶
| PR Type | Tests | Spec | Weakening | Koen | Perf | Churn | Size |
|---|---|---|---|---|---|---|---|
| Feature (default) | 25% | 20% | 15% | 15% | 10% | 10% | 5% |
| Test-only | 30% | 30% | 15% | 10% | 0% | 10% | 5% |
| Config-only | 10% | 10% | 10% | 20% | 5% | 25% | 20% |
| Hotfix | 30% | 10% | 20% | 10% | 5% | 20% | 5% |
| Migration-only | 15% | 20% | 15% | 10% | 10% | 15% | 15% |
Marta selects the PR type based on the content of the diff. Mixed PRs use the default weights.
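The weight table as data, with a sanity check that every row sums to 100 so scores stay on the 0-100 scale. Key names are this sketch's own; the percentages are the table's:

```typescript
// [tests, spec, weakening, koen, perf, churn, size] — points out of 100.
const WEIGHTS_BY_PR_TYPE: Record<string, number[]> = {
  feature:       [25, 20, 15, 15, 10, 10,  5], // default; also used for mixed PRs
  testOnly:      [30, 30, 15, 10,  0, 10,  5],
  configOnly:    [10, 10, 10, 20,  5, 25, 20],
  hotfix:        [30, 10, 20, 10,  5, 20,  5],
  migrationOnly: [15, 20, 15, 10, 10, 15, 15],
};

// Guard against editing one weight without rebalancing the row.
function weightsValid(table: Record<string, number[]>): boolean {
  return Object.values(table).every(
    (row) => row.reduce((a, b) => a + b, 0) === 100,
  );
}
```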