DOMAIN:DEVOPS — MERGE_GATE_CALIBRATION
OWNER: marta
ALSO_USED_BY: iwona (change intelligence), jasper (reconciliation input), koen (lint status)
UPDATED: 2026-03-24
SCOPE: calibration examples for merge gate evaluation — JIT injected before every PR review and release readiness assessment
PURPOSE: ensure consistent merge/block decisions by Marta and Iwona across all client PRs with concrete scored examples
HOW_TO_USE_THIS_PAGE
Read these examples BEFORE evaluating any PR or release candidate.
Marta and Iwona decide: does this PR merge or get blocked?
Every decision must be defensible. "I felt uneasy" is not a reason. Numbers and criteria are.
MERGE_GATE_PRINCIPLES:
- The gate protects clients from shipping broken software
- Blocking is expensive (delays delivery, costs money) — only block when the risk is real
- Passing is dangerous (ships defects) — only pass when confidence is earned
- When in doubt, ask for more information rather than guessing pass/block
- Auto-generated code, migrations, and config changes have different risk profiles than logic changes
SCORE_RANGE: 0-100
THRESHOLD: >= 70 to merge, < 70 blocked
SEE_ALSO: devops/release-readiness-rubric.md for full scoring criteria and weights
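The arithmetic behind the score tables on this page can be sketched as follows. The weights are taken from those tables. The names `GATE_WEIGHTS`, `rawScore`, and `meetsThreshold` are hypothetical, and treating N/A as 0 points is an assumption based on how most examples here score it. The authoritative rules live in the rubric.

```typescript
// Weights as they appear in the score tables on this page (percent of total).
const GATE_WEIGHTS = {
  testsPass: 25,
  specTraceability: 20,
  noTestWeakening: 15,
  koenClean: 15,
  performanceBudget: 10,
  codeChurn: 10,
  prSize: 5,
} as const;

type Criterion = keyof typeof GATE_WEIGHTS;

// Each criterion is scored 0-10. null means N/A and contributes 0 points
// (assumption, see the note above).
type Scores = Record<Criterion, number | null>;

const MERGE_THRESHOLD = 70;

// Sum of weight * (score / 10) over all criteria that have a score.
function rawScore(scores: Scores): number {
  let total = 0;
  for (const key of Object.keys(GATE_WEIGHTS) as Criterion[]) {
    const score = scores[key];
    if (score === null) continue;
    total += (GATE_WEIGHTS[key] * score) / 10;
  }
  return total;
}

function meetsThreshold(total: number): boolean {
  return total >= MERGE_THRESHOLD;
}

// Scores from EXAMPLE_1, before the test weakening override is applied.
const example1Scores: Scores = {
  testsPass: 10,
  specTraceability: 2,
  noTestWeakening: 0,
  koenClean: 8,
  performanceBudget: null,
  codeChurn: 5,
  prSize: 8,
};

console.log(rawScore(example1Scores), meetsThreshold(rawScore(example1Scores))); // 50 false
```

This covers only the raw weighted sum. Overrides and adjustments (hard blocks, break-glass, discounts) are applied on top of it.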
EXAMPLE_1: PR THAT SHOULD BE BLOCKED — TEST WEAKENING DETECTED
RELEASE_READINESS_SCORE: 35/100
DECISION: BLOCK
PR_SUMMARY
Title: "Refactor payment validation"
Files changed: 4
Lines added: 12, deleted: 48
Key change in payment.test.ts:
- it('should reject payments exceeding daily limit', async () => {
-   const result = await processPayment({ amount: 100000 });
-   expect(result.success).toBe(false);
-   expect(result.error).toBe('DAILY_LIMIT_EXCEEDED');
- });
-
- it('should reject payments with invalid currency codes', async () => {
-   const result = await processPayment({ amount: 100, currency: 'INVALID' });
-   expect(result.success).toBe(false);
- });
+ it('should process payments', async () => {
+   const result = await processPayment({ amount: 100 });
+   expect(result).toBeDefined();
+ });
SCORE_BREAKDOWN
| Criteria | Weight | Score | Weighted |
|---|---|---|---|
| Tests pass | 25% | 10/10 | 25 |
| Spec traceability | 20% | 2/10 | 4 |
| No test weakening | 15% | 0/10 | 0 |
| Koen clean | 15% | 8/10 | 12 |
| Performance budget | 10% | N/A | 0 |
| Code churn | 10% | 5/10 | 5 |
| PR size | 5% | 8/10 | 4 |
| TOTAL | | | 35 (adjusted from 50; test weakening override) |
WHY_BLOCKED:
- Two specific, meaningful assertions (daily limit, invalid currency) were DELETED
- Replaced with a single assertion that only checks the object exists — this is test theater
- Tests "pass" because the new test can never fail, not because the code is correct
- TEST_WEAKENING is a HARD BLOCK regardless of other scores — see rubric rule TW-1
- The daily limit test protects against real financial exposure
- Net effect: payment validation coverage went from 85% to 30%
WHAT_DEVELOPER_SHOULD_DO:
- Restore the deleted test assertions
- If the refactor changed the API shape, update the tests to match the new shape — do NOT delete them
- If the daily limit behavior was intentionally removed, that requires a spec change from Anna first
EVALUATOR_ACTION: BLOCK. Flag TEST_WEAKENING. Return specific deleted assertions in the review comment.
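A first-pass signal for this kind of weakening could be a count of `expect(` calls removed versus added in the diff. The sketch below is a hypothetical heuristic, not the rubric's TW-1 implementation. `countAssertions` and `AssertionDelta` are illustrative names, and the input is assumed to be unified-diff text with `-` and `+` line prefixes.

```typescript
interface AssertionDelta {
  removed: number;
  added: number;
  weakened: boolean;
}

// Counts expect( calls on removed and added diff lines.
function countAssertions(diffText: string): AssertionDelta {
  let removed = 0;
  let added = 0;
  for (const line of diffText.split("\n")) {
    // Skip file headers such as "--- a/file" and "+++ b/file".
    if (line.startsWith("---") || line.startsWith("+++")) continue;
    const hits = (line.match(/\bexpect\(/g) ?? []).length;
    if (line.startsWith("-")) removed += hits;
    else if (line.startsWith("+")) added += hits;
  }
  // Heuristic only: flags a net loss of assertions.
  return { removed, added, weakened: removed > added };
}

// The payment.test.ts change from this example.
const paymentTestDiff = [
  "- it('should reject payments exceeding daily limit', async () => {",
  "-   const result = await processPayment({ amount: 100000 });",
  "-   expect(result.success).toBe(false);",
  "-   expect(result.error).toBe('DAILY_LIMIT_EXCEEDED');",
  "- });",
  "- it('should reject payments with invalid currency codes', async () => {",
  "-   const result = await processPayment({ amount: 100, currency: 'INVALID' });",
  "-   expect(result.success).toBe(false);",
  "- });",
  "+ it('should process payments', async () => {",
  "+   const result = await processPayment({ amount: 100 });",
  "+   expect(result).toBeDefined();",
  "+ });",
].join("\n");

console.log(countAssertions(paymentTestDiff));
```

A count like this cannot tell a strong assertion from a weak one, so a PR that swaps `toBe(false)` for `toBeDefined()` one-for-one would slip through. The evaluator still has to read the deleted assertions.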
EXAMPLE_2: PR THAT LOOKS BIG BUT IS SAFE — AUTO-GENERATED MIGRATION
RELEASE_READINESS_SCORE: 82/100
DECISION: PASS
PR_SUMMARY
Title: "Add client project settings table"
Files changed: 3
Lines added: 342, deleted: 0
Files:
1. drizzle/migrations/0015_add_project_settings.sql — 310 lines (auto-generated by drizzle-kit)
2. lib/db/schema.ts — 22 lines added (new table definition)
3. lib/db/schema.test.ts — 10 lines added (new table type test)
SCORE_BREAKDOWN
| Criteria | Weight | Score | Weighted |
|---|---|---|---|
| Tests pass | 25% | 7/10 | 17.5 |
| Spec traceability | 20% | 8/10 | 16 |
| No test weakening | 15% | 10/10 | 15 |
| Koen clean | 15% | 9/10 | 13.5 |
| Performance budget | 10% | N/A | 0 |
| Code churn | 10% | 10/10 | 10 |
| PR size | 5% | 3/10 | 1.5 |
| TOTAL | | | 82 (adjusted from 73.5; migration line count discounted) |
WHY_PASSED_DESPITE_342_LINES:
- 310 of 342 lines are auto-generated SQL migration — not human-written logic
- Auto-generated migrations have a different risk profile: they are deterministic output of the schema definition
- The actual human-written code is 32 lines (schema + test)
- PR size score is low (3/10) but weight is only 5%, and migration lines are discounted in the adjusted calculation
- No test weakening, spec traceability is clear (project settings table matches spec), lint is clean
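The "32 human-written lines" figure can be reproduced by excluding auto-generated files from the line count. This is a minimal sketch under one assumption: that any `.sql` file under `drizzle/migrations/` is drizzle-kit output. `ChangedFile`, `isAutoGenerated`, and `humanWrittenLines` are hypothetical names.

```typescript
interface ChangedFile {
  path: string;
  linesAdded: number;
  linesDeleted: number;
}

// Assumption: SQL files under drizzle/migrations/ are generated by drizzle-kit.
// Other generated paths would need their own patterns.
const GENERATED_PATTERNS: RegExp[] = [/^drizzle\/migrations\/.+\.sql$/];

function isAutoGenerated(path: string): boolean {
  return GENERATED_PATTERNS.some((p) => p.test(path));
}

// Total changed lines (added + deleted) in files that are not auto-generated.
function humanWrittenLines(files: ChangedFile[]): number {
  return files
    .filter((f) => !isAutoGenerated(f.path))
    .reduce((sum, f) => sum + f.linesAdded + f.linesDeleted, 0);
}

// The three files from this example.
const example2Files: ChangedFile[] = [
  { path: "drizzle/migrations/0015_add_project_settings.sql", linesAdded: 310, linesDeleted: 0 },
  { path: "lib/db/schema.ts", linesAdded: 22, linesDeleted: 0 },
  { path: "lib/db/schema.test.ts", linesAdded: 10, linesDeleted: 0 },
];

console.log(humanWrittenLines(example2Files)); // 32
```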
WHAT_TO_VERIFY:
- Migration is reversible (has a DOWN migration or drizzle can generate one)
- No data-destructive operations (DROP, TRUNCATE, ALTER TYPE) in the migration
- New table has appropriate indexes for expected query patterns
- Schema test verifies the TypeScript types match the SQL columns
EVALUATOR_ACTION: PASS with note: "342 lines, but 310 are auto-generated migration. Human-written changes are minimal and well-tested."
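The data-destructive check from WHAT_TO_VERIFY can be approximated with a keyword scan of the migration SQL. This sketch is a plain regex pass, not a SQL parser. `findDestructiveOps` is a hypothetical helper, and the `ALTER ... TYPE` pattern assumes PostgreSQL-style syntax.

```typescript
// Operations named in WHAT_TO_VERIFY. Keyword matching only, so it can
// produce false positives (for example, a keyword inside a string literal).
const DESTRUCTIVE_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "DROP", pattern: /\bDROP\b/i },
  { label: "TRUNCATE", pattern: /\bTRUNCATE\b/i },
  // Matches "ALTER TYPE x", "ALTER COLUMN c TYPE t", and "ALTER COLUMN c SET DATA TYPE t".
  { label: "ALTER TYPE", pattern: /\bALTER\s+(COLUMN\s+\S+\s+)?(SET\s+DATA\s+)?TYPE\b/i },
];

function findDestructiveOps(sql: string): string[] {
  // Strip "--" line comments so commented-out statements are not flagged.
  const withoutComments = sql.replace(/--.*$/gm, "");
  return DESTRUCTIVE_PATTERNS
    .filter(({ pattern }) => pattern.test(withoutComments))
    .map(({ label }) => label);
}

console.log(findDestructiveOps('CREATE TABLE "project_settings" ("id" serial PRIMARY KEY);')); // []
console.log(findDestructiveOps("DROP TABLE users; TRUNCATE logs;")); // ["DROP", "TRUNCATE"]
```

An empty result is a reason to keep reading the migration rather than proof that it is safe. Reversibility and index coverage still need a human check.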
EXAMPLE_3: PR WITH CODE CHURN — SAME FILE CHANGED 3 TIMES IN 2 WEEKS
RELEASE_READINESS_SCORE: 58/100
DECISION: BLOCK
PR_SUMMARY
Title: "Fix cart total calculation (again)"
Files changed: 2
Lines added: 8, deleted: 5
Git log for lib/cart/total.ts:
2026-03-22 Fix cart total for multi-currency (#412)
2026-03-18 Fix cart total rounding (#398)
2026-03-10 Update cart total calculation (#385)
SCORE_BREAKDOWN
| Criteria | Weight | Score | Weighted |
|---|---|---|---|
| Tests pass | 25% | 8/10 | 20 |
| Spec traceability | 20% | 6/10 | 12 |
| No test weakening | 15% | 10/10 | 15 |
| Koen clean | 15% | 7/10 | 10.5 |
| Performance budget | 10% | N/A | 0 |
| Code churn | 10% | 1/10 | 1 |
| PR size | 5% | 10/10 | 5 |
| TOTAL | | | 58 (adjusted from 63.5; churn pattern penalty) |
WHY_BLOCKED:
- Same file has been "fixed" 3 times in 14 days
- This pattern indicates the root cause is not understood
- Each fix addresses a symptom (rounding, multi-currency) but the underlying calculation model may be wrong
- The current PR is only 8 lines and looks fine on its own, but shipping another band-aid fix accumulates tech debt and risk
- If this file breaks again in production, client trust is damaged
CHURN_THRESHOLD:
- 2 changes in 14 days to the same logic file: WARNING (note in review, don't block)
- 3+ changes in 14 days to the same logic file: BLOCK for architectural review
- Exception: test files, config files, and auto-generated files are excluded from churn tracking
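The thresholds above map onto a small classifier. This is a sketch under stated assumptions: `changeDates` holds previously merged changes to the file (whether the PR under review counts toward the total is left to the rubric), and the exclusion patterns are illustrative guesses at what "test, config, and auto-generated files" look like in a repo. `classifyChurn` is a hypothetical name.

```typescript
type ChurnVerdict = "OK" | "WARNING" | "BLOCK";

const CHURN_WINDOW_DAYS = 14;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumption: these patterns approximate the excluded file kinds.
const CHURN_EXCLUDED: RegExp[] = [
  /\.test\.[jt]sx?$/,        // test files
  /\.config\.[jt]s$/,        // config files
  /^drizzle\/migrations\//,  // auto-generated migrations
];

function classifyChurn(path: string, changeDates: Date[], now: Date): ChurnVerdict {
  if (CHURN_EXCLUDED.some((p) => p.test(path))) return "OK";
  const windowStart = now.getTime() - CHURN_WINDOW_DAYS * MS_PER_DAY;
  // Count changes that fall inside the 14-day window ending at `now`.
  const recent = changeDates.filter(
    (d) => d.getTime() >= windowStart && d.getTime() <= now.getTime(),
  ).length;
  if (recent >= 3) return "BLOCK";
  if (recent === 2) return "WARNING";
  return "OK";
}

// Git log for lib/cart/total.ts from this example, evaluated on 2026-03-24.
const cartTotalChanges = [
  new Date("2026-03-22T00:00:00Z"),
  new Date("2026-03-18T00:00:00Z"),
  new Date("2026-03-10T00:00:00Z"),
];
console.log(classifyChurn("lib/cart/total.ts", cartTotalChanges, new Date("2026-03-24T00:00:00Z"))); // BLOCK
```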
WHAT_DEVELOPER_SHOULD_DO:
- Step back and review the full calculation model, not just this fix
- Write a comprehensive test suite for cart total: multi-currency, rounding, discounts, tax, shipping
- Consider whether the cart total needs a dedicated, well-tested pure function with no side effects
- Submit the fix alongside the comprehensive tests in a single PR
EVALUATOR_ACTION: BLOCK. Flag CODE_CHURN. Recommend architectural review of cart total calculation before merging another patch.
EXAMPLE_4: EMERGENCY BREAK-GLASS PR — PRODUCTION FIX
RELEASE_READINESS_SCORE: 62/100 (BELOW THRESHOLD)
DECISION: PASS (BREAK-GLASS OVERRIDE)
PR_SUMMARY
Title: "[HOTFIX] Fix null pointer in user authentication"
Files changed: 1
Lines added: 3, deleted: 1
Labels: hotfix, production-incident
const user = await db.users.findById(session.userId);
- const permissions = user.roles.flatMap(r => r.permissions);
+ const permissions = user?.roles?.flatMap(r => r.permissions) ?? [];
Incident: Users with deleted accounts can still have active sessions. When they hit any authenticated endpoint, the app crashes because user is null.
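The behaviour of the patched line can be isolated in a pure function, which is also the shape the follow-up test would most easily target. The `SessionUser` and `UserRole` types and the `permissionsFor` name are hypothetical. The real model types are not shown on this page.

```typescript
// Hypothetical minimal shapes; the real user model likely has more fields.
interface UserRole {
  permissions: string[];
}
interface SessionUser {
  roles?: UserRole[];
}

// Same expression as the hotfix: a missing user or missing roles yields
// an empty permission list instead of throwing.
function permissionsFor(user: SessionUser | null | undefined): string[] {
  return user?.roles?.flatMap((r) => r.permissions) ?? [];
}

console.log(permissionsFor(null)); // []
console.log(permissionsFor({ roles: [{ permissions: ["read"] }, { permissions: ["write"] }] })); // ["read", "write"]
```

Returning an empty list stops the crash, but the deleted user's request still reaches the endpoint with no permissions. Whether that session should be rejected outright is a spec decision, not something the fix settles.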
SCORE_BREAKDOWN
| Criteria | Weight | Score | Weighted |
|---|---|---|---|
| Tests pass | 25% | 5/10 | 12.5 |
| Spec traceability | 20% | 3/10 | 6 |
| No test weakening | 15% | 10/10 | 15 |
| Koen clean | 15% | 10/10 | 15 |
| Performance budget | 10% | N/A | 0 |
| Code churn | 10% | 8/10 | 8 |
| PR size | 5% | 10/10 | 5 |
| TOTAL | | | 62 (rounded from 61.5) |
WHY_PASS_DESPITE_BELOW_THRESHOLD:
- BREAK-GLASS conditions are met:
1. Production incident is active (users are affected NOW)
2. Fix is minimal and surgical (3 lines, single file)
3. Fix is obviously correct (null-safe access on a nullable value)
4. Risk of NOT merging exceeds risk of merging
5. PR has hotfix label applied by authorized team member
BREAK_GLASS_RULES:
- Score is still calculated and recorded — it does not disappear
- A follow-up PR with proper tests MUST be filed within 48 hours
- The follow-up PR is tracked: if not merged within 48 hours, escalate
- Break-glass can only be invoked for production incidents, not for deadline pressure
- Break-glass PRs are limited to 20 lines changed — larger changes require normal review
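The mechanical parts of these rules (label, incident flag, line limit, 48-hour deadline) can be sketched as below. It assumes "lines changed" means added plus deleted, and it deliberately leaves out the judgment calls: whether the fix is obviously correct and whether the label came from an authorized team member. `HotfixPr`, `breakGlassEligible`, and `followUpDeadline` are hypothetical names.

```typescript
interface HotfixPr {
  labels: string[];
  linesAdded: number;
  linesDeleted: number;
  activeProductionIncident: boolean;
}

const BREAK_GLASS_MAX_LINES = 20;
const FOLLOW_UP_HOURS = 48;

// Checks only the mechanical conditions; human review covers the rest.
function breakGlassEligible(pr: HotfixPr): boolean {
  // Assumption: "lines changed" = added + deleted.
  const linesChanged = pr.linesAdded + pr.linesDeleted;
  return (
    pr.activeProductionIncident &&
    pr.labels.includes("hotfix") &&
    linesChanged <= BREAK_GLASS_MAX_LINES
  );
}

// Deadline for the follow-up PR, 48 hours after the hotfix merges.
function followUpDeadline(mergedAt: Date): Date {
  return new Date(mergedAt.getTime() + FOLLOW_UP_HOURS * 60 * 60 * 1000);
}

// The PR from this example.
const example4Pr: HotfixPr = {
  labels: ["hotfix", "production-incident"],
  linesAdded: 3,
  linesDeleted: 1,
  activeProductionIncident: true,
};
console.log(breakGlassEligible(example4Pr)); // true
console.log(followUpDeadline(new Date("2026-03-24T10:00:00Z")).toISOString()); // 2026-03-26T10:00:00.000Z
```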
WHAT_MUST_HAPPEN_NEXT:
- File follow-up ticket: "Add test for deleted-user-with-active-session path"
- The follow-up test should cover: user deleted mid-session, user suspended mid-session, user roles changed mid-session
- Consider: should session invalidation happen on user deletion? (spec question for Anna)
EVALUATOR_ACTION: PASS with BREAK_GLASS flag. Log the score. Create 48-hour follow-up tracker. Note specific tests required in follow-up.
EXAMPLE_5: RELEASE READINESS — PASS VS BLOCK COMPARISON
SCENARIO_A: SCORE 85 — PASS
PR: "Implement project dashboard with analytics"
Files: 12 changed, +480/-20
| Criteria | Weight | Score | Rationale | Weighted |
|---|---|---|---|---|
| Tests pass | 25% | 9/10 | All 47 tests pass, 2 flaky tests marked skip with ticket | 22.5 |
| Spec traceability | 20% | 8/10 | 11 of 12 spec items have corresponding tests, 1 deferred (tooltip wording) | 16 |
| No test weakening | 15% | 10/10 | No deleted or weakened assertions | 15 |
| Koen clean | 15% | 9/10 | 1 warning (unused import), 0 errors | 13.5 |
| Performance budget | 10% | 7/10 | LCP 1.8s (budget: 2.0s), CLS 0.05 (budget: 0.1) | 7 |
| Code churn | 10% | 8/10 | New feature, no repeat changes to existing files | 8 |
| PR size | 5% | 6/10 | 480 lines is moderate, but it's a complete feature | 3 |
| TOTAL | | | | 85 |
VERDICT: PASS. Well-tested new feature. Minor gaps (tooltip, 2 flaky tests) are documented and tracked. Performance is within budget.
SCENARIO_B: SCORE 49 — BLOCK
PR: "Implement project dashboard with analytics" (same feature, different execution)
Files: 18 changed, +680/-120
| Criteria | Weight | Score | Rationale | Weighted |
|---|---|---|---|---|
| Tests pass | 25% | 6/10 | 38 of 44 tests pass, 6 failures marked as "known issues" | 15 |
| Spec traceability | 20% | 4/10 | Only 5 of 12 spec items have tests, error paths untested | 8 |
| No test weakening | 15% | 5/10 | 3 previously passing assertions changed to .skip | 7.5 |
| Koen clean | 15% | 6/10 | 4 warnings, 1 error (any type usage) | 9 |
| Performance budget | 10% | 3/10 | LCP 2.8s (budget: 2.0s), bundle added 180KB | 3 |
| Code churn | 10% | 4/10 | Modified 4 shared utility files with unclear scope | 4 |
| PR size | 5% | 4/10 | 680 lines with 120 deletions suggests significant refactoring mixed with feature | 2 |
| TOTAL | | | | 49 (rounded from 48.5) |
VERDICT: BLOCK. Multiple issues:
- 6 test failures dismissed as "known issues" is a red flag — tests should pass or be fixed
- Spec coverage is only 42% (5 of 12 items have tests), which is below the minimum viable level
- 3 skipped assertions are mild test weakening: those assertions passed before and no longer run
- Performance regression: 40% over LCP budget
- Mixed refactoring + feature in one PR makes review difficult
WHAT_DEVELOPER_SHOULD_DO:
- Split into two PRs: one for the refactoring, one for the feature
- Fix or properly address the 6 failing tests
- Restore the 3 skipped assertions or explain why the behavior changed
- Optimize LCP — 180KB bundle addition needs investigation (lazy loading? tree shaking?)
- Add tests for the 7 missing spec items
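The percentages quoted in the Scenario B verdict follow from the table figures. This small sketch shows the arithmetic. `specCoveragePercent` and `percentOverBudget` are hypothetical helpers, and rounding to whole percent is an assumption.

```typescript
// Share of spec items that have tests, as a whole percent.
function specCoveragePercent(coveredItems: number, totalItems: number): number {
  return Math.round((coveredItems / totalItems) * 100);
}

// How far a measurement is over (positive) or under (negative) its budget,
// as a whole percent of the budget.
function percentOverBudget(measured: number, budget: number): number {
  return Math.round(((measured - budget) / budget) * 100);
}

console.log(specCoveragePercent(5, 12));  // 42  (Scenario B spec coverage)
console.log(percentOverBudget(2.8, 2.0)); // 40  (Scenario B LCP vs budget)
console.log(percentOverBudget(1.8, 2.0)); // -10 (Scenario A LCP, under budget)
```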
OVERRIDE_RULES
HARD_BLOCKS (override score to BLOCK regardless of total):
- TW-1: Any previously passing assertion deleted in a non-hotfix PR
- TW-2: Test file deleted without replacement
- SEC-1: Security-relevant test coverage decreased
- DATA-1: Migration contains DROP or TRUNCATE without explicit approval
HARD_PASSES (only via break-glass; all of the following must hold):
- BG-1: Active production incident + minimal fix + hotfix label
- BG-2: Maximum 20 lines changed
- BG-3: 48-hour follow-up PR requirement
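One way to read how overrides combine with the score is sketched below. The precedence is an assumption: hard blocks are checked first, then the threshold, then break-glass. The caller is expected not to report TW-1 for hotfix PRs, since that rule is scoped to non-hotfix PRs. `GateInput`, `GateDecision`, and `decideGate` are hypothetical names.

```typescript
type HardBlockRule = "TW-1" | "TW-2" | "SEC-1" | "DATA-1";
type GateDecision = "PASS" | "BLOCK" | "PASS (BREAK-GLASS)";

interface GateInput {
  score: number;               // 0-100, after any score adjustments
  hardBlocks: HardBlockRule[]; // rule IDs triggered by the PR
  breakGlass: boolean;         // true only when BG-1 to BG-3 all hold
}

function decideGate(input: GateInput): GateDecision {
  // Assumption: a hard block wins over everything else.
  if (input.hardBlocks.length > 0) return "BLOCK";
  if (input.score >= 70) return "PASS";
  // Below threshold: only break-glass can let the PR through.
  return input.breakGlass ? "PASS (BREAK-GLASS)" : "BLOCK";
}

console.log(decideGate({ score: 35, hardBlocks: ["TW-1"], breakGlass: false })); // BLOCK (EXAMPLE_1)
console.log(decideGate({ score: 82, hardBlocks: [], breakGlass: false }));       // PASS (EXAMPLE_2)
console.log(decideGate({ score: 58, hardBlocks: [], breakGlass: false }));       // BLOCK (EXAMPLE_3)
console.log(decideGate({ score: 62, hardBlocks: [], breakGlass: true }));        // PASS (BREAK-GLASS) (EXAMPLE_4)
```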
SCORE_ADJUSTMENTS:
- Auto-generated files: line count excluded from PR size calculation
- Migration files: scored separately (schema correctness, reversibility)
- Test-only PRs: spec traceability weight increases to 30%, PR size weight drops to 0%
- Config-only PRs: performance budget and test pass weights swap (config rarely needs tests)
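The weight changes for test-only and config-only PRs can be expressed as a transformation of the base weights. This sketch applies the two adjustments exactly as written above. `PrKind`, `adjustedWeights`, and `totalWeight` are hypothetical names, and the base weights are the ones used in the score tables on this page.

```typescript
type PrKind = "standard" | "test-only" | "config-only";

type GateWeightSet = {
  testsPass: number;
  specTraceability: number;
  noTestWeakening: number;
  koenClean: number;
  performanceBudget: number;
  codeChurn: number;
  prSize: number;
};

const BASE_WEIGHT_SET: GateWeightSet = {
  testsPass: 25,
  specTraceability: 20,
  noTestWeakening: 15,
  koenClean: 15,
  performanceBudget: 10,
  codeChurn: 10,
  prSize: 5,
};

function adjustedWeights(kind: PrKind): GateWeightSet {
  const w = { ...BASE_WEIGHT_SET };
  if (kind === "test-only") {
    w.specTraceability = 30;
    w.prSize = 0;
  } else if (kind === "config-only") {
    // Swap the tests-pass and performance-budget weights.
    [w.testsPass, w.performanceBudget] = [w.performanceBudget, w.testsPass];
  }
  return w;
}

function totalWeight(w: GateWeightSet): number {
  return Object.values(w).reduce((sum, x) => sum + x, 0);
}

console.log(totalWeight(adjustedWeights("standard")));    // 100
console.log(totalWeight(adjustedWeights("config-only"))); // 100
console.log(totalWeight(adjustedWeights("test-only")));   // 105
```

Taken literally, the test-only adjustment produces weights that total 105. The rubric presumably rebalances another criterion, but this sketch does not guess which one.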