
DOMAIN:DEVOPS — MERGE_GATE_CALIBRATION

OWNER: marta
ALSO_USED_BY: iwona (change intelligence), jasper (reconciliation input), koen (lint status)
UPDATED: 2026-03-24
SCOPE: calibration examples for merge gate evaluation — JIT injected before every PR review and release readiness assessment
PURPOSE: ensure consistent merge/block decisions by Marta and Iwona across all client PRs with concrete scored examples


HOW_TO_USE_THIS_PAGE

Read these examples BEFORE evaluating any PR or release candidate.
Marta and Iwona decide: does this PR merge or get blocked?
Every decision must be defensible. "I felt uneasy" is not a reason. Numbers and criteria are.

MERGE_GATE_PRINCIPLES:
- The gate protects clients from shipping broken software
- Blocking is expensive (delays delivery, costs money) — only block when the risk is real
- Passing is dangerous (ships defects) — only pass when confidence is earned
- When in doubt, ask for more information rather than guessing pass/block
- Auto-generated code, migrations, and config changes have different risk profiles than logic changes

SCORE_RANGE: 0-100
THRESHOLD: >= 70 to merge, < 70 blocked
SEE_ALSO: devops/release-readiness-rubric.md for full scoring criteria and weights


EXAMPLE_1: PR THAT SHOULD BE BLOCKED — TEST WEAKENING DETECTED

RELEASE_READINESS_SCORE: 35/100

DECISION: BLOCK

PR_SUMMARY

Title: "Refactor payment validation"
Files changed: 4
Lines added: 12, deleted: 48

Key change in payment.test.ts:

- it('should reject payments exceeding daily limit', async () => {
-   const result = await processPayment({ amount: 100000 });
-   expect(result.success).toBe(false);
-   expect(result.error).toBe('DAILY_LIMIT_EXCEEDED');
- });
-
- it('should reject payments with invalid currency codes', async () => {
-   const result = await processPayment({ amount: 100, currency: 'INVALID' });
-   expect(result.success).toBe(false);
- });

+ it('should process payments', async () => {
+   const result = await processPayment({ amount: 100 });
+   expect(result).toBeDefined();
+ });

SCORE_BREAKDOWN

Criteria            Weight  Score  Weighted
Tests pass          25%     10/10  25
Spec traceability   20%     2/10   4
No test weakening   15%     0/10   0
Koen clean          15%     8/10   12
Performance budget  10%     N/A    0
Code churn          10%     5/10   5
PR size             5%      8/10   4
TOTAL                              35 (adjusted from 50, test weakening override)
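
The Weighted column is weight times score divided by 10, summed over all rows. A minimal sketch of that arithmetic follows. The Criterion type and rawScore function are hypothetical names, and treating N/A as 0 is an assumption taken from this table, not a confirmed rubric rule. The override that lowers the total to 35 is applied separately and is not modelled here.

```typescript
// Hypothetical shape for one rubric row; the authoritative rubric lives in
// devops/release-readiness-rubric.md.
interface Criterion {
  name: string;
  weightPct: number; // 25 means 25%
  score: number | null; // 0-10, or null for N/A
}

// Weighted points per row = weightPct * score / 10.
// Assumption: an N/A row contributes 0, as in the Example 1 table.
function rawScore(criteria: Criterion[]): number {
  return criteria.reduce(
    (sum, c) => sum + (c.score === null ? 0 : (c.weightPct * c.score) / 10),
    0
  );
}

const example1: Criterion[] = [
  { name: "Tests pass", weightPct: 25, score: 10 },
  { name: "Spec traceability", weightPct: 20, score: 2 },
  { name: "No test weakening", weightPct: 15, score: 0 },
  { name: "Koen clean", weightPct: 15, score: 8 },
  { name: "Performance budget", weightPct: 10, score: null },
  { name: "Code churn", weightPct: 10, score: 5 },
  { name: "PR size", weightPct: 5, score: 8 },
];

// Raw total before the test weakening override is applied.
const example1Raw = rawScore(example1);
```

example1Raw is the 50 quoted in the TOTAL row.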

WHY_BLOCKED:
- Two specific, meaningful assertions (daily limit, invalid currency) were DELETED
- Replaced with a single assertion that only checks the object exists — this is test theater
- Tests "pass" because the new test can never fail, not because the code is correct
- TEST_WEAKENING is a HARD BLOCK regardless of other scores — see rubric rule TW-1
- The daily limit test protects against real financial exposure
- Net effect: payment validation coverage went from 85% to 30%
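
A first-pass signal for this pattern is to count expect( calls on removed versus added diff lines. The sketch below is a crude heuristic under stated assumptions: assertionDelta and looksLikeTestWeakening are hypothetical helpers, the input is a unified diff of a test file, and the check says nothing about assertion strength (it cannot tell toBe(false) from toBeDefined()). It only flags a PR for a human to read the deleted assertions.

```typescript
interface AssertionDelta {
  removed: number; // expect( calls on lines starting with "-"
  added: number; // expect( calls on lines starting with "+"
}

// Counts diff lines that contain an expect( call, split by diff marker.
function assertionDelta(diff: string): AssertionDelta {
  let removed = 0;
  let added = 0;
  for (const line of diff.split("\n")) {
    if (line.indexOf("expect(") === -1) continue;
    if (line.charAt(0) === "-") removed++;
    else if (line.charAt(0) === "+") added++;
  }
  return { removed, added };
}

// Heuristic only: more assertions removed than added deserves a closer look.
function looksLikeTestWeakening(delta: AssertionDelta): boolean {
  return delta.removed > delta.added;
}

// The payment.test.ts change from this example.
const paymentDiff = [
  "- it('should reject payments exceeding daily limit', async () => {",
  "-   const result = await processPayment({ amount: 100000 });",
  "-   expect(result.success).toBe(false);",
  "-   expect(result.error).toBe('DAILY_LIMIT_EXCEEDED');",
  "- });",
  "- it('should reject payments with invalid currency codes', async () => {",
  "-   const result = await processPayment({ amount: 100, currency: 'INVALID' });",
  "-   expect(result.success).toBe(false);",
  "- });",
  "+ it('should process payments', async () => {",
  "+   const result = await processPayment({ amount: 100 });",
  "+   expect(result).toBeDefined();",
  "+ });",
].join("\n");

const paymentDelta = assertionDelta(paymentDiff);
```

For this diff the delta is 3 removed and 1 added, so the heuristic flags it.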

WHAT_DEVELOPER_SHOULD_DO:
- Restore the deleted test assertions
- If the refactor changed the API shape, update the tests to match the new shape — do NOT delete them
- If the daily limit behavior was intentionally removed, that requires a spec change from Anna first

EVALUATOR_ACTION: BLOCK. Flag TEST_WEAKENING. Return specific deleted assertions in the review comment.


EXAMPLE_2: PR THAT LOOKS BIG BUT IS SAFE — AUTO-GENERATED MIGRATION

RELEASE_READINESS_SCORE: 82/100

DECISION: PASS

PR_SUMMARY

Title: "Add client project settings table"
Files changed: 3
Lines added: 342, deleted: 0

Files:
1. drizzle/migrations/0015_add_project_settings.sql — 310 lines (auto-generated by drizzle-kit)
2. lib/db/schema.ts — 22 lines added (new table definition)
3. lib/db/schema.test.ts — 10 lines added (new table type test)

SCORE_BREAKDOWN

Criteria            Weight  Score  Weighted
Tests pass          25%     7/10   17.5
Spec traceability   20%     8/10   16
No test weakening   15%     10/10  15
Koen clean          15%     9/10   13.5
Performance budget  10%     N/A    10
Code churn          10%     10/10  10
PR size             5%      3/10   1.5
TOTAL                              82 (adjusted, migration line count discounted)

WHY_PASSED_DESPITE_342_LINES:
- 310 of 342 lines are auto-generated SQL migration — not human-written logic
- Auto-generated migrations have a different risk profile: they are deterministic output of the schema definition
- The actual human-written code is 32 lines (schema + test)
- PR size score is low (3/10) but weight is only 5%, and migration lines are discounted in the adjusted calculation
- No test weakening, spec traceability is clear (project settings table matches spec), lint is clean

WHAT_TO_VERIFY:
- Migration is reversible (has a DOWN migration or drizzle can generate one)
- No data-destructive operations (DROP, TRUNCATE, ALTER TYPE) in the migration
- New table has appropriate indexes for expected query patterns
- Schema test verifies the TypeScript types match the SQL columns
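
The destructive-operation check can be roughed out as a keyword scan over the migration text. This is a sketch, not a SQL parser: findDestructiveStatements is a hypothetical helper, it splits naively on semicolons, it does not strip comments or string literals, and it reads ALTER TYPE literally. A hit means "look at this statement", not "this migration is unsafe". The table names in the sample are made up for illustration.

```typescript
// Keywords taken from the WHAT_TO_VERIFY list: DROP, TRUNCATE, ALTER TYPE.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\bDROP\b/i,
  /\bTRUNCATE\b/i,
  /\bALTER\s+TYPE\b/i,
];

// Returns every statement that contains one of the destructive keywords.
function findDestructiveStatements(sql: string): string[] {
  return sql
    .split(";")
    .map((statement) => statement.trim())
    .filter(
      (statement) =>
        statement.length > 0 &&
        DESTRUCTIVE_PATTERNS.some((pattern) => pattern.test(statement))
    );
}

// Hypothetical migration snippets for illustration only.
const additiveMigration =
  "CREATE TABLE project_settings (id serial PRIMARY KEY); " +
  "CREATE INDEX project_settings_id_idx ON project_settings (id);";
const destructiveMigration =
  "DROP TABLE old_settings; CREATE TABLE project_settings (id serial PRIMARY KEY);";

const additiveHits = findDestructiveStatements(additiveMigration);
const destructiveHits = findDestructiveStatements(destructiveMigration);
```

The additive migration produces no hits; the second one returns the single DROP TABLE statement for review.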

EVALUATOR_ACTION: PASS with note: "342 lines, but 310 are auto-generated migration. Human-written changes are minimal and well-tested."


EXAMPLE_3: PR WITH CODE CHURN — SAME FILE CHANGED 3 TIMES IN 2 WEEKS

RELEASE_READINESS_SCORE: 58/100

DECISION: BLOCK

PR_SUMMARY

Title: "Fix cart total calculation (again)"
Files changed: 2
Lines added: 8, deleted: 5

Git log for lib/cart/total.ts:

2026-03-22 Fix cart total for multi-currency (#412)
2026-03-18 Fix cart total rounding (#398)
2026-03-10 Update cart total calculation (#385)

SCORE_BREAKDOWN

Criteria            Weight  Score  Weighted
Tests pass          25%     8/10   20
Spec traceability   20%     6/10   12
No test weakening   15%     10/10  15
Koen clean          15%     7/10   10.5
Performance budget  10%     N/A    0
Code churn          10%     1/10   1
PR size             5%      10/10  5
TOTAL                              58 (adjusted, churn pattern penalty)

WHY_BLOCKED:
- Same file has been "fixed" 3 times in 14 days
- This pattern indicates the root cause is not understood
- Each fix addresses a symptom (rounding, multi-currency) but the underlying calculation model may be wrong
- The current PR is 8 lines and fine on its own, but shipping another band-aid fix adds tech debt and risk
- If this file breaks again in production, client trust is damaged

CHURN_THRESHOLD:
- 2 changes in 14 days to the same logic file: WARNING (note in review, don't block)
- 3+ changes in 14 days to the same logic file: BLOCK for architectural review
- Exception: test files, config files, and auto-generated files are excluded from churn tracking
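
The thresholds above map onto a small classification function. The sketch below assumes the caller has already filtered out test, config, and auto-generated files, and passes the change dates for one logic file as ISO date strings. classifyChurn and ChurnLevel are hypothetical names. Whether the PR under review counts as one of the changes is not stated here, so the function counts only the dates it is given.

```typescript
type ChurnLevel = "OK" | "WARNING" | "BLOCK";

const WINDOW_DAYS = 14;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Counts changes inside the 14-day window ending at asOf (both ends inclusive)
// and applies the thresholds: 2 changes = WARNING, 3 or more = BLOCK.
function classifyChurn(changeDates: string[], asOf: string): ChurnLevel {
  const end = Date.parse(asOf);
  const start = end - WINDOW_DAYS * MS_PER_DAY;
  const recent = changeDates.filter((date) => {
    const t = Date.parse(date);
    return t >= start && t <= end;
  }).length;
  if (recent >= 3) return "BLOCK";
  if (recent === 2) return "WARNING";
  return "OK";
}

// Git log dates for lib/cart/total.ts from this example.
const cartTotalChanges = ["2026-03-22", "2026-03-18", "2026-03-10"];
const cartTotalChurn = classifyChurn(cartTotalChanges, "2026-03-24");
```

With the three logged changes and an evaluation date of 2026-03-24, the result is BLOCK.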

WHAT_DEVELOPER_SHOULD_DO:
- Step back and review the full calculation model, not just this fix
- Write a comprehensive test suite for cart total: multi-currency, rounding, discounts, tax, shipping
- Consider whether the cart total needs a dedicated, well-tested pure function with no side effects
- Submit the fix alongside the comprehensive tests in a single PR

EVALUATOR_ACTION: BLOCK. Flag CODE_CHURN. Recommend architectural review of cart total calculation before merging another patch.


EXAMPLE_4: EMERGENCY BREAK-GLASS PR — PRODUCTION FIX

RELEASE_READINESS_SCORE: 62/100 (BELOW THRESHOLD)

DECISION: PASS (BREAK-GLASS OVERRIDE)

PR_SUMMARY

Title: "[HOTFIX] Fix null pointer in user authentication"
Files changed: 1
Lines added: 3, deleted: 1
Labels: hotfix, production-incident

  const user = await db.users.findById(session.userId);
- const permissions = user.roles.flatMap(r => r.permissions);
+ const permissions = user?.roles?.flatMap(r => r.permissions) ?? [];

Incident: Users with deleted accounts can still have active sessions. When they hit any authenticated endpoint, the app crashes because user is null.

SCORE_BREAKDOWN

Criteria            Weight  Score  Weighted
Tests pass          25%     5/10   12.5
Spec traceability   20%     3/10   6
No test weakening   15%     10/10  15
Koen clean          15%     10/10  15
Performance budget  10%     N/A    0
Code churn          10%     8/10   8
PR size             5%      10/10  5
TOTAL                              62

WHY_PASS_DESPITE_BELOW_THRESHOLD:
- BREAK-GLASS conditions are met:
  1. Production incident is active (users are affected NOW)
  2. Fix is minimal and surgical (3 lines, single file)
  3. Fix is obviously correct (null-safe access on a nullable value)
  4. Risk of NOT merging exceeds risk of merging
  5. PR has hotfix label applied by authorized team member

BREAK_GLASS_RULES:
- Score is still calculated and recorded — it does not disappear
- A follow-up PR with proper tests MUST be filed within 48 hours
- The follow-up PR is tracked: if not merged within 48 hours, escalate
- Break-glass can only be invoked for production incidents, not for deadline pressure
- Break-glass PRs are limited to 20 lines changed — larger changes require normal review
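
The mechanical parts of these rules (incident active, hotfix label, 20-line cap, 48-hour follow-up) can be checked in code. The judgment calls ("obviously correct", "risk of not merging exceeds risk of merging") stay with the evaluator. A minimal sketch, assuming "lines changed" means added plus deleted; HotfixPr, breakGlassViolations, and followUpDueAt are hypothetical names.

```typescript
interface HotfixPr {
  activeProductionIncident: boolean;
  hotfixLabelByAuthorizedMember: boolean;
  linesAdded: number;
  linesDeleted: number;
}

const MAX_BREAK_GLASS_LINES = 20;
const FOLLOW_UP_HOURS = 48;

// Returns the reasons a PR does not qualify; an empty list means the
// mechanical break-glass conditions are satisfied.
function breakGlassViolations(pr: HotfixPr): string[] {
  const problems: string[] = [];
  if (!pr.activeProductionIncident) {
    problems.push("no active production incident");
  }
  if (!pr.hotfixLabelByAuthorizedMember) {
    problems.push("hotfix label missing or not applied by an authorized member");
  }
  // Assumption: the 20-line limit counts added plus deleted lines.
  if (pr.linesAdded + pr.linesDeleted > MAX_BREAK_GLASS_LINES) {
    problems.push("more than 20 lines changed, normal review required");
  }
  return problems;
}

// Deadline for the follow-up PR with proper tests.
function followUpDueAt(mergedAt: Date): Date {
  return new Date(mergedAt.getTime() + FOLLOW_UP_HOURS * 60 * 60 * 1000);
}

// The hotfix from this example: +3/-1 in a single file.
const example4: HotfixPr = {
  activeProductionIncident: true,
  hotfixLabelByAuthorizedMember: true,
  linesAdded: 3,
  linesDeleted: 1,
};
const example4Problems = breakGlassViolations(example4);
```

The Example 4 hotfix returns an empty list. A PR pushed through for deadline pressure fails the first check regardless of size.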

WHAT_MUST_HAPPEN_NEXT:
- File follow-up ticket: "Add test for deleted-user-with-active-session path"
- The follow-up test should cover: user deleted mid-session, user suspended mid-session, user roles changed mid-session
- Consider: should session invalidation happen on user deletion? (spec question for Anna)

EVALUATOR_ACTION: PASS with BREAK_GLASS flag. Log the score. Create 48-hour follow-up tracker. Note specific tests required in follow-up.


EXAMPLE_5: RELEASE READINESS — PASS VS BLOCK COMPARISON

SCENARIO_A: SCORE 85 — PASS

PR: "Implement project dashboard with analytics"
Files: 12 changed, +480/-20

Criteria            Weight  Score  Weighted  Rationale
Tests pass          25%     9/10   22.5      All 47 tests pass, 2 flaky tests marked skip with ticket
Spec traceability   20%     8/10   16        11 of 12 spec items have corresponding tests, 1 deferred (tooltip wording)
No test weakening   15%     10/10  15        No deleted or weakened assertions
Koen clean          15%     9/10   13.5      1 warning (unused import), 0 errors
Performance budget  10%     7/10   7         LCP 1.8s (budget: 2.0s), CLS 0.05 (budget: 0.1)
Code churn          10%     8/10   8         New feature, no repeat changes to existing files
PR size             5%      6/10   3         480 lines is moderate, but it's a complete feature
TOTAL                              85

VERDICT: PASS. Well-tested new feature. Minor gaps (tooltip, 2 flaky tests) are documented and tracked. Performance is within budget.

SCENARIO_B: SCORE 49 — BLOCK

PR: "Implement project dashboard with analytics" (same feature, different execution)
Files: 18 changed, +680/-120

Criteria            Weight  Score  Weighted  Rationale
Tests pass          25%     6/10   15        38 of 44 tests pass, 6 failures marked as "known issues"
Spec traceability   20%     4/10   8         Only 5 of 12 spec items have tests, error paths untested
No test weakening   15%     5/10   7.5       3 previously passing assertions changed to .skip
Koen clean          15%     6/10   9         4 warnings, 1 error (any type usage)
Performance budget  10%     3/10   3         LCP 2.8s (budget: 2.0s), bundle added 180KB
Code churn          10%     4/10   4         Modified 4 shared utility files with unclear scope
PR size             5%      4/10   2         680 lines with 120 deletions suggests significant refactoring mixed with feature
TOTAL                              49 (rounded from 48.5)

VERDICT: BLOCK. Multiple issues:
- 6 test failures dismissed as "known issues" is a red flag — tests should pass or be fixed
- Only 42% spec coverage is below minimum viable
- 3 skipped assertions is mild test weakening — those tests passed before and now don't
- Performance regression: 40% over LCP budget
- Mixed refactoring + feature in one PR makes review difficult

WHAT_DEVELOPER_SHOULD_DO:
- Split into two PRs: one for the refactoring, one for the feature
- Fix or properly address the 6 failing tests
- Restore the 3 skipped assertions or explain why the behavior changed
- Optimize LCP — 180KB bundle addition needs investigation (lazy loading? tree shaking?)
- Add tests for the 7 missing spec items


OVERRIDE_RULES

HARD_BLOCKS (override score to BLOCK regardless of total):
- TW-1: Any previously passing assertion deleted in a non-hotfix PR
- TW-2: Test file deleted without replacement
- SEC-1: Security-relevant test coverage decreased
- DATA-1: Migration contains DROP or TRUNCATE without explicit approval

HARD_PASSES (only via break-glass):
- BG-1: Active production incident + minimal fix + hotfix label
- BG-2: Maximum 20 lines changed
- BG-3: 48-hour follow-up PR requirement
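
Putting the threshold and the overrides together, the decision order can be sketched as below. The precedence is an assumption: hard blocks are checked first, then the 70-point threshold, and break-glass is consulted only for a below-threshold PR. This page does not say how a break-glass PR that also triggers SEC-1 or DATA-1 should be handled, so the sketch takes the conservative reading. GateInput and decide are hypothetical names, and the caller is expected to apply TW-1's non-hotfix scoping when filling in hardBlocks.

```typescript
type Decision = "PASS" | "BLOCK" | "PASS_BREAK_GLASS";

interface GateInput {
  score: number; // adjusted release readiness score, 0-100
  hardBlocks: string[]; // triggered rule IDs, e.g. ["TW-1"]
  breakGlassEligible: boolean; // BG-1 through BG-3 satisfied
}

const MERGE_THRESHOLD = 70;

function decide(input: GateInput): Decision {
  // Assumption: any hard block wins, even over break-glass.
  if (input.hardBlocks.length > 0) return "BLOCK";
  if (input.score >= MERGE_THRESHOLD) return "PASS";
  // Below threshold: only a valid break-glass request can pass.
  return input.breakGlassEligible ? "PASS_BREAK_GLASS" : "BLOCK";
}

// Examples 1-4 from this page expressed as gate inputs.
const gateExample1: GateInput = { score: 35, hardBlocks: ["TW-1"], breakGlassEligible: false };
const gateExample2: GateInput = { score: 82, hardBlocks: [], breakGlassEligible: false };
const gateExample3: GateInput = { score: 58, hardBlocks: [], breakGlassEligible: false };
const gateExample4: GateInput = { score: 62, hardBlocks: [], breakGlassEligible: true };
```

Under these assumptions Examples 1 and 3 block, Example 2 passes, and Example 4 passes with the break-glass flag, matching the decisions recorded above.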

SCORE_ADJUSTMENTS:
- Auto-generated files: line count excluded from PR size calculation
- Migration files: scored separately (schema correctness, reversibility)
- Test-only PRs: spec traceability weight increases to 30%, PR size weight drops to 0%
- Config-only PRs: performance budget and test pass weights swap (config rarely needs tests)
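
Two of these adjustments are simple enough to sketch: excluding auto-generated files from the PR size count, and swapping the test pass and performance budget weights for config-only PRs. The names below (ChangedFile, effectivePrSize, Weights, configOnlyWeights) are hypothetical, and the autoGenerated flag is assumed to be set by the caller. The base weights are the ones used in the score tables on this page.

```typescript
interface ChangedFile {
  path: string;
  linesChanged: number;
  autoGenerated: boolean; // assumption: decided upstream, e.g. by path or tool marker
}

// PR size counts only human-written lines.
function effectivePrSize(files: ChangedFile[]): number {
  return files
    .filter((file) => !file.autoGenerated)
    .reduce((total, file) => total + file.linesChanged, 0);
}

interface Weights {
  testsPass: number;
  specTraceability: number;
  noTestWeakening: number;
  koenClean: number;
  performanceBudget: number;
  codeChurn: number;
  prSize: number;
}

// Weights (in percent) as shown in the score breakdown tables.
const BASE_WEIGHTS: Weights = {
  testsPass: 25,
  specTraceability: 20,
  noTestWeakening: 15,
  koenClean: 15,
  performanceBudget: 10,
  codeChurn: 10,
  prSize: 5,
};

// Config-only PRs: test pass and performance budget weights trade places.
function configOnlyWeights(base: Weights): Weights {
  return {
    ...base,
    testsPass: base.performanceBudget,
    performanceBudget: base.testsPass,
  };
}

// Files from Example 2.
const example2Files: ChangedFile[] = [
  { path: "drizzle/migrations/0015_add_project_settings.sql", linesChanged: 310, autoGenerated: true },
  { path: "lib/db/schema.ts", linesChanged: 22, autoGenerated: false },
  { path: "lib/db/schema.test.ts", linesChanged: 10, autoGenerated: false },
];
const example2Size = effectivePrSize(example2Files);
```

For Example 2 the effective size is the 32 human-written lines, and the config-only swap leaves the weights summing to 100.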