DOMAIN:DEVOPS — MERGE_GATE_CALIBRATION

OWNER: marta
ALSO_USED_BY: iwona (change intelligence), jasper (reconciliation input), koen (lint status)
UPDATED: 2026-03-24
SCOPE: calibration examples for merge gate evaluation — JIT injected before every PR review and release readiness assessment
PURPOSE: ensure consistent merge/block decisions by Marta and Iwona across all client PRs with concrete scored examples

HOW_TO_USE_THIS_PAGE

Read these examples BEFORE evaluating any PR or release candidate.
Marta and Iwona decide: does this PR merge or get blocked?
Every decision must be defensible. "I felt uneasy" is not a reason. Numbers and criteria are.

MERGE_GATE_PRINCIPLES:
- The gate protects clients from shipping broken software
- Blocking is expensive (delays delivery, costs money) — only block when the risk is real
- Passing is dangerous (ships defects) — only pass when confidence is earned
- When in doubt, ask for more information rather than guessing pass/block
- Auto-generated code, migrations, and config changes have different risk profiles than logic changes

SCORE_RANGE: 0-100
THRESHOLD: >= 70 to merge, < 70 blocked
SEE_ALSO: devops/release-readiness-rubric.md for full scoring criteria and weights

EXAMPLE_1: PR THAT SHOULD BE BLOCKED — TEST WEAKENING DETECTED

RELEASE_READINESS_SCORE: 35/100

DECISION: BLOCK

PR_SUMMARY

Title: "Refactor payment validation"
Files changed: 4
Lines added: 12, deleted: 48

Key change in payment.test.ts:

- it('should reject payments exceeding daily limit', async () => {
-   const result = await processPayment({ amount: 100000 });
-   expect(result.success).toBe(false);
-   expect(result.error).toBe('DAILY_LIMIT_EXCEEDED');
- });
-
- it('should reject payments with invalid currency codes', async () => {
-   const result = await processPayment({ amount: 100, currency: 'INVALID' });
-   expect(result.success).toBe(false);
- });

+ it('should process payments', async () => {
+   const result = await processPayment({ amount: 100 });
+   expect(result).toBeDefined();
+ });

SCORE_BREAKDOWN

Criteria	Weight	Score	Weighted
Tests pass	25%	10/10	25
Spec traceability	20%	2/10	4
No test weakening	15%	0/10	0
Koen clean	15%	8/10	12
Performance budget	10%	N/A	0
Code churn	10%	5/10	5
PR size	5%	8/10	4
TOTAL			35 (adjusted from 50 — test weakening override)
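
The weighted column in the table above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the rubric's actual implementation: the `Criterion` shape and function names are hypothetical, and treating an N/A row as contributing 0 simply mirrors this table. The TW-1 override that produces the recorded 35 is not modelled here.

```typescript
// Hypothetical shape for one rubric row; field names are illustrative only.
interface Criterion {
  label: string;
  weightPct: number;    // weight as a percentage, e.g. 25 for 25%
  score: number | null; // 0-10, or null when the criterion is N/A
}

// Weighted contribution of one row: weight% x (score / 10).
// Assumption: an N/A row contributes 0, as in the Example 1 table.
function weighted(c: Criterion): number {
  return c.score === null ? 0 : (c.weightPct * c.score) / 10;
}

// Raw total before any override or adjustment is applied.
function rawTotal(rows: Criterion[]): number {
  return rows.reduce((sum, c) => sum + weighted(c), 0);
}

const MERGE_THRESHOLD = 70; // ">= 70 to merge, < 70 blocked"

function meetsThreshold(total: number): boolean {
  return total >= MERGE_THRESHOLD;
}

const example1: Criterion[] = [
  { label: "Tests pass", weightPct: 25, score: 10 },
  { label: "Spec traceability", weightPct: 20, score: 2 },
  { label: "No test weakening", weightPct: 15, score: 0 },
  { label: "Koen clean", weightPct: 15, score: 8 },
  { label: "Performance budget", weightPct: 10, score: null },
  { label: "Code churn", weightPct: 10, score: 5 },
  { label: "PR size", weightPct: 5, score: 8 },
];

const example1Raw = rawTotal(example1);
console.log(example1Raw, meetsThreshold(example1Raw)); // 50 false
```

The raw total is already below the threshold, so this PR would be blocked even without the hard-block override.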

WHY_BLOCKED:
- Two specific, meaningful assertions (daily limit, invalid currency) were DELETED
- Replaced with a single assertion that only checks the object exists — this is test theater
- Tests "pass" because the new test can never fail, not because the code is correct
- TEST_WEAKENING is a HARD BLOCK regardless of other scores — see rubric rule TW-1
- The daily limit test protects against real financial exposure
- Net effect: payment validation coverage went from 85% to 30%

WHAT_DEVELOPER_SHOULD_DO:
- Restore the deleted test assertions
- If the refactor changed the API shape, update the tests to match the new shape — do NOT delete them
- If the daily limit behavior was intentionally removed, that requires a spec change from Anna first

EVALUATOR_ACTION: BLOCK. Flag TEST_WEAKENING. Return specific deleted assertions in the review comment.
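
Finding the deleted assertions to quote in the review comment can be partly automated. The sketch below is a rough heuristic under stated assumptions, not the detector the gate actually uses: it only recognises lines that begin with `expect(`, and the list of existence-only matchers is a hypothetical starting point.

```typescript
// Heuristic sketch: collects removed and added `expect(` lines from a unified
// diff of a test file, and marks added assertions that only check existence.
interface AssertionDelta {
  removed: string[];
  added: string[];
  weakAdded: string[];
}

// Assumption: matchers that only prove a value exists. Extend as needed.
const WEAK_MATCHERS = ["toBeDefined", "toBeTruthy", "not.toBeNull"];

function assertionDelta(diff: string): AssertionDelta {
  const delta: AssertionDelta = { removed: [], added: [], weakAdded: [] };
  for (const line of diff.split("\n")) {
    // Skip file headers such as "--- a/file" and "+++ b/file".
    if (line.startsWith("---") || line.startsWith("+++")) continue;
    const body = line.slice(1).trim();
    if (!body.startsWith("expect(")) continue;
    if (line.startsWith("-")) {
      delta.removed.push(body);
    } else if (line.startsWith("+")) {
      delta.added.push(body);
      if (WEAK_MATCHERS.some((m) => body.includes("." + m + "("))) {
        delta.weakAdded.push(body);
      }
    }
  }
  return delta;
}

// Flags the diff when more assertions were removed than meaningful ones added.
function looksLikeTestWeakening(d: AssertionDelta): boolean {
  const strongAdded = d.added.length - d.weakAdded.length;
  return d.removed.length > strongAdded;
}

// The assertion lines from the payment.test.ts diff in this example.
const sampleDiff = [
  "-   expect(result.success).toBe(false);",
  "-   expect(result.error).toBe('DAILY_LIMIT_EXCEEDED');",
  "-   expect(result.success).toBe(false);",
  "+   expect(result).toBeDefined();",
].join("\n");

const sampleDelta = assertionDelta(sampleDiff);
console.log(sampleDelta.removed.length, sampleDelta.added.length); // 3 1
console.log(looksLikeTestWeakening(sampleDelta)); // true
```

A positive result is a prompt for the evaluator to read the diff, not a verdict on its own: a refactor that rewrites assertions to a new API shape removes and adds lines in roughly equal numbers and would not be flagged.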

EXAMPLE_2: PR THAT LOOKS BIG BUT IS SAFE — AUTO-GENERATED MIGRATION

RELEASE_READINESS_SCORE: 82/100

DECISION: PASS

PR_SUMMARY

Title: "Add client project settings table"
Files changed: 3
Lines added: 342, deleted: 0

Files:
1. drizzle/migrations/0015_add_project_settings.sql — 310 lines (auto-generated by drizzle-kit)
2. lib/db/schema.ts — 22 lines added (new table definition)
3. lib/db/schema.test.ts — 10 lines added (new table type test)

SCORE_BREAKDOWN

Criteria	Weight	Score	Weighted
Tests pass	25%	7/10	17.5
Spec traceability	20%	8/10	16
No test weakening	15%	10/10	15
Koen clean	15%	9/10	13.5
Performance budget	10%	N/A	10
Code churn	10%	10/10	10
PR size	5%	3/10	1.5
TOTAL			82 (adjusted from 83.5 — migration line count discounted)

WHY_PASSED_DESPITE_342_LINES:
- 310 of 342 lines are auto-generated SQL migration — not human-written logic
- Auto-generated migrations have a different risk profile: they are deterministic output of the schema definition
- The actual human-written code is 32 lines (schema + test)
- PR size score is low (3/10) but weight is only 5%, and migration lines are discounted in the adjusted calculation
- No test weakening, spec traceability is clear (project settings table matches spec), lint is clean
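
The split between generated and human-written lines can be computed from the file list. This is a small sketch assuming generated files are recognisable by path prefix alone; `GENERATED_PREFIXES` and the `ChangedFile` shape are hypothetical names for illustration.

```typescript
interface ChangedFile {
  path: string;
  linesAdded: number;
}

// Assumption: anything under these prefixes is tool-generated output.
const GENERATED_PREFIXES = ["drizzle/migrations/"];

function isGenerated(path: string): boolean {
  return GENERATED_PREFIXES.some((prefix) => path.startsWith(prefix));
}

// Sums added lines separately for generated and human-written files.
function splitLineCounts(files: ChangedFile[]): { generated: number; human: number } {
  let generated = 0;
  let human = 0;
  for (const f of files) {
    if (isGenerated(f.path)) generated += f.linesAdded;
    else human += f.linesAdded;
  }
  return { generated, human };
}

const example2Files: ChangedFile[] = [
  { path: "drizzle/migrations/0015_add_project_settings.sql", linesAdded: 310 },
  { path: "lib/db/schema.ts", linesAdded: 22 },
  { path: "lib/db/schema.test.ts", linesAdded: 10 },
];

const example2Split = splitLineCounts(example2Files);
console.log(example2Split.generated, example2Split.human); // 310 32
```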

WHAT_TO_VERIFY:
- Migration is reversible (has a DOWN migration or drizzle can generate one)
- No data-destructive operations (DROP, TRUNCATE, ALTER TYPE) in the migration
- New table has appropriate indexes for expected query patterns
- Schema test verifies the TypeScript types match the SQL columns
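
The data-destructive check lends itself to a quick keyword scan before a human reads the migration. The sketch below is deliberately naive and only an illustration: it splits on semicolons, strips `--` line comments, and matches the three keywords from the list above, so it can raise false positives (for example a column literally named `drop`) and does not replace reading the SQL.

```typescript
// Keywords taken from the checklist above; word boundaries avoid partial matches.
const DESTRUCTIVE_PATTERNS: RegExp[] = [/\bDROP\b/i, /\bTRUNCATE\b/i, /\bALTER\s+TYPE\b/i];

// Returns every statement in the migration that contains a destructive keyword.
function findDestructiveStatements(sql: string): string[] {
  // Strip "--" line comments so commented-out SQL is not flagged.
  const withoutComments = sql.replace(/--.*$/gm, "");
  return withoutComments
    .split(";") // naive split: does not handle semicolons inside string literals
    .map((statement) => statement.trim())
    .filter((statement) => statement.length > 0)
    .filter((statement) => DESTRUCTIVE_PATTERNS.some((re) => re.test(statement)));
}

// Hypothetical migration content, for illustration only.
const sampleMigration = [
  "CREATE TABLE project_settings (id serial PRIMARY KEY, project_id integer NOT NULL);",
  "-- DROP TABLE old_settings;",
  "CREATE INDEX project_settings_project_idx ON project_settings (project_id);",
].join("\n");

console.log(findDestructiveStatements(sampleMigration).length); // 0
console.log(findDestructiveStatements("ALTER TABLE users DROP COLUMN legacy_flag;").length); // 1
```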

EVALUATOR_ACTION: PASS with note: "342 lines, but 310 are auto-generated migration. Human-written changes are minimal and well-tested."

EXAMPLE_3: PR WITH CODE CHURN — SAME FILE CHANGED 3 TIMES IN 2 WEEKS

RELEASE_READINESS_SCORE: 58/100

DECISION: BLOCK

PR_SUMMARY

Title: "Fix cart total calculation (again)"
Files changed: 2
Lines added: 8, deleted: 5

Git log for lib/cart/total.ts:

2026-03-22 Fix cart total for multi-currency (#412)
2026-03-18 Fix cart total rounding (#398)
2026-03-10 Update cart total calculation (#385)

SCORE_BREAKDOWN

Criteria	Weight	Score	Weighted
Tests pass	25%	8/10	20
Spec traceability	20%	6/10	12
No test weakening	15%	10/10	15
Koen clean	15%	7/10	10.5
Performance budget	10%	N/A	0
Code churn	10%	1/10	1
PR size	5%	10/10	5
TOTAL			58 (adjusted from 63.5 — churn pattern penalty)

WHY_BLOCKED:
- Same file has already been "fixed" 3 times in 14 days (#385, #398, #412); this PR would be the fourth change
- This pattern indicates the root cause is not understood
- Each fix addresses a symptom (rounding, multi-currency) but the underlying calculation model may be wrong
- The current PR adds only 8 lines and is fine in isolation, but shipping another band-aid fix accumulates tech debt and risk
- If this file breaks again in production, client trust is damaged

CHURN_THRESHOLD:
- 2 changes in 14 days to the same logic file: WARNING (note in review, don't block)
- 3+ changes in 14 days to the same logic file: BLOCK for architectural review
- Exception: test files, config files, and auto-generated files are excluded from churn tracking
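
The thresholds above translate into a small classifier over a file's change dates. This is a sketch under assumptions: the exclusion patterns are hypothetical, the 14-day window is treated as inclusive, and the review date used in the usage line is this page's UPDATED date standing in for a real one.

```typescript
type ChurnVerdict = "OK" | "WARNING" | "BLOCK";

const WINDOW_DAYS = 14;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumption: excluded files are recognised by simple path patterns.
function isExcludedFromChurn(path: string): boolean {
  return (
    /\.test\.[jt]sx?$/.test(path) ||      // test files
    /\.(json|ya?ml|toml)$/.test(path) ||  // config files
    path.startsWith("drizzle/migrations/") // auto-generated files
  );
}

// changeDates and asOf are ISO dates (YYYY-MM-DD), which Date.parse reads as UTC.
function churnVerdict(path: string, changeDates: string[], asOf: string): ChurnVerdict {
  if (isExcludedFromChurn(path)) return "OK";
  const end = Date.parse(asOf);
  const recent = changeDates.filter((d) => {
    const ageDays = (end - Date.parse(d)) / MS_PER_DAY;
    return ageDays >= 0 && ageDays <= WINDOW_DAYS;
  });
  if (recent.length >= 3) return "BLOCK";
  if (recent.length === 2) return "WARNING";
  return "OK";
}

// Dates from the git log for lib/cart/total.ts in this example.
const cartTotalHistory = ["2026-03-22", "2026-03-18", "2026-03-10"];
console.log(churnVerdict("lib/cart/total.ts", cartTotalHistory, "2026-03-24")); // BLOCK
```

Whether the PR under review counts toward the total is a policy choice this sketch leaves to the caller: pass only merged changes, or append the current date.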

WHAT_DEVELOPER_SHOULD_DO:
- Step back and review the full calculation model, not just this fix
- Write a comprehensive test suite for cart total: multi-currency, rounding, discounts, tax, shipping
- Consider whether the cart total needs a dedicated, well-tested pure function with no side effects
- Submit the fix alongside the comprehensive tests in a single PR

EVALUATOR_ACTION: BLOCK. Flag CODE_CHURN. Recommend architectural review of cart total calculation before merging another patch.

EXAMPLE_4: EMERGENCY BREAK-GLASS PR — PRODUCTION FIX

RELEASE_READINESS_SCORE: 62/100 (BELOW THRESHOLD)

DECISION: PASS (BREAK-GLASS OVERRIDE)

PR_SUMMARY

Title: "[HOTFIX] Fix null pointer in user authentication"
Files changed: 1
Lines added: 3, deleted: 1
Labels: hotfix, production-incident

  const user = await db.users.findById(session.userId);
- const permissions = user.roles.flatMap(r => r.permissions);
+ const permissions = user?.roles?.flatMap(r => r.permissions) ?? [];

Incident: Users with deleted accounts can still have active sessions. When they hit any authenticated endpoint, the app crashes because user is null.

SCORE_BREAKDOWN

Criteria	Weight	Score	Weighted
Tests pass	25%	5/10	12.5
Spec traceability	20%	3/10	6
No test weakening	15%	10/10	15
Koen clean	15%	10/10	15
Performance budget	10%	N/A	0
Code churn	10%	8/10	8
PR size	5%	10/10	5
TOTAL			62 (rounded from 61.5)

WHY_PASS_DESPITE_BELOW_THRESHOLD:
- BREAK-GLASS conditions are met:
1. Production incident is active (users are affected NOW)
2. Fix is minimal and surgical (3 lines, single file)
3. Fix is obviously correct (null-safe access on a nullable value)
4. Risk of NOT merging exceeds risk of merging
5. PR has hotfix label applied by authorized team member

BREAK_GLASS_RULES:
- Score is still calculated and recorded — it does not disappear
- A follow-up PR with proper tests MUST be filed within 48 hours
- The follow-up PR is tracked: if not merged within 48 hours, escalate
- Break-glass can only be invoked for production incidents, not for deadline pressure
- Break-glass PRs are limited to 20 lines changed — larger changes require normal review
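
The mechanical parts of these rules (label, line cap, 48-hour clock) can be checked in code; the judgment calls (is the fix obviously correct, was the label applied by an authorized person) stay with the evaluator. A minimal sketch, assuming "lines changed" means added plus deleted and using hypothetical names throughout:

```typescript
interface HotfixPr {
  labels: string[];
  linesAdded: number;
  linesDeleted: number;
  activeIncident: boolean; // whether a production incident is open right now
}

const BREAK_GLASS_MAX_LINES = 20;
const FOLLOW_UP_HOURS = 48;

// Checks only the mechanical conditions: incident, hotfix label, line cap.
function breakGlassEligible(pr: HotfixPr): boolean {
  // Assumption: "lines changed" counts additions plus deletions.
  const linesChanged = pr.linesAdded + pr.linesDeleted;
  return pr.activeIncident && pr.labels.includes("hotfix") && linesChanged <= BREAK_GLASS_MAX_LINES;
}

// Deadline for the follow-up PR, measured from the hotfix merge time.
function followUpDeadline(mergedAt: Date): Date {
  return new Date(mergedAt.getTime() + FOLLOW_UP_HOURS * 60 * 60 * 1000);
}

const example4: HotfixPr = {
  labels: ["hotfix", "production-incident"],
  linesAdded: 3,
  linesDeleted: 1,
  activeIncident: true,
};

console.log(breakGlassEligible(example4)); // true
// Hypothetical merge time, for illustration only.
console.log(followUpDeadline(new Date("2026-03-24T10:00:00Z")).toISOString()); // 2026-03-26T10:00:00.000Z
```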

WHAT_MUST_HAPPEN_NEXT:
- File follow-up ticket: "Add test for deleted-user-with-active-session path"
- The follow-up test should cover: user deleted mid-session, user suspended mid-session, user roles changed mid-session
- Consider: should session invalidation happen on user deletion? (spec question for Anna)

EVALUATOR_ACTION: PASS with BREAK_GLASS flag. Log the score. Create 48-hour follow-up tracker. Note specific tests required in follow-up.

EXAMPLE_5: RELEASE READINESS — PASS VS BLOCK COMPARISON

SCENARIO_A: SCORE 85 — PASS

PR: "Implement project dashboard with analytics"
Files: 12 changed, +480/-20

Criteria	Weight	Score	Rationale	Weighted
Tests pass	25%	9/10	All 47 tests pass, 2 flaky tests marked skip with ticket	22.5
Spec traceability	20%	8/10	11 of 12 spec items have corresponding tests, 1 deferred (tooltip wording)	16
No test weakening	15%	10/10	No deleted or weakened assertions	15
Koen clean	15%	9/10	1 warning (unused import), 0 errors	13.5
Performance budget	10%	7/10	LCP 1.8s (budget: 2.0s), CLS 0.05 (budget: 0.1)	7
Code churn	10%	8/10	New feature, no repeat changes to existing files	8
PR size	5%	6/10	480 lines is moderate, but it's a complete feature	3
TOTAL				85

VERDICT: PASS. Well-tested new feature. Minor gaps (tooltip, 2 flaky tests) are documented and tracked. Performance is within budget.

SCENARIO_B: SCORE 49 — BLOCK

PR: "Implement project dashboard with analytics" (same feature, different execution)
Files: 18 changed, +680/-120

Criteria	Weight	Score	Rationale	Weighted
Tests pass	25%	6/10	38 of 44 tests pass, 6 failures marked as "known issues"	15
Spec traceability	20%	4/10	Only 5 of 12 spec items have tests, error paths untested	8
No test weakening	15%	5/10	3 previously passing assertions changed to `.skip`	7.5
Koen clean	15%	6/10	4 warnings, 1 error (any type usage)	9
Performance budget	10%	3/10	LCP 2.8s (budget: 2.0s), bundle added 180KB	3
Code churn	10%	4/10	Modified 4 shared utility files with unclear scope	4
PR size	5%	4/10	680 lines with 120 deletions suggests significant refactoring mixed with feature	2
TOTAL				49 (rounded from 48.5)

VERDICT: BLOCK. Multiple issues:
- Dismissing 6 test failures as "known issues" is a red flag: tests should pass or be fixed
- Spec coverage of only 42% (5 of 12 items) is below minimum viable
- 3 skipped assertions are mild test weakening: those tests passed before and no longer run
- Performance regression: 40% over LCP budget
- Mixed refactoring + feature in one PR makes review difficult
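
The percentages quoted in this verdict come from two simple ratios. A short sketch of the arithmetic, with hypothetical helper names and rounding to the nearest whole percent:

```typescript
// Share of `whole` covered by `part`, as a rounded percentage.
function percentOf(part: number, whole: number): number {
  return Math.round((part / whole) * 100);
}

// How far a measurement exceeds its budget, as a rounded percentage.
// A negative result means the measurement is under budget.
function percentOverBudget(measured: number, budget: number): number {
  return Math.round(((measured - budget) / budget) * 100);
}

console.log(percentOf(5, 12));            // 42  (spec items with tests)
console.log(percentOverBudget(2.8, 2.0)); // 40  (Scenario B LCP vs budget)
console.log(percentOverBudget(1.8, 2.0)); // -10 (Scenario A LCP vs budget)
```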

WHAT_DEVELOPER_SHOULD_DO:
- Split into two PRs: one for the refactoring, one for the feature
- Fix or properly address the 6 failing tests
- Restore the 3 skipped assertions or explain why the behavior changed
- Optimize LCP — 180KB bundle addition needs investigation (lazy loading? tree shaking?)
- Add tests for the 7 missing spec items

OVERRIDE_RULES

HARD_BLOCKS (override score to BLOCK regardless of total):
- TW-1: Any previously passing assertion deleted in a non-hotfix PR
- TW-2: Test file deleted without replacement
- SEC-1: Security-relevant test coverage decreased
- DATA-1: Migration contains DROP or TRUNCATE without explicit approval

HARD_PASSES (only via break-glass):
- BG-1: Active production incident + minimal fix + hotfix label
- BG-2: Maximum 20 lines changed
- BG-3: 48-hour follow-up PR requirement
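
Taken together, the overrides and the threshold form a short decision procedure. The sketch below shows one possible ordering and is not the rubric's definitive logic: it assumes a hard block takes precedence over break-glass, which this page does not state explicitly, and the type and field names are hypothetical.

```typescript
type GateDecision = "PASS" | "BLOCK" | "PASS_BREAK_GLASS";

interface GateInput {
  score: number;        // 0-100 release readiness score
  hardBlocks: string[]; // IDs of hard-block rules that fired, e.g. ["TW-1"]
  breakGlass: boolean;  // true only when the break-glass conditions hold
}

const GATE_THRESHOLD = 70;

function gateDecision(input: GateInput): GateDecision {
  // Assumption: any fired hard block wins, even over break-glass.
  if (input.hardBlocks.length > 0) return "BLOCK";
  if (input.score >= GATE_THRESHOLD) return "PASS";
  // Below threshold: only break-glass can still pass the PR.
  return input.breakGlass ? "PASS_BREAK_GLASS" : "BLOCK";
}

console.log(gateDecision({ score: 35, hardBlocks: ["TW-1"], breakGlass: false })); // BLOCK
console.log(gateDecision({ score: 82, hardBlocks: [], breakGlass: false }));       // PASS
console.log(gateDecision({ score: 58, hardBlocks: [], breakGlass: false }));       // BLOCK
console.log(gateDecision({ score: 62, hardBlocks: [], breakGlass: true }));        // PASS_BREAK_GLASS
```

The four calls correspond to Examples 1 to 4. The score is still passed in for break-glass PRs so that it can be logged, as the break-glass rules require.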

SCORE_ADJUSTMENTS:
- Auto-generated files: line count excluded from PR size calculation
- Migration files: scored separately (schema correctness, reversibility)
- Test-only PRs: spec traceability weight increases to 30%, PR size weight drops to 0%
- Config-only PRs: performance budget and test pass weights swap (config rarely needs tests)
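
The two weight adjustments can be expressed as a function from PR kind to a weight table. This is a sketch with hypothetical names; the base weights are the ones used in the score tables on this page. The test-only rule does not say whether the remaining weights are rescaled after the change, so the sketch applies the two stated values literally and leaves any renormalisation out.

```typescript
type PrKind = "standard" | "test-only" | "config-only";

interface Weights {
  testsPass: number;
  specTraceability: number;
  noTestWeakening: number;
  koenClean: number;
  performanceBudget: number;
  codeChurn: number;
  prSize: number;
}

// Base weights in percent, as used in the score breakdown tables.
const BASE_WEIGHTS: Weights = {
  testsPass: 25,
  specTraceability: 20,
  noTestWeakening: 15,
  koenClean: 15,
  performanceBudget: 10,
  codeChurn: 10,
  prSize: 5,
};

function adjustedWeights(kind: PrKind): Weights {
  const w: Weights = { ...BASE_WEIGHTS };
  if (kind === "test-only") {
    // Applied literally; no rescaling of the other weights (assumption).
    w.specTraceability = 30;
    w.prSize = 0;
  } else if (kind === "config-only") {
    // Swap the performance budget and test pass weights.
    const testsPass = w.testsPass;
    w.testsPass = w.performanceBudget;
    w.performanceBudget = testsPass;
  }
  return w;
}

const configWeights = adjustedWeights("config-only");
console.log(configWeights.testsPass, configWeights.performanceBudget); // 10 25
```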