Pipeline Calibration¶
How the 10-stage pipeline is tuned, measured, and kept effective. A pipeline that catches nothing is useless. A pipeline that blocks everything is also useless. Calibration keeps each stage in the productive zone.
The Calibration Problem¶
Every quality gate faces the same tension:
- Too strict: Clean code is blocked. Agents waste cycles fixing false positives. Time-to-merge increases. Throughput drops.
- Too loose: Defective code passes through. Production incidents increase. Trust in the pipeline erodes. Stages become rubber stamps.
The goal is not zero false positives or zero defect escapes. The goal is a balance point where the pipeline catches real problems without creating unnecessary friction.
Scoring Rubrics¶
Each stage uses a scoring rubric that defines what constitutes a pass, a warning, and a failure. Rubrics are explicit — not left to agent judgment.
Rubric structure¶
Every rubric defines:
| Component | Description |
|---|---|
| Check | What is being evaluated |
| Pass | Specific criteria for passing |
| Warn | Criteria that flag but do not block |
| Fail | Criteria that block progression |
| Override | Who can override a failure (if anyone) |
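The rubric structure above can be sketched as data plus two predicates. This is an illustrative model only, not the pipeline's actual schema: the names `RubricCheck` and `evaluate`, and the 500 kB bundle limit, are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RubricCheck:
    check: str                      # what is being evaluated
    passes: Callable[[dict], bool]  # specific criteria for passing
    warns: Callable[[dict], bool]   # criteria that flag but do not block
    override: Optional[str] = None  # who can override a failure, if anyone

def evaluate(rubric: RubricCheck, result: dict) -> str:
    """Map a raw tool result onto pass / warn / fail."""
    if rubric.passes(result):
        return "pass"
    if rubric.warns(result):
        return "warn"
    return "fail"

# Illustrative check: bundle size with an assumed 500 kB limit,
# warning when the bundle is within 10% below the limit.
bundle_size = RubricCheck(
    check="Bundle size",
    passes=lambda r: r["bytes"] < r["limit"] * 0.90,
    warns=lambda r: r["bytes"] <= r["limit"],
    override="project lead",
)
```

The warn band sits just under the limit so a change is flagged before it fails, rather than after.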
Example: Koen (Stage 4) deterministic quality rubric¶
| Check | Pass | Warn | Fail |
|---|---|---|---|
| TypeScript strict | Zero errors | N/A | Any error |
| ESLint | Zero errors, zero warnings | N/A | Any error or warning |
| Bundle size | More than 10% under limit | Within 10% of limit (still under) | Over limit |
| Dependencies | No critical/high vulns | Medium vulns present | Critical or high vuln |
| Circular imports | None detected | N/A | Any circular import |
| Dead code | None detected | Unused type exports | Unused function exports |
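A stage like this folds its individual check results into one gate verdict. A minimal sketch of that folding, with an assumed precedence (any fail blocks; otherwise any warn surfaces; otherwise the stage passes) and an assumed function name:

```python
def stage_verdict(outcomes: list) -> str:
    """Combine per-check results: fail dominates, then warn, then pass."""
    if "fail" in outcomes:
        return "fail"
    if "warn" in outcomes:
        return "warn"
    return "pass"
```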
Example: Ashley (Stage 8) adversarial testing rubric¶
| Check | Pass | Warn | Fail |
|---|---|---|---|
| SQL injection | All payloads rejected | N/A | Any payload executed |
| XSS | All payloads sanitized | Reflected but encoded | Stored or executed |
| Auth bypass | All attempts rejected | Session fixation possible | Any bypass achieved |
| IDOR | All cross-user access denied | Enumeration possible | Data accessed |
| Rate limiting | Enforced on all endpoints | Missing on non-critical | Missing on critical |
| Input validation | All fuzzing rejected cleanly | Unexpected 500 on edge | Crash or hang |
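The input-validation row can be read as a grading function over endpoint responses. A hedged sketch, assuming HTTP semantics (`classify_fuzz_result` and its parameters are illustrative, not Ashley's actual harness):

```python
from typing import Optional

def classify_fuzz_result(status: Optional[int], timed_out: bool) -> str:
    """Grade one endpoint response to a fuzz payload."""
    if timed_out or status is None:
        return "fail"   # crash or hang
    if status >= 500:
        return "warn"   # unexpected 500 on an edge case
    if 400 <= status < 500:
        return "pass"   # rejected cleanly
    return "fail"       # 2xx/3xx: the payload was accepted
```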
Calibration Examples¶
Each stage maintains a set of calibration examples — code samples with known correct outcomes. These serve two purposes:
- Training: When a new agent takes over a stage, calibration examples show exactly what pass/warn/fail looks like
- Regression: If a stage's behavior changes, running calibration examples detects the drift
Calibration example format¶
```yaml
id: CAL-K04-017
stage: 4
agent: koen
description: "Module with unused export that is re-exported by index"
code_sample: "examples/calibration/k04-017/"
expected_outcome: pass
rationale: >
  The export is unused in this module but re-exported by the
  barrel file. Dead code detection should follow re-export chains.
last_validated: 2026-03-15
```
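The regression use of calibration examples amounts to replaying each sample through its stage and diffing against the recorded outcome. A sketch under assumed names (`CalibrationExample`, `detect_drift`, and the injected `run_stage` callable are all illustrative):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CalibrationExample:
    id: str
    stage: int
    code_sample: str       # path to the sample
    expected_outcome: str  # "pass" | "warn" | "fail"

def detect_drift(examples: List[CalibrationExample],
                 run_stage: Callable[[int, str], str]) -> List[str]:
    """Return the ids of examples whose actual outcome no longer
    matches the recorded expected_outcome."""
    return [ex.id for ex in examples
            if run_stage(ex.stage, ex.code_sample) != ex.expected_outcome]
```

A non-empty return value is the drift signal: either the stage regressed or the example needs revalidation.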
Calibration set size per stage¶
| Stage | Agent | Calibration examples |
|---|---|---|
| 1 | Anna | 20 spec samples |
| 2 | Antje | 35 test/spec pairs |
| 3 | Dev agents | N/A (not a gate) |
| 4 | Koen | 50 code samples |
| 5 | Marije/Judith | 30 integration scenarios |
| 6 | Jasper | 25 reconciliation cases |
| 7 | Marco | 20 conflict scenarios |
| 8 | Ashley | 40 attack patterns |
| 9 | Jaap | 30 SSOT cases |
| 10 | Marta/Iwona | 15 merge decisions |
False Positive Management¶
False positives are the silent killer of quality pipelines. When agents learn that a stage frequently blocks clean code, they lose trust in the stage. They start treating failures as "probably false positives" and look for workarounds instead of fixes.
Detection¶
False positives are detected when:
- A stage blocks code that subsequently merges without changes (the block was overridden)
- A stage flags an issue that no downstream stage confirms
- A stage's warn/fail ratio exceeds 3:1 (warnings outnumber actionable failures three to one)
Response¶
| False positive rate | Action |
|---|---|
| < 5% | Healthy — no action needed |
| 5-10% | Review flagged cases, adjust rubric thresholds |
| 10-20% | Recalibrate stage, update calibration examples |
| > 20% | Stage is degraded — escalate to Joshua for audit |
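The response table maps directly onto a threshold function. A sketch (function name assumed; boundaries follow the table, with rates as fractions):

```python
def false_positive_action(fp_rate: float) -> str:
    """Map a stage's false-positive rate onto the response table."""
    if fp_rate < 0.05:
        return "healthy"
    if fp_rate < 0.10:
        return "review flagged cases, adjust rubric thresholds"
    if fp_rate <= 0.20:
        return "recalibrate stage, update calibration examples"
    return "escalate to Joshua for audit"
```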
Tracking¶
Every false positive is logged with:
- Which stage flagged it
- What was flagged
- Why it was a false positive
- What rubric change would prevent recurrence
This log feeds into quarterly calibration reviews.
Complexity-Based Routing¶
Not every change needs all 10 stages. Routing rules determine which stages a change passes through based on its complexity.
Complexity classification¶
| Signal | Weight | Rationale |
|---|---|---|
| Files changed | +1 per file | More files = more integration risk |
| Lines changed | +1 per 100 lines | Larger changes = more hiding places |
| Auth files touched | +5 | Security-critical |
| DB migration included | +5 | Data-critical |
| New dependency added | +3 | Supply chain risk |
| Config files touched | +3 | System-wide impact |
| API spec changed | +3 | Contract change |
| Test files only | -5 | Low risk by nature |
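The weights above can be applied as a straightforward sum. A sketch assuming a flat `dict` per change (the key names here are illustrative, not the orchestrator's real schema):

```python
def complexity_score(change: dict) -> int:
    """Apply the signal weights from the classification table."""
    score = change.get("files_changed", 0)          # +1 per file
    score += change.get("lines_changed", 0) // 100  # +1 per 100 lines
    if change.get("touches_auth"):     score += 5   # security-critical
    if change.get("db_migration"):     score += 5   # data-critical
    if change.get("new_dependency"):   score += 3   # supply chain risk
    if change.get("touches_config"):   score += 3   # system-wide impact
    if change.get("api_spec_changed"): score += 3   # contract change
    if change.get("test_files_only"):  score -= 5   # low risk by nature
    return score
```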
Routing rules¶
| Score | Complexity | Pipeline |
|---|---|---|
| 0-3 | Trivial | Stage 4 (Koen) → Stage 10 (Marta/Iwona) |
| 4-8 | Simple | Stages 2, 3, 4, 10 |
| 9-15 | Standard | All 10 stages |
| 16+ | Critical | All 10 stages + human review |
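The routing table itself reduces to a banded lookup. A sketch (function name assumed; bands follow the table):

```python
def route(score: int) -> str:
    """Select a pipeline tier per the routing table."""
    if score <= 3:
        return "trivial"   # Stage 4 (Koen) -> Stage 10 (Marta/Iwona)
    if score <= 8:
        return "simple"    # Stages 2, 3, 4, 10
    if score <= 15:
        return "standard"  # all 10 stages
    return "critical"      # all 10 stages + human review
```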
Override¶
Project leads can escalate complexity upward (force more stages) but never downward (skip stages). Only human approval can skip a stage on a specific change.
Joshua's Quarterly Audit¶
Joshua (Chief Innovation Officer) audits the pipeline every quarter. The audit reviews:
Stage effectiveness¶
For each stage:
- Defect escape rate: What percentage of defects that this stage should catch actually pass through undetected?
- Unique catches: How many defects does this stage catch that no other stage catches? If the answer is zero for two consecutive quarters, the stage is a candidate for removal.
- Time cost: How much time does this stage add to time-to-merge?
- Token cost: How many LLM tokens does this stage consume?
Pipeline-level metrics¶
| Metric | Target | Red line |
|---|---|---|
| Defect escape to production | < 2% | > 5% |
| Time-to-merge (standard) | < 4 hours | > 8 hours |
| False positive rate (overall) | < 10% | > 20% |
| Pipeline completion rate | > 95% | < 90% |
| Stage correlation | < 0.3 | > 0.7 |
Stage correlation¶
If two stages always agree (correlation > 0.7), one of them is redundant. The audit identifies which stage provides unique value and flags the other for potential removal or recalibration.
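One way to measure agreement between two stages is the phi coefficient over their matched block/pass verdicts on the same set of changes. This is a sketch of that statistic, not necessarily the audit's actual method:

```python
def stage_correlation(a, b) -> float:
    """Phi coefficient between two stages' verdicts on the same
    changes (True = blocked). 1.0 means the stages always agree."""
    n11 = sum(x and y for x, y in zip(a, b))
    n00 = sum(not x and not y for x, y in zip(a, b))
    n10 = sum(x and not y for x, y in zip(a, b))
    n01 = sum(not x and y for x, y in zip(a, b))
    denom = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```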
Pruning non-load-bearing stages¶
A stage is "non-load-bearing" when all of the following hold:
- Its unique catch rate is < 1% for two consecutive quarters
- Its false positive rate is > 15%
- Removing it in a shadow run does not increase defect escapes
Non-load-bearing stages are not automatically removed. Joshua recommends removal, the team discusses, and human approval is required. The stage may be recalibrated instead of removed if the underlying failure class is still relevant.
Calibration Drift¶
Calibration drift happens when the pipeline's effectiveness changes over time without anyone noticing. Common causes:
Codebase evolution¶
The codebase changes. New frameworks, new patterns, new conventions. A rubric calibrated for Express.js does not work for Hono. A naming convention check calibrated for REST does not work for WebSocket handlers.
Prevention: Recalibrate rubrics when the tech stack changes. Run calibration examples after framework upgrades.
Agent model updates¶
LLM providers update their models. A model update can change the quality characteristics of generated code. Better models produce fewer obvious errors but potentially more subtle ones.
Prevention: Track defect escape rates per model version. When an agent's model changes, re-run calibration examples.
Survivorship bias¶
Over time, easy-to-catch defects become rare (agents learn to avoid them). The pipeline looks effective because the raw catch rate stays high, but the defects it catches are trivial. Serious defects — the ones that actually matter — escape because the pipeline was never calibrated for them.
Prevention: Joshua's quarterly audit includes adversarial injection — known-defective code is inserted into the pipeline to verify that stages still catch it.
Metrics Collection¶
Per-stage metrics (collected automatically)¶
| Metric | Collection method |
|---|---|
| Defects caught | Count of fail results per stage |
| False positives | Count of overridden failures |
| Time spent | Duration from stage start to stage completion |
| Token cost | LLM tokens consumed (for LLM-based stages) |
| Defect severity | Classification of caught defects |
Pipeline-level metrics (computed weekly)¶
| Metric | Computation |
|---|---|
| Defect escape rate | Production incidents / total changes merged |
| Time-to-merge | Median time from PR open to merge |
| Pipeline throughput | Changes merged per day |
| Stage bottleneck | Stage with highest median duration |
| Cost per change | Total pipeline cost / changes merged |
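The weekly computations above can be sketched as one rollup over the merged changes. The per-change key names (`caused_incident`, `hours_to_merge`, `pipeline_cost`) are assumptions for illustration:

```python
import statistics

def weekly_metrics(merged_changes: list) -> dict:
    """Compute the weekly pipeline-level rollup from merged changes."""
    n = len(merged_changes)
    return {
        "defect_escape_rate":
            sum(c["caused_incident"] for c in merged_changes) / n,
        "time_to_merge_hours":
            statistics.median(c["hours_to_merge"] for c in merged_changes),
        "throughput_per_day": n / 7,
        "cost_per_change":
            sum(c["pipeline_cost"] for c in merged_changes) / n,
    }
```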
Reporting¶
- Weekly metrics summary posted to the quality channel
- Monthly trend report reviewed by project leads
- Quarterly deep dive by Joshua (see audit section above)
Calibration Ownership¶
| Responsibility | Owner |
|---|---|
| Stage 1-2 rubrics | Antje (Test Agent) |
| Stage 4 rubrics | Koen (Code Quality) |
| Stage 5 rubrics | Marije / Judith (Testing Leads) |
| Stage 6-7 rubrics | Jasper / Marco |
| Stage 8 rubrics | Ashley (Adversarial) |
| Stage 9 rubrics | Jaap (SSOT) |
| Stage 10 rubrics | Marta / Iwona (Merge Gate) |
| Quarterly audit | Joshua (CIO) |
| Complexity routing | Orchestrator (automated) |
| Metric collection | Pipeline infrastructure (automated) |