
Pipeline Calibration

How the 10-stage pipeline is tuned, measured, and kept effective. A pipeline that catches nothing is useless. A pipeline that blocks everything is also useless. Calibration keeps each stage in the productive zone.


The Calibration Problem

Every quality gate faces the same tension:

  • Too strict: Clean code is blocked. Agents waste cycles fixing false positives. Time-to-merge increases. Throughput drops.
  • Too loose: Defective code passes through. Production incidents increase. Trust in the pipeline erodes. Stages become rubber stamps.

The goal is not zero false positives or zero defect escapes. The goal is an optimal ratio where the pipeline catches real problems without creating unnecessary friction.


Scoring Rubrics

Each stage uses a scoring rubric that defines what constitutes a pass, a warning, and a failure. Rubrics are explicit — not left to agent judgment.

Rubric structure

Every rubric defines:

| Component | Description |
|---|---|
| Check | What is being evaluated |
| Pass | Specific criteria for passing |
| Warn | Criteria that flag but do not block |
| Fail | Criteria that block progression |
| Override | Who can override a failure (if anyone) |

Example: Koen (Stage 4) deterministic quality rubric

| Check | Pass | Warn | Fail |
|---|---|---|---|
| TypeScript strict | Zero errors | N/A | Any error |
| ESLint | Zero errors, zero warnings | N/A | Any error or warning |
| Bundle size | Under limit | Within 10% of limit | Over limit |
| Dependencies | No critical/high vulns | Medium vulns present | Critical or high vuln |
| Circular imports | None detected | N/A | Any circular import |
| Dead code | None detected | Unused type exports | Unused function exports |
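Deterministic rubric checks like these reduce to plain threshold predicates. A minimal sketch of two of them, assuming size is measured against a configured limit (the function names and the reading of "within 10% of limit" as the band just below the limit are illustrative assumptions):

```python
# Minimal sketch of deterministic rubric checks: each check maps a
# measurement to "pass", "warn", or "fail". Thresholds are illustrative.

def bundle_size_verdict(size_kb: float, limit_kb: float) -> str:
    """Pass under the limit, warn within 10% of the limit, fail over it."""
    if size_kb > limit_kb:
        return "fail"
    if size_kb >= limit_kb * 0.9:  # within 10% of the limit
        return "warn"
    return "pass"

def eslint_verdict(errors: int, warnings: int) -> str:
    """Zero errors and zero warnings pass; this check has no warn tier."""
    return "pass" if errors == 0 and warnings == 0 else "fail"
```

For example, a 950 KB bundle against a 1000 KB limit lands in the warn band, while one warning from ESLint is an outright fail.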

Example: Ashley (Stage 8) adversarial testing rubric

| Check | Pass | Warn | Fail |
|---|---|---|---|
| SQL injection | All payloads rejected | N/A | Any payload executed |
| XSS | All payloads sanitized | Reflected but encoded | Stored or executed |
| Auth bypass | All attempts rejected | Session fixation possible | Any bypass achieved |
| IDOR | All cross-user access denied | Enumeration possible | Data accessed |
| Rate limiting | Enforced on all endpoints | Missing on non-critical | Missing on critical |
| Input validation | All fuzzing rejected cleanly | Unexpected 500 on edge | Crash or hang |

Calibration Examples

Each stage maintains a set of calibration examples — code samples with known correct outcomes. These serve two purposes:

  1. Training: When a new agent takes over a stage, calibration examples show exactly what pass/warn/fail looks like
  2. Regression: If a stage's behavior changes, running calibration examples detects the drift

Calibration example format

```yaml
id: CAL-K04-017
stage: 4
agent: koen
description: "Module with unused export that is re-exported by index"
code_sample: "examples/calibration/k04-017/"
expected_outcome: pass
rationale: >
  The export is unused in this module but re-exported by the
  barrel file. Dead code detection should follow re-export chains.
last_validated: 2026-03-15
```
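Running the calibration set doubles as the regression check: each example's actual outcome is compared to its recorded expected outcome, and mismatches indicate drift. A sketch, assuming a `run_stage` callable that evaluates a code sample and returns "pass", "warn", or "fail" (the callable and the dict shape mirror the YAML format above but are otherwise hypothetical):

```python
from typing import Callable

# Each calibration example pairs a code sample with its known-correct
# outcome, as in the YAML format above.
Example = dict  # keys: "id", "code_sample", "expected_outcome"

def check_calibration(examples: list[Example],
                      run_stage: Callable[[str], str]) -> list[str]:
    """Return the ids of examples whose actual outcome drifted from
    the recorded expected outcome."""
    drifted = []
    for ex in examples:
        actual = run_stage(ex["code_sample"])
        if actual != ex["expected_outcome"]:
            drifted.append(ex["id"])
    return drifted
```

An empty result means the stage still behaves as calibrated; any returned ids are either genuine drift or examples whose expected outcome needs re-validation.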

Calibration set size per stage

| Stage | Agent | Calibration examples |
|---|---|---|
| 1 | Anna | 20 spec samples |
| 2 | Antje | 35 test/spec pairs |
| 3 | Dev agents | N/A (not a gate) |
| 4 | Koen | 50 code samples |
| 5 | Marije/Judith | 30 integration scenarios |
| 6 | Jasper | 25 reconciliation cases |
| 7 | Marco | 20 conflict scenarios |
| 8 | Ashley | 40 attack patterns |
| 9 | Jaap | 30 SSOT cases |
| 10 | Marta/Iwona | 15 merge decisions |

False Positive Management

False positives are the silent killer of quality pipelines. When agents learn that a stage frequently blocks clean code, they lose trust in the stage. They start treating failures as "probably false positives" and look for workarounds instead of fixes.

Detection

False positives are detected when:

  • A stage blocks code that subsequently merges without changes (the block was overridden)
  • A stage flags an issue that no downstream stage confirms
  • A stage's warn/fail ratio exceeds 3:1 (more warnings than actionable findings)

Response

| False positive rate | Action |
|---|---|
| < 5% | Healthy — no action needed |
| 5-10% | Review flagged cases, adjust rubric thresholds |
| 10-20% | Recalibrate stage, update calibration examples |
| > 20% | Stage is degraded — escalate to Joshua for audit |
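The response tiers can be encoded as a simple lookup. A sketch, assuming rates are tracked as fractions and that the 10% boundary belongs to the recalibration tier (that boundary choice is an assumption, since the table's ranges overlap at 10%):

```python
def false_positive_action(rate: float) -> str:
    """Map an observed false positive rate (0.0-1.0) to the response
    tier from the table above."""
    if rate < 0.05:
        return "healthy: no action needed"
    if rate < 0.10:
        return "review flagged cases, adjust rubric thresholds"
    if rate <= 0.20:
        return "recalibrate stage, update calibration examples"
    return "escalate to Joshua for audit"
```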

Tracking

Every false positive is logged with:

  • Which stage flagged it
  • What was flagged
  • Why it was a false positive
  • What rubric change would prevent recurrence

This log feeds into quarterly calibration reviews.


Complexity-Based Routing

Not every change needs all 10 stages. Routing rules determine which stages a change passes through based on its complexity.

Complexity classification

| Signal | Weight | Rationale |
|---|---|---|
| Files changed | +1 per file | More files = more integration risk |
| Lines changed | +1 per 100 lines | Larger changes = more hiding places |
| Auth files touched | +5 | Security-critical |
| DB migration included | +5 | Data-critical |
| New dependency added | +3 | Supply chain risk |
| Config files touched | +3 | System-wide impact |
| API spec changed | +3 | Contract change |
| Test files only | -5 | Low risk by nature |

Routing rules

| Score | Complexity | Pipeline |
|---|---|---|
| 0-3 | Trivial | Stage 4 (Koen) → Stage 10 (Marta/Iwona) |
| 4-8 | Simple | Stages 2, 3, 4, 10 |
| 9-15 | Standard | All 10 stages |
| 16+ | Critical | All 10 stages + human review |
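Putting the two tables together, classification and routing can be sketched as a pure function of the change's signals. Rounding the per-100-lines weight down to whole hundreds is an assumption; the function and parameter names are illustrative:

```python
def complexity_score(files: int, lines: int, *,
                     auth: bool = False, db_migration: bool = False,
                     new_dependency: bool = False, config: bool = False,
                     api_spec: bool = False, tests_only: bool = False) -> int:
    """Apply the signal weights from the classification table."""
    score = files + lines // 100  # +1 per file, +1 per 100 lines
    if auth:
        score += 5
    if db_migration:
        score += 5
    if new_dependency:
        score += 3
    if config:
        score += 3
    if api_spec:
        score += 3
    if tests_only:
        score -= 5
    return score

def route(score: int) -> str:
    """Map a complexity score to the routing tier."""
    if score <= 3:
        return "trivial"   # Stage 4 -> Stage 10
    if score <= 8:
        return "simple"    # Stages 2, 3, 4, 10
    if score <= 15:
        return "standard"  # all 10 stages
    return "critical"      # all 10 stages + human review
```

For example, a two-file, 150-line change scores 3 and takes the trivial path, while the same change touching an auth file scores 8 and runs stages 2, 3, 4, and 10.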

Override

Project leads can escalate complexity upward (force more stages) but never downward (skip stages). Only human approval can skip a stage on a specific change.


Joshua's Quarterly Audit

Joshua (Chief Innovation Officer) audits the pipeline every quarter. The audit reviews:

Stage effectiveness

For each stage:

  • Defect escape rate: What percentage of defects that this stage should catch actually pass through undetected?
  • Unique catches: How many defects does this stage catch that no other stage catches? If the answer is zero for two consecutive quarters, the stage is a candidate for removal.
  • Time cost: How much time does this stage add to time-to-merge?
  • Token cost: How many LLM tokens does this stage consume?

Pipeline-level metrics

| Metric | Target | Red line |
|---|---|---|
| Defect escape to production | < 2% | > 5% |
| Time-to-merge (standard) | < 4 hours | > 8 hours |
| False positive rate (overall) | < 10% | > 20% |
| Pipeline completion rate | > 95% | < 90% |
| Stage correlation | < 0.3 | > 0.7 |

Stage correlation

If two stages always agree (correlation > 0.7), one of them is redundant. The audit identifies which stage provides unique value and flags the other for potential removal or recalibration.
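One concrete way to measure agreement between two stages is the phi coefficient over their paired pass/fail verdicts on the same changes; the 0.3 and 0.7 thresholds then apply directly. The choice of phi as the correlation measure is an assumption, not something the audit spec mandates:

```python
import math

def stage_correlation(a: list[bool], b: list[bool]) -> float:
    """Phi coefficient between two stages' verdicts on the same changes
    (True = pass, False = fail). Returns 0.0 when either stage never
    varies, since correlation is undefined there."""
    n11 = sum(x and y for x, y in zip(a, b))          # both pass
    n10 = sum(x and not y for x, y in zip(a, b))      # only a passes
    n01 = sum(not x and y for x, y in zip(a, b))      # only b passes
    n00 = sum(not x and not y for x, y in zip(a, b))  # both fail
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denom
```

Two stages with identical verdict histories score 1.0 and are redundancy candidates; values near 0 indicate the stages catch different failure classes.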

Pruning non-load-bearing stages

A stage is "non-load-bearing" when:

  1. Its unique catch rate is < 1% for two consecutive quarters
  2. Its false positive rate is > 15%
  3. Removing it in a shadow run does not increase defect escapes

Non-load-bearing stages are not automatically removed. Joshua recommends removal, the team discusses, and human approval is required. The stage may be recalibrated instead of removed if the underlying failure class is still relevant.
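The three criteria combine into a single predicate that feeds Joshua's recommendation (never an automatic removal). A sketch, with the metric representations as assumptions: rates as fractions, and the shadow-run result as the change in defect escapes when the stage is removed:

```python
def is_non_load_bearing(unique_catch_rates: list[float],
                        false_positive_rate: float,
                        shadow_run_escape_delta: float) -> bool:
    """True only when all three criteria hold: unique catch rate below 1%
    for the last two quarters, false positive rate above 15%, and no
    increase in defect escapes during a shadow run without the stage."""
    two_low_quarters = (len(unique_catch_rates) >= 2
                        and all(r < 0.01 for r in unique_catch_rates[-2:]))
    return (two_low_quarters
            and false_positive_rate > 0.15
            and shadow_run_escape_delta <= 0)
```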


Calibration Drift

Calibration drift happens when the pipeline's effectiveness changes over time without anyone noticing. Common causes:

Codebase evolution

The codebase changes. New frameworks, new patterns, new conventions. A rubric calibrated for Express.js does not work for Hono. A naming convention check calibrated for REST does not work for WebSocket handlers.

Prevention: Recalibrate rubrics when the tech stack changes. Run calibration examples after framework upgrades.

Agent model updates

LLM providers update their models. A model update can change the quality characteristics of generated code. Better models produce fewer obvious errors but potentially more subtle ones.

Prevention: Track defect escape rates per model version. When an agent's model changes, re-run calibration examples.

Survivorship bias

Over time, easy-to-catch defects become rare (agents learn to avoid them). The pipeline looks effective because the raw catch rate stays high, but the defects it catches are trivial. Serious defects — the ones that actually matter — escape because the pipeline was never calibrated for them.

Prevention: Joshua's quarterly audit includes adversarial injection — known-defective code is inserted into the pipeline to verify that stages still catch it.


Metrics Collection

Per-stage metrics (collected automatically)

| Metric | Collection method |
|---|---|
| Defects caught | Count of fail results per stage |
| False positives | Count of overridden failures |
| Time spent | Duration from stage start to stage completion |
| Token cost | LLM tokens consumed (for LLM-based stages) |
| Defect severity | Classification of caught defects |

Pipeline-level metrics (computed weekly)

| Metric | Computation |
|---|---|
| Defect escape rate | Production incidents / total changes merged |
| Time-to-merge | Median time from PR open to merge |
| Pipeline throughput | Changes merged per day |
| Stage bottleneck | Stage with highest median duration |
| Cost per change | Total pipeline cost / changes merged |
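The first two computations are simple enough to state directly. A sketch (function names and the zero-denominator behavior are assumptions):

```python
from statistics import median

def defect_escape_rate(production_incidents: int, changes_merged: int) -> float:
    """Production incidents divided by total changes merged;
    0.0 when nothing has merged yet."""
    return production_incidents / changes_merged if changes_merged else 0.0

def time_to_merge_hours(durations_hours: list[float]) -> float:
    """Median time from PR open to merge, in hours."""
    return median(durations_hours)
```

With 2 incidents over 100 merged changes, the escape rate is 2%, exactly at the pipeline-level target.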

Reporting

  • Weekly metrics summary posted to the quality channel
  • Monthly trend report reviewed by project leads
  • Quarterly deep dive by Joshua (see audit section above)

Calibration Ownership

| Responsibility | Owner |
|---|---|
| Stage 1-2 rubrics | Antje (Test Agent) |
| Stage 4 rubrics | Koen (Code Quality) |
| Stage 5 rubrics | Marije / Judith (Testing Leads) |
| Stage 6-7 rubrics | Jasper / Marco |
| Stage 8 rubrics | Ashley (Adversarial) |
| Stage 9 rubrics | Jaap (SSOT) |
| Stage 10 rubrics | Marta / Iwona (Merge Gate) |
| Quarterly audit | Joshua (CIO) |
| Complexity routing | Orchestrator (automated) |
| Metric collection | Pipeline infrastructure (automated) |