Pipeline Calibration¶
How the 10-stage pipeline is tuned, measured, and kept effective. A pipeline that catches nothing is useless. A pipeline that blocks everything is also useless. Calibration keeps each stage in the productive zone.
The Calibration Problem¶
Every quality gate faces the same tension:
- Too strict: Clean code is blocked. Agents waste cycles fixing false positives. Time-to-merge increases. Throughput drops.
- Too loose: Defective code passes through. Production incidents increase. Trust in the pipeline erodes. Stages become rubber stamps.
The goal is not zero false positives or zero defect escapes. The goal is a balance point where the pipeline catches real problems without creating unnecessary friction.
Scoring Rubrics¶
Each stage uses a scoring rubric that defines what constitutes a pass, a warning, and a failure. Rubrics are explicit — not left to agent judgment.
Rubric structure¶
Every rubric defines:
| Component | Description |
|---|---|
| Check | What is being evaluated |
| Pass | Specific criteria for passing |
| Warn | Criteria that flag but do not block |
| Fail | Criteria that block progression |
| Override | Who can override a failure (if anyone) |
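The rubric structure above can be sketched as data plus two predicates. This is an illustrative model only, not the pipeline's actual schema: the names `RubricCheck` and `evaluate`, and the 500 kB bundle limit, are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RubricCheck:
    check: str                      # what is being evaluated
    passes: Callable[[dict], bool]  # specific criteria for passing
    warns: Callable[[dict], bool]   # criteria that flag but do not block
    override: Optional[str] = None  # who can override a failure, if anyone

def evaluate(rubric: RubricCheck, result: dict) -> str:
    """Map a raw tool result onto pass / warn / fail."""
    if rubric.passes(result):
        return "pass"
    if rubric.warns(result):
        return "warn"
    return "fail"

# Illustrative check: bundle size with an assumed 500 kB limit,
# warning when the bundle is within 10% below the limit.
bundle_size = RubricCheck(
    check="Bundle size",
    passes=lambda r: r["bytes"] < r["limit"] * 0.90,
    warns=lambda r: r["bytes"] <= r["limit"],
    override="project lead",
)
```

The warn band sits just under the limit so a change is flagged before it fails, rather than after.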
Example: Koen (Stage 4) deterministic quality rubric¶
| Check | Pass | Warn | Fail |
|---|---|---|---|
| TypeScript strict | Zero errors | N/A | Any error |
| ESLint | Zero errors, zero warnings | N/A | Any error or warning |
| Bundle size | More than 10% under limit | Within 10% of limit (still under) | Over limit |
| Dependencies | No critical/high vulns | Medium vulns present | Critical or high vuln |
| Circular imports | None detected | N/A | Any circular import |
| Dead code | None detected | Unused type exports | Unused function exports |
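A stage like this folds its individual check results into one gate verdict. A minimal sketch of that folding, with an assumed precedence (any fail blocks; otherwise any warn surfaces; otherwise the stage passes) and an assumed function name:

```python
def stage_verdict(outcomes: list) -> str:
    """Combine per-check results: fail dominates, then warn, then pass."""
    if "fail" in outcomes:
        return "fail"
    if "warn" in outcomes:
        return "warn"
    return "pass"
```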
Example: Ashley (Stage 8) adversarial testing rubric¶
| Check | Pass | Warn | Fail |
|---|---|---|---|
| SQL injection | All payloads rejected | N/A | Any payload executed |
| XSS | All payloads sanitized | Reflected but encoded | Stored or executed |
| Auth bypass | All attempts rejected | Session fixation possible | Any bypass achieved |
| IDOR | All cross-user access denied | Enumeration possible | Data accessed |
| Rate limiting | Enforced on all endpoints | Missing on non-critical | Missing on critical |
| Input validation | All fuzzing rejected cleanly | Unexpected 500 on edge | Crash or hang |
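The input-validation row can be read as a grading function over endpoint responses. A hedged sketch, assuming HTTP semantics (`classify_fuzz_result` and its parameters are illustrative, not Ashley's actual harness):

```python
from typing import Optional

def classify_fuzz_result(status: Optional[int], timed_out: bool) -> str:
    """Grade one endpoint response to a fuzz payload."""
    if timed_out or status is None:
        return "fail"   # crash or hang
    if status >= 500:
        return "warn"   # unexpected 500 on an edge case
    if 400 <= status < 500:
        return "pass"   # rejected cleanly
    return "fail"       # 2xx/3xx: the payload was accepted
```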
Calibration Examples¶
Each stage maintains a set of calibration examples — code samples with known correct outcomes. These serve two purposes:
- Training: When a new agent takes over a stage, calibration examples show exactly what pass/warn/fail looks like
- Regression: If a stage's behavior changes, running calibration examples detects the drift
Calibration example format¶
```yaml
id: CAL-K04-017
stage: 4
agent: koen
description: "Module with unused export that is re-exported by index"
code_sample: "examples/calibration/k04-017/"
expected_outcome: pass
rationale: >
  The export is unused in this module but re-exported by the
  barrel file. Dead code detection should follow re-export chains.
last_validated: 2026-03-15
```
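The regression use of calibration examples amounts to replaying each sample through its stage and diffing against the recorded outcome. A sketch under assumed names (`CalibrationExample`, `detect_drift`, and the injected `run_stage` callable are all illustrative):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CalibrationExample:
    id: str
    stage: int
    code_sample: str       # path to the sample
    expected_outcome: str  # "pass" | "warn" | "fail"

def detect_drift(examples: List[CalibrationExample],
                 run_stage: Callable[[int, str], str]) -> List[str]:
    """Return the ids of examples whose actual outcome no longer
    matches the recorded expected_outcome."""
    return [ex.id for ex in examples
            if run_stage(ex.stage, ex.code_sample) != ex.expected_outcome]
```

A non-empty return value is the drift signal: either the stage regressed or the example needs revalidation.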
Calibration set size per stage¶
| Stage | Agent | Calibration examples |
|---|---|---|
| 1 | Anna | 20 spec samples |
| 2 | Antje | 35 test/spec pairs |
| 3 | Dev agents | N/A (not a gate) |
| 4 | Koen | 50 code samples |
| 5 | Marije/Judith | 30 integration scenarios |
| 6 | Jasper | 25 reconciliation cases |
| 7 | Marco | 20 conflict scenarios |
| 8 | Ashley | 40 attack patterns |
| 9 | Jaap | 30 SSOT cases |
| 10 | Marta/Iwona | 15 merge decisions |
False Positive Management¶
False positives are the silent killer of quality pipelines. When agents learn that a stage frequently blocks clean code, they lose trust in the stage. They start treating failures as "probably false positives" and look for workarounds instead of fixes.
Detection¶
False positives are detected when:
- A stage blocks code that subsequently merges without changes (the block was overridden)
- A stage flags an issue that no downstream stage confirms
- A stage's warn/fail ratio exceeds 3:1 (warnings outnumber actionable failures three to one)
Response¶
| False positive rate | Action |
|---|---|
| < 5% | Healthy — no action needed |
| 5-10% | Review flagged cases, adjust rubric thresholds |
| 10-20% | Recalibrate stage, update calibration examples |
| > 20% | Stage is degraded — escalate to Joshua for audit |
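The response table maps directly onto a threshold function. A sketch (function name assumed; boundaries follow the table, with rates as fractions):

```python
def false_positive_action(fp_rate: float) -> str:
    """Map a stage's false-positive rate onto the response table."""
    if fp_rate < 0.05:
        return "healthy"
    if fp_rate < 0.10:
        return "review flagged cases, adjust rubric thresholds"
    if fp_rate <= 0.20:
        return "recalibrate stage, update calibration examples"
    return "escalate to Joshua for audit"
```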
Tracking¶
Every false positive is logged with:
- Which stage flagged it
- What was flagged
- Why it was a false positive
- What rubric change would prevent recurrence
This log feeds into quarterly calibration reviews.
Complexity-Based Routing¶
Not every change needs all 10 stages. Routing rules determine which stages a change passes through based on its complexity.
Complexity classification¶
| Signal | Weight | Rationale |
|---|---|---|
| Files changed | +1 per file | More files = more integration risk |
| Lines changed | +1 per 100 lines | Larger changes = more hiding places |
| Auth files touched | +5 | Security-critical |
| DB migration included | +5 | Data-critical |
| New dependency added | +3 | Supply chain risk |
| Config files touched | +3 | System-wide impact |
| API spec changed | +3 | Contract change |
| Test files only | -5 | Low risk by nature |
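The weights above can be applied as a straightforward sum. A sketch assuming a flat `dict` per change (the key names here are illustrative, not the orchestrator's real schema):

```python
def complexity_score(change: dict) -> int:
    """Apply the signal weights from the classification table."""
    score = change.get("files_changed", 0)          # +1 per file
    score += change.get("lines_changed", 0) // 100  # +1 per 100 lines
    if change.get("touches_auth"):     score += 5   # security-critical
    if change.get("db_migration"):     score += 5   # data-critical
    if change.get("new_dependency"):   score += 3   # supply chain risk
    if change.get("touches_config"):   score += 3   # system-wide impact
    if change.get("api_spec_changed"): score += 3   # contract change
    if change.get("test_files_only"):  score -= 5   # low risk by nature
    return score
```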
Routing rules¶
| Score | Complexity | Pipeline |
|---|---|---|
| 0-3 | Trivial | Stage 4 (Koen) → Stage 10 (Marta/Iwona) |
| 4-8 | Simple | Stages 2, 3, 4, 10 |
| 9-15 | Standard | All 10 stages |
| 16+ | Critical | All 10 stages + human review |
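The routing table itself reduces to a banded lookup. A sketch (function name assumed; bands follow the table):

```python
def route(score: int) -> str:
    """Select a pipeline tier per the routing table."""
    if score <= 3:
        return "trivial"   # Stage 4 (Koen) -> Stage 10 (Marta/Iwona)
    if score <= 8:
        return "simple"    # Stages 2, 3, 4, 10
    if score <= 15:
        return "standard"  # all 10 stages
    return "critical"      # all 10 stages + human review
```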
Override¶
Project leads can escalate complexity upward (force more stages) but never downward (skip stages). Only human approval can skip a stage on a specific change.
Joshua's Quarterly Audit¶
Joshua (Chief Innovation Officer) audits the pipeline every quarter. The audit reviews:
Stage effectiveness¶
For each stage:
- Defect escape rate: What percentage of defects that this stage should catch actually pass through undetected?
- Unique catches: How many defects does this stage catch that no other stage catches? If the answer is zero for two consecutive quarters, the stage is a candidate for removal.
- Time cost: How much time does this stage add to time-to-merge?
- Token cost: How many LLM tokens does this stage consume?
Pipeline-level metrics¶
| Metric | Target | Red line |
|---|---|---|
| Defect escape to production | < 2% | > 5% |
| Time-to-merge (standard) | < 4 hours | > 8 hours |
| False positive rate (overall) | < 10% | > 20% |
| Pipeline completion rate | > 95% | < 90% |
| Stage correlation | < 0.3 | > 0.7 |
Stage correlation¶
If two stages always agree (correlation > 0.7), one of them is redundant. The audit identifies which stage provides unique value and flags the other for potential removal or recalibration.
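One way to measure agreement between two stages is the phi coefficient over their matched block/pass verdicts on the same set of changes. This is a sketch of that statistic, not necessarily the audit's actual method:

```python
def stage_correlation(a, b) -> float:
    """Phi coefficient between two stages' verdicts on the same
    changes (True = blocked). 1.0 means the stages always agree."""
    n11 = sum(x and y for x, y in zip(a, b))
    n00 = sum(not x and not y for x, y in zip(a, b))
    n10 = sum(x and not y for x, y in zip(a, b))
    n01 = sum(not x and y for x, y in zip(a, b))
    denom = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```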
Pruning non-load-bearing stages¶
A stage is "non-load-bearing" when all of the following hold:
- Its unique catch rate is < 1% for two consecutive quarters
- Its false positive rate is > 15%
- Removing it in a shadow run does not increase defect escapes
Non-load-bearing stages are not automatically removed. Joshua recommends removal, the team discusses, and human approval is required. The stage may be recalibrated instead of removed if the underlying failure class is still relevant.
Calibration Drift¶
Calibration drift happens when the pipeline's effectiveness changes over time without anyone noticing. Common causes:
Codebase evolution¶
The codebase changes. New frameworks, new patterns, new conventions. A rubric calibrated for Express.js does not work for Hono. A naming convention check calibrated for REST does not work for WebSocket handlers.
Prevention: Recalibrate rubrics when the tech stack changes. Run calibration examples after framework upgrades.
Agent model updates¶
LLM providers update their models. A model update can change the quality characteristics of generated code. Better models produce fewer obvious errors but potentially more subtle ones.
Prevention: Track defect escape rates per model version. When an agent's model changes, re-run calibration examples.
Survivorship bias¶
Over time, easy-to-catch defects become rare (agents learn to avoid them). The pipeline looks effective because the raw catch rate stays high, but the defects it catches are trivial. Serious defects — the ones that actually matter — escape because the pipeline was never calibrated for them.
Prevention: Joshua's quarterly audit includes adversarial injection — known-defective code is inserted into the pipeline to verify that stages still catch it.
Metrics Collection¶
Per-stage metrics (collected automatically)¶
| Metric | Collection method |
|---|---|
| Defects caught | Count of fail results per stage |
| False positives | Count of overridden failures |
| Time spent | Duration from stage start to stage completion |
| Token cost | LLM tokens consumed (for LLM-based stages) |
| Defect severity | Classification of caught defects |
Pipeline-level metrics (computed weekly)¶
| Metric | Computation |
|---|---|
| Defect escape rate | Production incidents / total changes merged |
| Time-to-merge | Median time from PR open to merge |
| Pipeline throughput | Changes merged per day |
| Stage bottleneck | Stage with highest median duration |
| Cost per change | Total pipeline cost / changes merged |
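The weekly computations above can be sketched as one rollup over the merged changes. The per-change key names (`caused_incident`, `hours_to_merge`, `pipeline_cost`) are assumptions for illustration:

```python
import statistics

def weekly_metrics(merged_changes: list) -> dict:
    """Compute the weekly pipeline-level rollup from merged changes."""
    n = len(merged_changes)
    return {
        "defect_escape_rate":
            sum(c["caused_incident"] for c in merged_changes) / n,
        "time_to_merge_hours":
            statistics.median(c["hours_to_merge"] for c in merged_changes),
        "throughput_per_day": n / 7,
        "cost_per_change":
            sum(c["pipeline_cost"] for c in merged_changes) / n,
    }
```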
Reporting¶
- Weekly metrics summary posted to the quality channel
- Monthly trend report reviewed by project leads
- Quarterly deep dive by Joshua (see audit section above)
Calibration Ownership¶
| Responsibility | Owner |
|---|---|
| Stage 1-2 rubrics | Antje (Test Agent) |
| Stage 4 rubrics | Koen (Code Quality) |
| Stage 5 rubrics | Marije / Judith (Testing Leads) |
| Stage 6-7 rubrics | Jasper / Marco |
| Stage 8 rubrics | Ashley (Adversarial) |
| Stage 9 rubrics | Jaap (SSOT) |
| Stage 10 rubrics | Marta / Iwona (Merge Gate) |
| Quarterly audit | Joshua (CIO) |
| Complexity routing | Orchestrator (automated) |
| Metric collection | Pipeline infrastructure (automated) |