Pipeline Pitfalls

What goes wrong with the quality pipeline itself. The pipeline is not immune to failure. These pitfalls have been observed in GE or in comparable agentic systems.


1. Pipeline Theater

What happens: Every stage exists. Every stage runs. Every stage produces a report. But the stages rubber-stamp. Tests pass because they test nothing meaningful. Security scans pass because they check the wrong endpoints. The merge gate approves because all stages report green.

The pipeline looks active. Metrics look healthy. Defects ship to production.

Why it happens:

  • Stages optimized for speed, not depth
  • Calibration examples not updated after codebase changes
  • Agents learn the minimum effort to produce a passing report
  • No adversarial injection to verify stages actually catch defects

How to detect:

  • Production incident rate does not correlate with pipeline metrics
  • Stage unique-catch rate drops to near zero
  • Joshua's quarterly audit catches the pattern through adversarial injection

How to fix:

  • Run known-defective code through the pipeline monthly
  • Compare stage catch rates against injected defects
  • If a stage misses an injected defect, recalibrate immediately
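
The monthly injection audit can be sketched as follows. The stage interface (each stage is a callable that returns the set of defect IDs it caught in a sample) is an assumption for illustration, not GE's actual API:

```python
def injection_audit(stages, seeded_defects):
    """Run known-defective samples through each stage and report misses.

    stages: dict mapping stage name -> callable(sample) -> set of caught defect IDs
    seeded_defects: dict mapping sample -> set of defect IDs planted in it
    """
    misses = {}
    for name, stage in stages.items():
        missed = set()
        for sample, planted in seeded_defects.items():
            caught = stage(sample)
            missed |= planted - caught  # planted defects the stage failed to flag
        if missed:
            misses[name] = missed  # these stages need immediate recalibration
    return misses
```

Any stage that appears in the result missed at least one injected defect and is a recalibration candidate.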

2. Bottleneck Stages

What happens: One stage takes significantly longer than the others. Work queues up. Time-to-merge increases. Dev agents sit idle waiting for their code to clear the bottleneck. Teams start asking to skip the slow stage "just this once."

Common bottlenecks:

  • Stage 5 (Integration Testing): Requires a full test environment. Environment provisioning is slow. Tests are slow. Flaky tests require reruns.
  • Stage 8 (Adversarial Testing): Thorough fuzzing takes time. Ashley's attack surface analysis is proportional to API surface area.
  • Stage 10 (Merge Gate): Marta/Iwona review all stage reports manually. If they are processing multiple PRs, the queue grows.

How to fix:

  • Parallelize where possible (Stages 6, 7, 8 can run concurrently)
  • Pre-provision test environments (Stage 5)
  • Cache adversarial test results for unchanged endpoints (Stage 8)
  • Automate merge gate for trivial-complexity changes (Stage 10)
  • Never skip a stage — fix the bottleneck instead
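
The parallelization fix can be sketched with a thread pool; the stage callables here are stand-ins for the real stages, which would call out to agents:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(change, stages):
    """Run independent stages concurrently instead of serially.

    stages: dict mapping stage name -> callable(change) -> report.
    Suitable only for stages with no dependency on each other's output.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, change) for name, fn in stages.items()}
        return {name: f.result() for name, f in futures.items()}
```

With this shape, wall-clock time for Stages 6, 7, and 8 approaches the slowest stage rather than the sum of all three.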

3. Calibration Drift

What happens: The pipeline was well-calibrated six months ago. Since then, the codebase has changed frameworks, the LLM models have been updated, and the team has grown. But the rubrics, thresholds, and calibration examples have not been updated.

Symptoms:

  • False positive rate increases gradually
  • New types of defects escape that the pipeline should catch
  • Agents learn to work around specific checks rather than fixing the underlying issues
  • Calibration examples no longer represent real code patterns

How to fix:

  • Update calibration examples when the tech stack changes
  • Re-run calibration suite after LLM model updates
  • Track false positive rate per stage per month — any upward trend triggers immediate recalibration
  • Joshua's quarterly audit is the backstop
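
The per-stage false-positive trend trigger might look like the sketch below. The rule (a monotonic rise over a three-month window) is an assumption, not a documented GE threshold:

```python
def needs_recalibration(monthly_fp_rates, window=3):
    """Trigger recalibration when the false-positive rate rises
    monotonically over the last `window` months.

    monthly_fp_rates: list of per-month rates, oldest first.
    """
    recent = monthly_fp_rates[-window:]
    if len(recent) < window:
        return False  # not enough history to call a trend
    return all(a < b for a, b in zip(recent, recent[1:]))
```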

4. Over-Reliance on Deterministic Checks

What happens: Stage 4 (Koen — deterministic quality) becomes the de facto quality gate. It is fast, reliable, and produces clear pass/fail results. Other stages — especially LLM-based ones — are seen as slower, less reliable, and more expensive. The team starts treating Stage 4 as sufficient.

Why it is dangerous: Deterministic checks catch structural problems: type errors, lint violations, format drift, circular imports. They cannot catch semantic problems: does the code do what the spec says? Is the business logic correct? Are edge cases handled? Does the code make sense architecturally?

An implementation can pass every deterministic check and still be completely wrong semantically. This is precisely the failure mode LLMs produce most often — code that is structurally perfect and semantically broken.

How to fix:

  • Track defect escapes by category: structural vs semantic
  • If semantic escapes increase, strengthen Stages 1, 2, 5, 6
  • Never use Stage 4 pass as a shortcut to merge
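
Tracking escapes by category can be as simple as the sketch below; the category labels are illustrative, not a fixed GE taxonomy:

```python
from collections import Counter

# Categories that Stage 4's deterministic checks are supposed to cover.
STRUCTURAL = {"type-error", "lint", "format", "circular-import"}

def escape_breakdown(escaped_defects):
    """Split production escapes into structural vs semantic counts.

    escaped_defects: iterable of defect-category strings from incidents.
    """
    counts = Counter(
        "structural" if d in STRUCTURAL else "semantic" for d in escaped_defects
    )
    return counts["structural"], counts["semantic"]
```

A rising semantic count with a flat structural count is the signal that Stages 1, 2, 5, and 6 need strengthening.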

5. Under-Use of Adversarial Testing

What happens: Stage 8 (Ashley — adversarial testing) is seen as "nice to have" but not essential. It is the most expensive stage (in tokens and time). It is the hardest to interpret (attack reports require security knowledge). Teams request complexity-based routing that skips adversarial testing for "simple" changes.

Why it is dangerous: Security vulnerabilities do not correlate with change complexity. A one-line change can introduce an auth bypass. A "simple" new endpoint can be vulnerable to IDOR. The changes most likely to be skipped are exactly the ones adversarial testing should cover.

How to fix:

  • Never skip adversarial testing for changes that touch: authentication, authorization, user input handling, file operations, database queries, or external service communication
  • Use cached results for unchanged endpoints to reduce cost
  • Train agents to read attack reports — the findings are only valuable if they lead to fixes
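
The never-skip rule can be sketched as a routing guard. The touch-area labels and the complexity threshold are assumptions for illustration:

```python
# Areas for which Stage 8 must never be skipped, per the rule above.
SENSITIVE = {"auth", "authz", "user-input", "file-io", "db-query", "external-service"}

def must_run_adversarial(touched_areas, complexity, threshold=3):
    """Complexity-based routing may skip Stage 8 only for non-sensitive changes."""
    if SENSITIVE & set(touched_areas):
        return True  # never skip, even for a "simple" one-line change
    return complexity >= threshold  # assumed routing threshold for everything else
```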

6. Reconciliation Ignored

What happens: Stage 6 (Jasper — test reconciliation) produces a report. Nobody reads it. The reconciliation report contains discrepancies between test suites, but since individual suites passed, the discrepancies are dismissed as "test implementation differences."

Why it is dangerous: Reconciliation catches a specific failure class: tests that pass for the wrong reason. A unit test passes because the mock returns the expected value regardless of input. An integration test passes because it tests a different code path. The discrepancy between them reveals that something is wrong, but only if someone reads the reconciliation report.

How to fix:

  • Reconciliation discrepancies must be resolved before Stage 10
  • Marta/Iwona must verify that the reconciliation report is clean during their merge gate review
  • Flag unresolved discrepancies as blockers, not warnings
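
A sketch of a merge gate that treats unresolved discrepancies as blockers rather than warnings; the report shapes are assumptions:

```python
def merge_gate_ready(stage_reports, reconciliation):
    """Decide merge readiness.

    stage_reports: dict mapping stage name -> "pass" / "fail".
    reconciliation: list of discrepancy dicts with "id" and "resolved" keys.
    Returns (ready, blockers).
    """
    unresolved = [d for d in reconciliation if not d["resolved"]]
    if unresolved:
        # Unresolved discrepancies block the merge regardless of stage results.
        return False, [f"blocker: {d['id']}" for d in unresolved]
    return all(r == "pass" for r in stage_reports.values()), []
```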

7. Stage Coupling

What happens: Stages develop implicit dependencies on each other. Stage 5 assumes Stage 4 has already run and does not re-check types. Stage 8 assumes Stage 5's integration tests provide a baseline and only tests what integration tests missed. If a stage is skipped (complexity-based routing), downstream stages miss things they assumed were already checked.

Why it is dangerous: Each stage should independently verify its area. If Stage 8 assumes Stage 5 caught basic integration issues, skipping Stage 5 means Stage 8 also misses those issues — even though Stage 8's charter includes them.

How to fix:

  • Each stage must be independently effective
  • Calibration examples must be run with upstream stages both present and absent
  • Complexity-based routing must not create implicit assumptions
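
The independence requirement can be expressed as a check that each stage catches its charter defects with no upstream help; the interfaces are hypothetical:

```python
def independently_effective(stage, calibration):
    """A stage must catch every charter defect with upstream stages absent.

    stage: callable(sample) -> set of caught defect IDs.
    calibration: dict mapping sample -> set of charter defect IDs planted in it.
    """
    return all(planted <= stage(sample) for sample, planted in calibration.items())
```

Running this check with upstream stages both present and absent exposes any implicit coupling before complexity-based routing does.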

8. Gaming the Pipeline

What happens: Dev agents learn the pipeline's patterns. They write code specifically to pass each stage's checks rather than writing correct code that happens to pass. Tests are written to achieve coverage numbers, not to verify behavior. Naming conventions are followed mechanically without understanding why.

Why it is dangerous: The code passes all stages but does not serve the user. It is compliant code, not correct code.

How to detect:

  • Production incidents from code that passed all stages cleanly
  • Coverage numbers high but mutation testing scores low
  • Code structurally correct but functionally wrong

How to fix:

  • Add mutation testing to the pipeline (surviving mutants expose tests that assert too little)
  • Ashley's zero-knowledge approach is inherently resistant to gaming
  • Joshua's quarterly audit injects defects designed to game specific stages
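
A toy illustration of what mutation testing checks: flip an operator in the code under test and verify the suite fails. A real pipeline would use a dedicated tool (e.g. mutmut for Python), not hand-written mutants:

```python
def run_suite(add):
    """A test that actually checks behavior, not just coverage."""
    return add(2, 3) == 5

original = lambda a, b: a + b
mutant = lambda a, b: a - b  # mutation: '+' flipped to '-'

def mutant_killed(suite, mutant_fn):
    """A mutant is killed when the suite fails on it; a survivor means
    the tests execute the code without asserting its behavior."""
    return not suite(mutant_fn)
```

A gamed test like `lambda add: add(2, 3) is not None` would let this mutant survive, which is exactly the signal mutation testing provides.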

9. Cost Escalation

What happens: The pipeline grows. New checks are added to existing stages. New stages are proposed. Each individual addition is justified. But the aggregate cost — in tokens, time, and compute — makes the pipeline unsustainably expensive.

GE context: With 60 agents generating code and every change passing through up to 10 stages, pipeline cost scales multiplicatively. A $0.50 stage that runs 100 times per day costs $50/day. Ten such stages cost $500/day.
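
The arithmetic above, as a one-line helper (the figures are the illustrative ones from the text):

```python
def daily_pipeline_cost(cost_per_run, runs_per_day, n_stages):
    """Aggregate daily cost: per-run cost x runs x number of stages."""
    return cost_per_run * runs_per_day * n_stages
```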

How to fix:

  • Track cost per stage per change
  • Joshua's quarterly audit evaluates cost vs value for every stage
  • Use deterministic checks (Stage 4) for everything deterministic — they are essentially free
  • Cache LLM-based stage results for unchanged code segments
  • Complexity-based routing is the primary cost control mechanism
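
Caching LLM-stage results by content hash might be sketched as follows; the in-memory dict is for illustration, and a real pipeline would persist the cache:

```python
import hashlib

_cache = {}

def cached_stage(stage_fn, segment):
    """Return a cached result for unchanged code segments.

    Keyed by a hash of the segment content, so the LLM cost is paid
    once per distinct content, not once per pipeline run.
    """
    key = hashlib.sha256(segment.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = stage_fn(segment)
    return _cache[key]
```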

10. Missing Feedback Loop

What happens: The pipeline catches defects. The defects are fixed. But nobody asks: "Why did this defect exist in the first place?" The pipeline catches the same category of defect repeatedly. The dev agents keep making the same mistakes because they never learn from pipeline feedback.

How to fix:

  • Log defect categories and feed them back to dev agents as learnings
  • Annegreet (Knowledge Curator) should extract patterns from pipeline catch data
  • Recurring defect categories should become calibration examples
  • Dev agent system prompts should include common failure modes from pipeline data
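
Promoting recurring categories into calibration examples can start from a simple counter over the pipeline's catch log; the threshold of three is an assumption:

```python
from collections import Counter

def recurring_categories(catch_log, threshold=3):
    """Return defect categories caught at least `threshold` times.

    catch_log: list of defect-category strings from pipeline catches.
    These are the candidates for new calibration examples and for
    dev-agent prompt learnings.
    """
    return [cat for cat, n in Counter(catch_log).items() if n >= threshold]
```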

Summary

Pitfall                          Severity   Detection Method
Pipeline theater                 Critical   Adversarial injection
Bottleneck stages                High       Time-to-merge metrics
Calibration drift                High       False positive trend
Over-reliance on deterministic   High       Semantic escape rate
Under-use of adversarial         High       Security incident rate
Reconciliation ignored           Medium     Unread report tracking
Stage coupling                   Medium     Skip-stage testing
Gaming the pipeline              High       Mutation testing
Cost escalation                  Medium     Cost per change metric
Missing feedback loop            Medium     Recurring defect categories