Pipeline Pitfalls

What goes wrong with the quality pipeline itself. The pipeline is not immune to failure. These pitfalls have been observed in GE or in comparable agentic systems.


1. Pipeline Theater

What happens: Every stage exists. Every stage runs. Every stage produces a report. But the stages rubber-stamp. Tests pass because they test nothing meaningful. Security scans pass because they check the wrong endpoints. The merge gate approves because all stages report green.

The pipeline looks active. Metrics look healthy. Defects ship to production.

Why it happens:

  • Stages optimized for speed, not depth
  • Calibration examples not updated after codebase changes
  • Agents learn the minimum effort to produce a passing report
  • No adversarial injection to verify stages actually catch defects

How to detect:

  • Production incident rate does not correlate with pipeline metrics
  • Stage unique-catch rate drops to near zero
  • Joshua's quarterly audit catches the pattern through adversarial injection

How to fix:

  • Run known-defective code through the pipeline monthly
  • Compare stage catch rates against injected defects
  • If a stage misses an injected defect, recalibrate immediately
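
The monthly injection audit can be sketched as follows. The stage interface (each stage is a callable that returns the set of defect IDs it caught in a sample) is an assumption for illustration, not GE's actual API:

```python
def injection_audit(stages, seeded_defects):
    """Run known-defective samples through each stage and report misses.

    stages: dict mapping stage name -> callable(sample) -> set of caught defect IDs
    seeded_defects: dict mapping sample -> set of defect IDs planted in it
    """
    misses = {}
    for name, stage in stages.items():
        missed = set()
        for sample, planted in seeded_defects.items():
            caught = stage(sample)
            missed |= planted - caught  # planted defects the stage failed to flag
        if missed:
            misses[name] = missed  # these stages need immediate recalibration
    return misses
```

Any stage that appears in the result missed at least one injected defect and is a recalibration candidate.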

2. Bottleneck Stages

What happens: One stage takes significantly longer than the others. Work queues up. Time-to-merge increases. Dev agents sit idle waiting for their code to clear the bottleneck. Teams start asking to skip the slow stage "just this once."

Common bottlenecks:

  • Stage 5 (Integration Testing): Requires a full test environment. Environment provisioning is slow. Tests are slow. Flaky tests require reruns.
  • Stage 8 (Adversarial Testing): Thorough fuzzing takes time. Ashley's attack surface analysis is proportional to API surface area.
  • Stage 10 (Merge Gate): Marta/Iwona review all stage reports manually. If they are processing multiple PRs, the queue grows.

How to fix:

  • Parallelize where possible (Stages 6, 7, 8 can run concurrently)
  • Pre-provision test environments (Stage 5)
  • Cache adversarial test results for unchanged endpoints (Stage 8)
  • Automate merge gate for trivial-complexity changes (Stage 10)
  • Never skip a stage — fix the bottleneck instead
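
The parallelization fix can be sketched with a thread pool; the stage callables here are stand-ins for the real stages, which would call out to agents:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(change, stages):
    """Run independent stages concurrently instead of serially.

    stages: dict mapping stage name -> callable(change) -> report.
    Suitable only for stages with no dependency on each other's output.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, change) for name, fn in stages.items()}
        return {name: f.result() for name, f in futures.items()}
```

With this shape, wall-clock time for Stages 6, 7, and 8 approaches the slowest stage rather than the sum of all three.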

3. Calibration Drift

What happens: The pipeline was well-calibrated six months ago. Since then, the codebase has changed frameworks, the LLM models have been updated, and the team has grown. But the rubrics, thresholds, and calibration examples have not been updated.

Symptoms:

  • False positive rate increases gradually
  • New types of defects escape that the pipeline should catch
  • Agents learn to work around specific checks rather than fixing the underlying issues
  • Calibration examples no longer represent real code patterns

How to fix:

  • Update calibration examples when the tech stack changes
  • Re-run calibration suite after LLM model updates
  • Track false positive rate per stage per month — any upward trend triggers immediate recalibration
  • Joshua's quarterly audit is the backstop
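
The per-stage false-positive trend trigger might look like the sketch below. The rule (a monotonic rise over a three-month window) is an assumption, not a documented GE threshold:

```python
def needs_recalibration(monthly_fp_rates, window=3):
    """Trigger recalibration when the false-positive rate rises
    monotonically over the last `window` months.

    monthly_fp_rates: list of per-month rates, oldest first.
    """
    recent = monthly_fp_rates[-window:]
    if len(recent) < window:
        return False  # not enough history to call a trend
    return all(a < b for a, b in zip(recent, recent[1:]))
```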

4. Over-Reliance on Deterministic Checks

What happens: Stage 4 (Koen — deterministic quality) becomes the de facto quality gate. It is fast, reliable, and produces clear pass/fail results. Other stages — especially LLM-based ones — are seen as slower, less reliable, and more expensive. The team starts treating Stage 4 as sufficient.

Why it is dangerous: Deterministic checks catch structural problems: type errors, lint violations, format drift, circular imports. They cannot catch semantic problems: does the code do what the spec says? Is the business logic correct? Are edge cases handled? Does the code make sense architecturally?

An implementation can pass every deterministic check and still be completely wrong semantically. This is precisely the failure mode LLMs produce most often — code that is structurally perfect and semantically broken.

How to fix:

  • Track defect escapes by category: structural vs semantic
  • If semantic escapes increase, strengthen Stages 1, 2, 5, 6
  • Never use Stage 4 pass as a shortcut to merge
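
Tracking escapes by category can be as simple as the sketch below; the category labels are illustrative, not a fixed GE taxonomy:

```python
from collections import Counter

# Categories that Stage 4's deterministic checks are supposed to cover.
STRUCTURAL = {"type-error", "lint", "format", "circular-import"}

def escape_breakdown(escaped_defects):
    """Split production escapes into structural vs semantic counts.

    escaped_defects: iterable of defect-category strings from incidents.
    """
    counts = Counter(
        "structural" if d in STRUCTURAL else "semantic" for d in escaped_defects
    )
    return counts["structural"], counts["semantic"]
```

A rising semantic count with a flat structural count is the signal that Stages 1, 2, 5, and 6 need strengthening.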

5. Under-Use of Adversarial Testing

What happens: Stage 8 (Ashley — adversarial testing) is seen as "nice to have" but not essential. It is the most expensive stage (in tokens and time). It is the hardest to interpret (attack reports require security knowledge). Teams request complexity-based routing that skips adversarial testing for "simple" changes.

Why it is dangerous: Security vulnerabilities do not correlate with change complexity. A one-line change can introduce an auth bypass. A "simple" new endpoint can be vulnerable to IDOR. The changes most likely to be skipped are exactly the ones adversarial testing should cover.

How to fix:

  • Never skip adversarial testing for changes that touch: authentication, authorization, user input handling, file operations, database queries, or external service communication
  • Use cached results for unchanged endpoints to reduce cost
  • Train agents to read attack reports — the findings are only valuable if they lead to fixes
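
The never-skip rule can be sketched as a routing guard. The touch-area labels and the complexity threshold are assumptions for illustration:

```python
# Areas for which Stage 8 must never be skipped, per the rule above.
SENSITIVE = {"auth", "authz", "user-input", "file-io", "db-query", "external-service"}

def must_run_adversarial(touched_areas, complexity, threshold=3):
    """Complexity-based routing may skip Stage 8 only for non-sensitive changes."""
    if SENSITIVE & set(touched_areas):
        return True  # never skip, even for a "simple" one-line change
    return complexity >= threshold  # assumed routing threshold for everything else
```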

6. Reconciliation Ignored

What happens: Stage 6 (Jasper — test reconciliation) produces a report. Nobody reads it. The reconciliation report contains discrepancies between test suites, but since individual suites passed, the discrepancies are dismissed as "test implementation differences."

Why it is dangerous: Reconciliation catches a specific failure class: tests that pass for the wrong reason. A unit test passes because the mock returns the expected value regardless of input. An integration test passes because it tests a different code path. The discrepancy between them reveals that something is wrong, but only if someone reads the reconciliation report.

How to fix:

  • Reconciliation discrepancies must be resolved before Stage 10
  • Marta/Iwona must verify that the reconciliation report is clean during their merge gate review
  • Flag unresolved discrepancies as blockers, not warnings
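
A sketch of a merge gate that treats unresolved discrepancies as blockers rather than warnings; the report shapes are assumptions:

```python
def merge_gate_ready(stage_reports, reconciliation):
    """Decide merge readiness.

    stage_reports: dict mapping stage name -> "pass" / "fail".
    reconciliation: list of discrepancy dicts with "id" and "resolved" keys.
    Returns (ready, blockers).
    """
    unresolved = [d for d in reconciliation if not d["resolved"]]
    if unresolved:
        # Unresolved discrepancies block the merge regardless of stage results.
        return False, [f"blocker: {d['id']}" for d in unresolved]
    return all(r == "pass" for r in stage_reports.values()), []
```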

7. Stage Coupling

What happens: Stages develop implicit dependencies on each other. Stage 5 assumes Stage 4 has already run and does not re-check types. Stage 8 assumes Stage 5's integration tests provide a baseline and only tests what integration tests missed. If a stage is skipped (complexity-based routing), downstream stages miss things they assumed were already checked.

Why it is dangerous: Each stage should independently verify its area. If Stage 8 assumes Stage 5 caught basic integration issues, skipping Stage 5 means Stage 8 also misses those issues — even though Stage 8's charter includes them.

How to fix:

  • Each stage must be independently effective
  • Calibration examples must be run with upstream stages both present and absent
  • Complexity-based routing must not create implicit assumptions
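
The independence requirement can be expressed as a check that each stage catches its charter defects with no upstream help; the interfaces are hypothetical:

```python
def independently_effective(stage, calibration):
    """A stage must catch every charter defect with upstream stages absent.

    stage: callable(sample) -> set of caught defect IDs.
    calibration: dict mapping sample -> set of charter defect IDs planted in it.
    """
    return all(planted <= stage(sample) for sample, planted in calibration.items())
```

Running this check with upstream stages both present and absent exposes any implicit coupling before complexity-based routing does.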

8. Gaming the Pipeline

What happens: Dev agents learn the pipeline's patterns. They write code specifically to pass each stage's checks rather than writing correct code that happens to pass. Tests are written to achieve coverage numbers, not to verify behavior. Naming conventions are followed mechanically without understanding why.

Why it is dangerous: The code passes all stages but does not serve the user. It is compliant code, not correct code.

How to detect:

  • Production incidents from code that passed all stages cleanly
  • Coverage numbers high but mutation testing scores low
  • Code structurally correct but functionally wrong

How to fix:

  • Add mutation testing to the pipeline (surviving mutants expose tests that assert too little)
  • Ashley's zero-knowledge approach is inherently resistant to gaming
  • Joshua's quarterly audit injects defects designed to game specific stages
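
A toy illustration of what mutation testing checks: flip an operator in the code under test and verify the suite fails. A real pipeline would use a dedicated tool (e.g. mutmut for Python), not hand-written mutants:

```python
def run_suite(add):
    """A test that actually checks behavior, not just coverage."""
    return add(2, 3) == 5

original = lambda a, b: a + b
mutant = lambda a, b: a - b  # mutation: '+' flipped to '-'

def mutant_killed(suite, mutant_fn):
    """A mutant is killed when the suite fails on it; a survivor means
    the tests execute the code without asserting its behavior."""
    return not suite(mutant_fn)
```

A gamed test like `lambda add: add(2, 3) is not None` would let this mutant survive, which is exactly the signal mutation testing provides.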

9. Cost Escalation

What happens: The pipeline grows. New checks are added to existing stages. New stages are proposed. Each individual addition is justified. But the aggregate cost — in tokens, time, and compute — makes the pipeline unsustainably expensive.

GE context: With 60 agents generating code and every change passing through up to 10 stages, pipeline cost scales multiplicatively. A $0.50 stage that runs 100 times per day costs $50/day. Ten such stages cost $500/day.
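
The arithmetic above, as a one-line helper (the figures are the illustrative ones from the text):

```python
def daily_pipeline_cost(cost_per_run, runs_per_day, n_stages):
    """Aggregate daily cost: per-run cost x runs x number of stages."""
    return cost_per_run * runs_per_day * n_stages
```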

How to fix:

  • Track cost per stage per change
  • Joshua's quarterly audit evaluates cost vs value for every stage
  • Use deterministic checks (Stage 4) for everything deterministic — they are essentially free
  • Cache LLM-based stage results for unchanged code segments
  • Complexity-based routing is the primary cost control mechanism
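
Caching LLM-stage results by content hash might be sketched as follows; the in-memory dict is for illustration, and a real pipeline would persist the cache:

```python
import hashlib

_cache = {}

def cached_stage(stage_fn, segment):
    """Return a cached result for unchanged code segments.

    Keyed by a hash of the segment content, so the LLM cost is paid
    once per distinct content, not once per pipeline run.
    """
    key = hashlib.sha256(segment.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = stage_fn(segment)
    return _cache[key]
```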

10. Missing Feedback Loop

What happens: The pipeline catches defects. The defects are fixed. But nobody asks: "Why did this defect exist in the first place?" The pipeline catches the same category of defect repeatedly. The dev agents keep making the same mistakes because they never learn from pipeline feedback.

How to fix:

  • Log defect categories and feed them back to dev agents as learnings
  • Annegreet (Knowledge Curator) should extract patterns from pipeline catch data
  • Recurring defect categories should become calibration examples
  • Dev agent system prompts should include common failure modes from pipeline data
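
Promoting recurring categories into calibration examples can start from a simple counter over the pipeline's catch log; the threshold of three is an assumption:

```python
from collections import Counter

def recurring_categories(catch_log, threshold=3):
    """Return defect categories caught at least `threshold` times.

    catch_log: list of defect-category strings from pipeline catches.
    These are the candidates for new calibration examples and for
    dev-agent prompt learnings.
    """
    return [cat for cat, n in Counter(catch_log).items() if n >= threshold]
```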

Summary

Pitfall                          Severity   Detection Method
Pipeline theater                 Critical   Adversarial injection
Bottleneck stages                High       Time-to-merge metrics
Calibration drift                High       False positive trend
Over-reliance on deterministic   High       Semantic escape rate
Under-use of adversarial         High       Security incident rate
Reconciliation ignored           Medium     Unread report tracking
Stage coupling                   Medium     Skip-stage testing
Gaming the pipeline              High       Mutation testing
Cost escalation                  Medium     Cost per change metric
Missing feedback loop            Medium     Recurring defect categories