# Self-Verification

## The Problem with Self-Assessment
An LLM asked to verify its own work will almost always conclude that its work is correct. This is not a bug in any specific model — it is an inherent property of autoregressive language generation. The same process that produced the original output will produce a justification for that output. The model is not "checking" its work. It is generating plausible text about its work, which is a fundamentally different operation.
Research confirms this: studies on LLM-generated code find that 29-45% contains security vulnerabilities, yet models consistently rate their own output as correct. The confidence-correctness correlation in self-assessment approaches zero for non-trivial tasks.
GE's verification philosophy: never rely on the same agent (or the same type of verification) to check work that it produced. Verification requires oracle independence — the verifier must be independent of the producer.
## The Anti-LLM Quality Pipeline
GE's quality pipeline is specifically designed to counter the failure modes of LLM-generated code. It is called the "anti-LLM pipeline" because each stage targets a known LLM weakness.
### The 10 Stages
| Stage | Owner | What It Catches | LLM Weakness Targeted |
|---|---|---|---|
| 1. Formal specification | Anna | Ambiguous requirements | LLMs fill ambiguity with plausible but wrong assumptions |
| 2. Spec-first test generation | Antje | Requirements misinterpretation | LLMs test what code does, not what it should do |
| 3. Implementation | Developer | N/A (production stage) | N/A |
| 4. Deterministic quality gates | Koen | Syntax errors, type violations, dead code | LLMs generate syntactically plausible but incorrect code |
| 5. Contract checks | Koen | Interface violations | LLMs change function signatures without updating callers |
| 6. Security review | Victoria/Ron | Credential exposure, injection, authz bypass | LLMs hardcode secrets, ignore auth, trust user input |
| 7. Integration testing | Marije | Cross-module failures | LLMs optimize for the unit, not the system |
| 8. Adversarial testing | Ashley | Edge cases, malformed input | LLMs assume happy-path inputs |
| 9. Reconciliation | Jasper | Drift between spec and implementation | LLMs solve a different problem than specified |
| 10. Performance review | Nessa | Resource waste, scaling issues | LLMs ignore algorithmic complexity and memory allocation |
RULE: No code reaches production without passing all 10 stages. Stages cannot be skipped, reordered, or combined.
### Why 10 Stages?
Each stage catches errors that previous stages miss. Removing any stage creates a gap. The stages are ordered by cost — cheap checks first (linting, type checking), expensive checks last (integration testing, performance review). This minimizes the cost of catching errors: most defects are caught in stages 1-5, before expensive testing begins.
## Contract Checks (Mechanical Verification)
Contract checks are deterministic verifications that do not require LLM judgment. They are the fastest, cheapest, and most reliable form of verification.
### What Contracts Check
| Check | What It Verifies | Tool |
|---|---|---|
| Type checking | All types are consistent | TypeScript compiler, mypy |
| Lint | Code follows style rules | ESLint, ruff |
| Import resolution | All imports resolve to real files | Compiler/bundler |
| Dead code | No exports without consumers | Custom analysis |
| Schema validation | API payloads match defined schemas | Zod, JSON Schema |
| Migration consistency | Database schema matches ORM definitions | Drizzle check |
| Config integrity | No hardcoded values that should come from config | Custom grep |
| MAXLEN presence | Every Redis XADD includes MAXLEN | Custom grep |
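A "custom grep" check like the MAXLEN rule above can be sketched as a simple line scan. This is an illustrative sketch, not GE's actual tooling; the function name and reporting format are hypothetical, and a real check would walk the repository rather than scan a string.

```typescript
// Hypothetical contract check: flag Redis XADD calls that omit MAXLEN.
// Returns 1-indexed line numbers of violations for reporting.
function findUncappedXadds(source: string): number[] {
  const violations: number[] = [];
  source.split("\n").forEach((line, i) => {
    // An XADD call that does not mention MAXLEN on the same line is a violation.
    if (/\bxadd\s*\(/i.test(line) && !/maxlen/i.test(line)) {
      violations.push(i + 1);
    }
  });
  return violations;
}

const ok = `await redis.xadd("events", "MAXLEN", "~", "1000", "*", "k", "v");`;
const bad = `await redis.xadd("events", "*", "k", "v");`;

console.log(findUncappedXadds(ok));              // [] — capped stream, passes
console.log(findUncappedXadds(ok + "\n" + bad)); // [2] — violation on line 2
```

The check is deliberately dumb: no parsing, no LLM judgment, just a deterministic pattern that either matches or does not.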
### When Contracts Run
Contracts run at two points:
- Pre-completion. Before an agent declares its work done, all relevant contracts are checked automatically. If any contract fails, the agent must fix the violation before completing.
- CI gate. All contracts run in CI on every commit. A failing contract blocks the merge.
### Why Contracts, Not Reviews
A human (or LLM) reviewer might miss a type error. A type checker will not. A reviewer might overlook a missing MAXLEN on an XADD call. A grep pattern will not. Contracts are not replacements for reviews — they are a layer that catches mechanical errors before reviews begin, freeing reviewers to focus on logic and design.
RULE: If a defect can be caught by a deterministic check, it must be caught by a deterministic check. LLM review time is expensive. Linting time is free.
## Learning Protocols (Structured Self-Reflection)
While LLMs cannot reliably verify the correctness of their own output, they can extract patterns from their experience. GE's learning protocols capture these patterns in a structured format.
### Session Learning Extraction
After every agent session, the PTY capture output is analyzed to extract learnings:
- What was attempted? Task description and scope.
- What succeeded? Approaches that worked, with specific details.
- What failed? Approaches that did not work, with error messages and root causes.
- What patterns emerged? Recurring themes, techniques, or pitfalls.
- What should future agents know? Actionable advice for the next agent working in this domain.
These learnings are written to the wiki brain and become available for JIT injection in future sessions.
### The Struggle Detector
The struggle detector scores sessions across five dimensions:
| Dimension | What It Measures | High Score Indicates |
|---|---|---|
| Cost | Token consumption relative to task complexity | Inefficient prompting or excessive retries |
| Turns | Number of turns relative to task complexity | Confusion, context loss, or wrong approach |
| Failures | Count of failed attempts within the session | Misunderstanding requirements or unfamiliar domain |
| Tokens | Raw token volume (input + output) | Over-reading, over-generating, or context bloat |
| Outcome | Whether the task was completed successfully | Fundamental capability gap |
Sessions with high struggle scores are flagged for human review. The patterns in struggled sessions become high-priority learnings — they represent exactly the situations where agents need additional knowledge.
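The five dimensions above can be combined into a single score. The sketch below is illustrative only: the field names, weights, and normalization are assumptions, not GE's actual detector. The idea it demonstrates is that each dimension is normalized against an expectation for the task's complexity, and the outcome dimension dominates because an incomplete task is the strongest struggle signal.

```typescript
// Hypothetical struggle score: each dimension contributes 0 (fine) to 1 (severe),
// combined with illustrative weights into a 0..1 score.
interface SessionMetrics {
  cost: number;           // tokens consumed
  expectedCost: number;   // baseline for this task complexity
  turns: number;
  expectedTurns: number;
  failures: number;       // failed attempts within the session
  tokens: number;         // raw input + output volume
  expectedTokens: number;
  completed: boolean;
}

function struggleScore(m: SessionMetrics): number {
  const clamp = (x: number) => Math.min(1, Math.max(0, x));
  // Ratio-based dimensions score 0 until the expectation is exceeded.
  const cost = clamp(m.cost / m.expectedCost - 1);
  const turns = clamp(m.turns / m.expectedTurns - 1);
  const failures = clamp(m.failures / 5); // five or more failures saturates
  const tokens = clamp(m.tokens / m.expectedTokens - 1);
  const outcome = m.completed ? 0 : 1;
  // Outcome weighted most heavily: incompletion is the strongest signal.
  return 0.15 * cost + 0.15 * turns + 0.2 * failures + 0.1 * tokens + 0.4 * outcome;
}

const smooth = struggleScore({
  cost: 900, expectedCost: 1000, turns: 4, expectedTurns: 5,
  failures: 0, tokens: 8000, expectedTokens: 10000, completed: true,
});
console.log(smooth); // 0 — under budget on every dimension
```

A session over this score's flagging threshold would then be queued for the human review described above.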
## Checklists (Machine-Parseable, Organically Growing)
GE agents use checklists as structured verification aids. Unlike free-form self-assessment ("does this look right?"), checklists force specific checks.
### Checklist Design Principles
1. Machine-parseable. Checklists use checkbox markdown (`- [ ]` / `- [x]`) so completion can be verified automatically.
2. Specific, not general. Bad: `- [ ] Code is correct.` Good: `- [ ] All Redis XADD calls include MAXLEN.`
3. Organically growing. When a new failure mode is discovered, a new checklist item is added. Checklists accrete knowledge over time. They are never "complete."
4. Role-specific. Each agent has checklists relevant to their function. A security reviewer's checklist differs from a backend developer's checklist.
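Principle 1 can be enforced mechanically. A minimal sketch (the function name is hypothetical, not GE's actual parser): extract the unticked items from a markdown checklist, so completion can be blocked while any remain.

```typescript
// Hypothetical checklist gate: return the text of every unticked checkbox.
// An empty result means the agent may declare completion.
function uncheckedItems(markdown: string): string[] {
  return markdown
    .split("\n")
    .filter((line) => /^\s*- \[ \]/.test(line))       // unticked boxes only
    .map((line) => line.replace(/^\s*- \[ \]\s*/, "")); // strip the checkbox prefix
}

const checklist = [
  "- [x] All Redis XADD calls include MAXLEN",
  "- [ ] No hardcoded ports, URLs, or credentials",
].join("\n");

console.log(uncheckedItems(checklist));
// ["No hardcoded ports, URLs, or credentials"] — completion is blocked
```

Because the format is this regular, the check needs no LLM judgment at all.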
### Example: Backend Developer Pre-Completion Checklist
- [ ] All new functions have TypeScript types (no `any`)
- [ ] All database queries use parameterized inputs (no string concatenation)
- [ ] All Redis XADD calls include MAXLEN (~100 per-agent, ~1000 system)
- [ ] No hardcoded ports, URLs, or credentials
- [ ] All new exports have at least one consumer
- [ ] Error paths return structured errors, not string messages
- [ ] All async functions have error handling (try/catch or .catch())
- [ ] New API endpoints have Zod schema validation on input
- [ ] No direct file system writes in request handlers
- [ ] Config values read from config files, not constants
### Checklist Evolution
Checklists grow through two mechanisms:
- Incident response. When a production incident is traced to a class of defect, a checklist item is added to prevent recurrence.
- Learning extraction. When multiple sessions struggle with the same issue, a checklist item is added to preempt the struggle.
RULE: Checklists are never shortened. Items can be retired (marked as automated by a contract check) but the check still runs — it just runs as code instead of as a manual verification.
## Mutation Testing as LLM-Specific Quality Gate
Mutation testing introduces small changes (mutations) to the code and verifies that the test suite catches them. If a mutation survives (tests still pass despite the change), the test suite has a gap.
### Why Mutation Testing Matters for LLM Code
LLMs tend to produce tests that verify the happy path but miss edge cases. A test suite with 100% line coverage but 40% mutation score provides a false sense of security. Mutation testing reveals the difference between "code was executed" (coverage) and "code behavior was verified" (mutation score).
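A tiny illustration of that gap (an assumed example, not GE code): a boundary mutation that a happy-path test suite executes fully yet never detects.

```typescript
// Original: free shipping at or above the threshold.
const freeShipping = (total: number): boolean => total >= 50;

// Mutant: ">=" changed to ">". Behavior differs only at exactly 50.
const freeShippingMutant = (total: number): boolean => total > 50;

// A happy-path suite with 100% line coverage passes against BOTH versions:
console.log(freeShipping(80) === true && freeShipping(20) === false);             // true
console.log(freeShippingMutant(80) === true && freeShippingMutant(20) === false); // true — mutant survives

// The test that kills the mutant probes the boundary itself:
console.log(freeShipping(50));       // true
console.log(freeShippingMutant(50)); // false — mutant caught
```

Line coverage reports both versions as fully exercised; only the boundary assertion distinguishes them, which is exactly what a mutation score measures and coverage does not.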
### GE's Mutation Testing Protocol
1. Developer completes implementation and all tests pass.
2. Mutation testing tool runs against the test suite.
3. Surviving mutants are categorized:
    - Equivalent mutants (mutation does not change behavior) — excluded.
    - Trivial mutants (mutation is caught by the type system) — excluded.
    - Real mutants (mutation changes behavior, yet tests still pass) — test gap identified.
4. For each real mutant, Antje writes a new test that kills it.
5. Mutation testing runs again until the mutation score meets the minimum for the code's category.
### Mutation Score Targets
| Code Category | Minimum Mutation Score |
|---|---|
| Security-critical (auth, crypto) | 90% |
| Business logic (billing, routing) | 80% |
| Infrastructure (deployment, config) | 70% |
| UI components | 60% |
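Following the categorization in the protocol above, the score compares killed mutants against only the relevant ones: equivalent and trivial mutants are excluded from the denominator before checking the category threshold. The function and field names below are an illustrative sketch, not the output format of any particular mutation tool.

```typescript
// Hypothetical mutation-score calculation per GE's categorization rules.
interface MutationRun {
  total: number;      // all mutants generated
  equivalent: number; // behavior unchanged — excluded
  trivial: number;    // caught by the type system — excluded
  killed: number;     // mutants the test suite detected
}

function mutationScore(run: MutationRun): number {
  const relevant = run.total - run.equivalent - run.trivial;
  // With no relevant mutants there is nothing left to kill.
  return relevant === 0 ? 1 : run.killed / relevant;
}

const score = mutationScore({ total: 120, equivalent: 10, trivial: 10, killed: 85 });
console.log(score);        // 0.85
console.log(score >= 0.8); // true — meets the business-logic target
console.log(score >= 0.9); // false — below the security-critical target
```

The same run can therefore pass or fail depending on the code category, which is why the targets table is keyed by category rather than a single global threshold.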
## Reconciliation (Independent Work Product Comparison)
Reconciliation is GE's most novel verification technique. Two agents independently produce work products from the same specification. Their outputs are compared. Discrepancies indicate either ambiguity in the specification or errors in one (or both) outputs.
### How It Works
1. Anna produces a formal specification.
2. Antje derives tests from the specification (without seeing any implementation).
3. A developer implements the feature (without seeing Antje's tests beyond what is needed to pass them).
4. Jasper compares what the tests verify against what the implementation does:
    - Any behavior that the implementation provides but the tests do not verify is suspicious.
    - Any behavior that the tests expect but the implementation does not provide is a bug.
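If behaviors are represented as labels, the comparison step reduces to two set differences. The data model below is an illustrative assumption, not how Jasper actually represents work products.

```typescript
// Hypothetical reconciliation: compare behaviors verified by tests against
// behaviors provided by the implementation.
function reconcile(testedBehaviors: string[], implementedBehaviors: string[]) {
  const tested = new Set(testedBehaviors);
  const implemented = new Set(implementedBehaviors);
  return {
    // Implemented but never verified by a test — suspicious.
    suspicious: implementedBehaviors.filter((b) => !tested.has(b)),
    // Expected by tests but not implemented — a bug.
    bugs: testedBehaviors.filter((b) => !implemented.has(b)),
  };
}

const result = reconcile(
  ["rejects expired token", "returns 404 on missing user"],
  ["rejects expired token", "logs raw request body"],
);
console.log(result.suspicious); // ["logs raw request body"]
console.log(result.bugs);       // ["returns 404 on missing user"]
```

The hard part in practice is producing the behavior labels from two artifacts that were never designed to be compared; the set arithmetic itself is trivial.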
### Why Reconciliation Works
This is oracle independence applied to the full pipeline. The test writer and the code writer both work from the same specification but bring different interpretations, different blind spots, and different assumptions. Their disagreements are the most valuable signal in the pipeline — they reveal the places where the specification was ambiguous or where an agent went wrong.
A single agent verifying its own work will not find these discrepancies. It has one interpretation, one set of blind spots, and one set of assumptions. Reconciliation provides the perspective that self-verification cannot.
## The Independence Principle
RULE: For any verification to be meaningful, the verifier must be independent of the producer.
| Verification Type | Independence Level | Reliability |
|---|---|---|
| Same agent checks own work | None | Unreliable |
| Same model checks another instance's work | Low | Slightly better |
| Different model checks work | Medium | Moderate |
| Deterministic tool checks work | High | Reliable |
| Independent agent from spec checks work | High | Reliable |
| Human checks work | Highest | Most reliable (but expensive) |
GE's pipeline stacks these independence levels. Deterministic checks catch mechanical errors. Independent agents catch logical errors. Reconciliation catches specification ambiguity. Human review catches everything else, applied selectively at the highest-risk points.
The goal is not to eliminate all errors — that is impossible. The goal is to make errors progressively harder to survive through the pipeline, so that the errors that reach production are rare, minor, and quickly detected.