Agentic Software Engineering

The State of the Field (2024-2026)

Agentic software engineering is the practice of building production software using autonomous LLM-powered agents that operate within structured pipelines. It is not prompt engineering. It is not chatbot development. It is the systematic application of large language models as specialized workers in a disciplined software delivery process.

The field emerged in 2024 when multi-agent frameworks like MetaGPT, ChatDev, and AutoGen demonstrated that LLMs could be organized into teams that collaborate on software tasks. By early 2025, a systematic review by He, Treude, and Lo identified 41 primary studies on LLM-based multi-agent systems for software engineering, establishing the field as a distinct research area. By mid-2025, more than half of surveyed organizations reported running agents in production environments (LangChain State of Agent Engineering).

Yet the field remains immature. Analysis of 1,642 multi-agent execution traces across seven state-of-the-art systems reveals failure rates between 41% and 86.7%. Most failures arise not from individual agent limitations but from the challenges of inter-agent interaction — communication breakdowns, role confusion, cascading errors, and runaway costs. The gap between demo and production is enormous.

GE operates in this gap. We are not building toy systems or research prototypes. We are running a production software agency with 60 specialized agents delivering enterprise-grade SaaS to paying clients.


What Works

Specialization Over Generality

A general-purpose agent that "does everything" performs worse than a specialist that does one thing well. This is not intuitive — LLMs are generalist models — but it is empirically true in production settings.

When an agent has a narrow role (test generation, code review, infrastructure deployment), its identity prompt can be precise, its context window can be focused, and its output can be mechanically verified. A generalist agent's output is harder to constrain, harder to verify, and harder to improve.

GE implements this through 60 agents organized by function: client relations, development (two parallel teams), quality assurance, infrastructure, security, knowledge management, and orchestration. Each agent has a defined role, defined boundaries, and defined handoff points.
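Defined roles, boundaries, and handoff points lend themselves to a declarative structure. The sketch below is illustrative only: the field names (`boundaries`, `handoff_to`) and the example roles are assumptions, not GE's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of an agent role definition. Boundaries name
# actions the agent may NOT take; handoffs are an explicit allow-list.
@dataclass(frozen=True)
class AgentRole:
    name: str                      # e.g. "test-generator"
    function: str                  # organizational function
    boundaries: tuple[str, ...]    # forbidden actions
    handoff_to: tuple[str, ...]    # roles that may receive this agent's output

    def may_hand_off_to(self, other: "AgentRole") -> bool:
        # Only declared targets are allowed; everything else is rejected.
        return other.name in self.handoff_to

test_gen = AgentRole(
    name="test-generator",
    function="quality assurance",
    boundaries=("deploy", "modify-spec"),
    handoff_to=("implementer",),
)
impl = AgentRole("implementer", "development", ("deploy",), ("reviewer",))

assert test_gen.may_hand_off_to(impl)
assert not impl.may_hand_off_to(test_gen)
```

Making the handoff graph data rather than prose means a router can enforce it mechanically instead of trusting each agent to stay in its lane.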

Pipelines Over Autonomy

The dominant failure mode in agentic systems is over-autonomy — agents making decisions they lack the context or authority to make. The solution is pipeline architecture: work flows through a defined sequence of stages, each with clear inputs, outputs, and verification gates.

GE's pipeline mirrors a traditional software agency:

Client Contact -> Intake -> Scoping -> Specification -> Test Generation
  -> Implementation -> Quality Gates -> Integration Testing -> Deployment

Each stage is owned by specific agents. Handoffs are explicit. No agent can skip a stage. This is less exciting than "autonomous AI developer" but it actually works.
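The stage sequence above can be sketched as a loop that refuses to skip ahead. This is a minimal illustration, assuming each stage exposes a run step and a verification gate; the gate logic is hypothetical, not GE's implementation.

```python
from typing import Callable

# Stage names follow the pipeline diagram above.
STAGES = [
    "client-contact", "intake", "scoping", "specification",
    "test-generation", "implementation", "quality-gates",
    "integration-testing", "deployment",
]

def run_pipeline(work: dict,
                 run: Callable[[str, dict], dict],
                 verify: Callable[[str, dict], bool]) -> dict:
    """Advance work through every stage in order; no stage may be skipped."""
    for stage in STAGES:
        work = run(stage, work)
        if not verify(stage, work):
            # A failed gate halts the pipeline instead of passing
            # bad output downstream.
            raise RuntimeError(f"verification gate failed at {stage!r}")
    return work

# Trivial usage: each stage stamps the work item, every gate passes.
result = run_pipeline({}, lambda s, w: {**w, s: "done"}, lambda s, w: True)
assert list(result) == STAGES
```

The point of the structure is the `raise`: failure stops the flow at the gate where it occurred, rather than letting a later stage inherit a broken artifact.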

Mechanical Verification at Every Stage

LLM output cannot be trusted at face value. Research shows that 29-45% of AI-generated code contains security vulnerabilities, and nearly 20% of package recommendations reference libraries that do not exist. An LLM's confidence in its own output is not correlated with its correctness.

The solution is mechanical verification: automated checks that do not rely on LLM judgment. Linting, type checking, test execution, static analysis, and deterministic contract validation. These are cheap, fast, and reliable. They catch the most common classes of LLM error before any human or agent reviewer needs to look at the code.

Persistent Identity

Agents with detailed, consistent identities produce better work than agents with generic prompts. This is the "persona effect" — a well-characterized agent behaves more predictably, maintains role boundaries, and produces output that is stylistically and functionally consistent across tasks.

GE implements this through tiered identity files: a compact CORE identity (~1,200 tokens) that fits in every context window, a ROLE identity (~2,500 tokens) for task-relevant detail, and a REFERENCE identity (~3,500 tokens) for commissioning and alignment.
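The token budgets above suggest a simple selection rule. The sketch below assumes the richest tier that fits the remaining context budget is chosen, with CORE as the guaranteed fallback; that selection policy is an inference from the text, not a documented GE mechanism.

```python
# Tier names and approximate token costs follow the text.
TIERS = [
    ("REFERENCE", 3500),   # commissioning and alignment
    ("ROLE", 2500),        # task-relevant detail
    ("CORE", 1200),        # compact; sized to fit every context window
]

def select_identity_tier(available_tokens: int) -> str:
    """Pick the richest identity tier that fits the remaining budget,
    falling back to CORE, which always fits by construction."""
    for name, cost in TIERS:
        if cost <= available_tokens:
            return name
    return "CORE"

assert select_identity_tier(8000) == "REFERENCE"
assert select_identity_tier(3000) == "ROLE"
assert select_identity_tier(1500) == "CORE"
```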


What Does Not Work

Full Autonomy

No current LLM is reliable enough to make architectural, security, or business decisions without human oversight. Agents that operate autonomously in these domains will eventually make a catastrophic error. The question is not whether, but when.

GE's principle: agents execute policy, humans decide policy.

Single-Agent Architecture

A single LLM instance handling the entire software lifecycle has no checks, no balances, and no error correction. When it hallucinates, nothing catches it. When it forgets context, nobody notices. When it makes a bad architectural decision, it compounds the error through every subsequent step.

Multi-agent architectures are harder to build and more expensive to run, but they provide the redundancy and cross-checking that production software demands.

Unstructured Communication

Agents communicating through free-form natural language lose information, misinterpret instructions, and introduce ambiguity at every handoff. Structured communication — typed messages, defined fields, machine-parseable formats — eliminates an entire class of failures.

GE uses Redis Streams with defined message schemas. Every message has a type, a source, a target, and a structured payload. This is auditable, replayable, and unambiguous.
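The four required fields (type, source, target, payload) come from the text; the validation and serialization details below are illustrative. Since Redis Streams store flat string fields, a nested payload would be JSON-encoded before `XADD` and rejected at the boundary if malformed.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("type", "source", "target", "payload")

@dataclass(frozen=True)
class AgentMessage:
    type: str      # e.g. "task.handoff"
    source: str    # sending agent id
    target: str    # receiving agent id
    payload: dict  # structured, machine-parseable body

    def to_stream_entry(self) -> dict[str, str]:
        # Flatten to string fields, as a Redis Stream entry requires.
        return {"type": self.type, "source": self.source,
                "target": self.target, "payload": json.dumps(self.payload)}

def parse_stream_entry(entry: dict[str, str]) -> AgentMessage:
    missing = [f for f in REQUIRED_FIELDS if f not in entry]
    if missing:
        # Malformed messages are rejected at the boundary, never guessed at.
        raise ValueError(f"missing fields: {missing}")
    return AgentMessage(entry["type"], entry["source"], entry["target"],
                        json.loads(entry["payload"]))

msg = AgentMessage("task.handoff", "spec-writer", "test-generator",
                   {"task_id": 7})
assert parse_stream_entry(msg.to_stream_entry()) == msg
```

Round-tripping through the flat form is what makes the stream replayable: any consumer can reconstruct the exact message an agent sent.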

Scaling Before Pipeline Maturity

Adding more agents to a broken pipeline does not fix it. It multiplies the failure rate. GE learned this the hard way: premature scaling of hook chains and monitoring agents caused a $100/hour token burn that required emergency intervention.

The rule: prove the pipeline works with a small team before expanding. Every new agent must demonstrate measurable value before becoming permanent.


GE's Position in the Landscape

Most multi-agent frameworks focus on the mechanics of agent communication (who talks to whom, in what order). GE focuses on something different: institutional knowledge.

The GE Brain — a MkDocs wiki fed by PTY capture, discussion outcomes, and structured learning extraction — gives agents access to the accumulated experience of every previous session. When an agent starts a task, it does not begin from zero. It begins with relevant learnings from every agent that has worked in the same domain before.

This is the difference between a team that learns and a team that repeats mistakes. Most multi-agent systems are stateless: they solve each problem from scratch. GE is stateful: it gets better over time.
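Just-in-time injection of prior learnings can be sketched as a retrieval step before a task starts. The store layout (learnings tagged by domain) and the tag-overlap scoring below are assumptions about how such a lookup might work, not a description of the GE Brain's internals.

```python
def relevant_learnings(store: list[dict], task_tags: set[str],
                       limit: int = 3) -> list[str]:
    """Rank stored learnings by tag overlap with the task and return the
    top few for injection into the agent's context window."""
    scored = [
        (len(task_tags & set(item["tags"])), item["text"])
        for item in store
        if task_tags & set(item["tags"])   # drop unrelated learnings
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:limit]]

# Hypothetical store contents for illustration.
store = [
    {"tags": ["redis", "streams"], "text": "Use consumer groups for replay."},
    {"tags": ["deploy"], "text": "Run migrations before cutover."},
    {"tags": ["redis"], "text": "Set MAXLEN to bound stream memory."},
]
assert relevant_learnings(store, {"redis", "streams"}) == [
    "Use consumer groups for replay.",
    "Set MAXLEN to bound stream memory.",
]
```

The agent starting a Redis task receives the two Redis learnings and nothing about deployments: relevant prior experience, not the whole archive.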

Key Differentiators

| Dimension     | Typical Multi-Agent System        | GE                                         |
|---------------|-----------------------------------|--------------------------------------------|
| Knowledge     | Stateless per session             | Persistent wiki brain with JIT injection   |
| Quality       | Self-assessed or single-reviewer  | 10-stage anti-LLM pipeline                 |
| Communication | Direct agent-to-agent             | Async via Redis Streams (auditable)        |
| Governance    | Implicit or absent                | Constitutional principles (10 rules)       |
| Human role    | Prompt author                     | CEO/Product Owner with decision authority  |
| Scaling       | Add agents freely                 | Prove pipeline, then scale                 |

The Maturity Model

Agentic engineering maturity can be assessed on a five-level scale. Most organizations are at Level 1 or 2. GE operates at Level 4 and is working toward Level 5.

Level 1: Assisted

LLMs are used as code completion tools. A human developer writes most code and uses AI for autocompletion, boilerplate generation, or answering questions. No autonomous execution. No pipeline.

Characteristic: Human does the work, AI speeds it up.

Level 2: Delegated

Individual tasks are delegated to LLM agents. A human defines the task, the agent executes it, and the human reviews the output. No inter-agent communication. No persistent knowledge.

Characteristic: AI does some work, human reviews all of it.

Level 3: Orchestrated

Multiple agents work together in a defined pipeline. An orchestrator routes work between agents. Some verification is automated. Human intervention is required for key decisions but not for every step.

Characteristic: AI team does the work, human manages the team.

Level 4: Self-Learning

The agent system captures learnings from every session and injects them into future sessions. Agents improve over time. The system gets faster, cheaper, and more reliable with use. Human intervention is limited to policy decisions and exception handling.

Characteristic: AI team does the work and learns from it. Human sets direction.

Level 5: Self-Healing

The agent system detects failures, diagnoses root causes, and implements fixes without human intervention (within defined safety boundaries). Monitoring agents oversee production agents. The human defines the boundaries of self-healing authority.

Characteristic: AI team does the work, learns from it, and fixes its own problems. Human defines the rules.


Open Research Questions

Agentic software engineering is young enough that many fundamental questions remain unanswered. GE's operational experience contributes evidence toward answers, but these are not settled:

  1. Optimal agent count. Is there a point of diminishing returns when adding agents? GE operates with 60, but would 30 produce the same output with less coordination overhead? Would 120 produce more?

  2. Cross-model consensus. When agents on different model providers (Claude, OpenAI, Gemini) disagree, is the disagreement signal more valuable than same-provider disagreement? Early GE evidence suggests yes, but the sample size is small.

  3. Learning saturation. Does the wiki brain's value plateau as it accumulates more knowledge? Is there a point where older learnings become noise rather than signal?

  4. Trust calibration. How should trust in agents change over time? An agent that has completed 500 tasks successfully has a track record. Should it earn more autonomy? Or does the LLM's inherent unreliability mean trust should never increase?

  5. Human attention allocation. Given finite human attention, which review points produce the highest return? GE's current model prioritizes specification review, but is that optimal?

These questions will be answered through continued operation and measurement. The wiki brain captures the evidence. Future analysis will extract the patterns.


The Core Insight

Agents are not autonomous developers. They are specialized workers in a pipeline with checks and balances. The moment you treat an LLM as an autonomous decision-maker, you have introduced unbounded risk. The moment you treat it as a constrained worker with mechanical verification at every stage, you have a system that can produce enterprise-grade software.

Trust but verify. At every stage. Without exception.


This Section

This methodology section documents how GE builds software with AI agents. It is organized by concern:


References

Academic

  • He, Treude, Lo. "LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead." ACM TOSEM, 2024. arXiv:2404.04834
  • Cemri, Pan, Yang. "Why Do Multi-Agent LLM Systems Fail?" arXiv:2503.13657, March 2025
  • "Designing LLM-based Multi-Agent Systems for SE Tasks: Quality Attributes, Design Patterns and Rationale." arXiv:2511.08475, 2025
  • "A Survey on LLM-based Multi-Agent System: Recent Advances and New Frontiers." arXiv:2412.17481, December 2024
  • Tawosi et al. "ALMAS: An Autonomous LLM-based Multi-Agent Software Engineering Framework." arXiv:2510.03463, October 2025

Industry

  • LangChain. "State of Agent Engineering." 2025
  • "Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering." arXiv:2601.14470, January 2026

GE Internal