Why Is the AI Agent Production Stack So Confusing?
If you're evaluating how to run AI agents in production, you've encountered a growing list of vendors. LangSmith for tracing. Portkey for gateway and observability. Guardrails AI for output validation. Braintrust for evaluation. And dozens more offering pieces of the production puzzle.
The challenge isn't finding tools. It's understanding what category each tool belongs to, what architectural layer it addresses, and what gaps remain when you assemble them into a stack.
Running agents in production requires five distinct capabilities. Most vendors cover one. Some cover two. Almost none compose all five into a unified execution plane. Understanding the layers helps enterprise teams build the right stack — and identify the Decision Infrastructure gap that no point solution addresses.
TL;DR
- Five layers are required to run AI agents in production: observability and tracing, guardrails and output validation, gateway and routing, evaluation and testing, and governed execution runtime.
- Most vendors cover one layer. LangSmith covers tracing. Guardrails AI covers output validation. Portkey covers gateway routing. Braintrust covers evaluation. None composes all five.
- The gaps between layers are where production failures live. Point solutions don't share context, policy state, or provenance — creating seams where silent failures, systemic risk, cost blowups, and audit failures occur.
- Layer 5 — the Governed Execution Runtime — is the missing substrate. It composes the other four layers into a unified execution plane with shared context, policy enforcement, and decision traces.
- Build Agents (ElixirData) provides this Layer 5, powered by Context OS, integrating with any agent framework and complementing existing observability, guardrails, gateway, and evaluation tools.
What Are the Five Layers of an AI Agent Production Stack?
The following table maps the complete production stack, from observability through governed execution. Each layer solves a specific problem — but each also has structural limits that the next layer must address.
| Layer | Function | Representative Vendors | Structural Limitation |
|---|---|---|---|
| 1. Observability & Tracing | Captures logs, traces, and metrics from agent execution | LangSmith, LangFuse, Arize, Helicone, Portkey (partial) | Shows what happened after the fact — cannot prevent bad actions from committing |
| 2. Guardrails & Output Validation | Validates agent outputs against predefined rules | Guardrails AI, NeMo Guardrails, Rebuff, Lakera | Runs after the agent has decided — cannot enforce policy during reasoning or tool execution |
| 3. Gateway & Routing | Manages LLM API calls with load balancing, fallbacks, caching, cost tracking | Portkey, LiteLLM, Martian, Not Diamond | Operates at the LLM API layer — cannot enforce business policy on tool calls or provide decision traces |
| 4. Evaluation & Testing | Tests agent behavior against benchmarks, regressions, and quality metrics | Braintrust, Promptfoo, Patronus AI, DeepEval | Runs in pre-production — cannot enforce governance at runtime or produce production-grade traces |
| 5. Governed Execution Runtime | Turns agent reasoning into deterministic, auditable execution with policy enforcement, tool brokering, and decision traces | ElixirData (Build Agents) | None at this level; it composes the other four layers into a unified execution plane |
FAQ: Do I need all five layers?
Yes, for production-grade enterprise deployments. The first four are necessary but insufficient. Layer 5 provides the execution substrate that makes them work together as a governed system rather than independent point solutions.
Layer 1: What Does Observability and Tracing Solve — and What Doesn't It?
What it is: Capturing logs, traces, and metrics from agent execution for debugging and performance monitoring.
Vendors: LangSmith, LangFuse, Arize, Helicone, Portkey (partial).
What it solves: You can see what happened. You can debug failures. You can measure latency, token usage, and cost per call. For teams moving from experimentation to early deployment, observability is the first layer most adopt — and it's genuinely valuable.
What it doesn't solve: Observability is inherently reactive. It tells you what happened after the action has committed. It cannot:
- Prevent a bad action from executing in the first place
- Enforce policy and authority before tool calls commit
- Provide evidence-grade decision traces with context provenance, policy evaluation, and authority verification
- Enable replay with full context and policy state for forensic reconstruction
As we documented in our analysis of the silent failure mode, an agent can show 100% task completion and zero errors in observability dashboards while producing systematically wrong outcomes. Observability captures system health — not decision quality.
FAQ: Can't LangSmith traces serve as audit evidence?
LangSmith traces capture LLM calls and tool invocations. They don't capture policy evaluations, authority verification, context provenance, or the institutional reasoning chain that regulators and auditors require. Decision traces and operational logs are fundamentally different data structures.
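To make the distinction concrete, here is a minimal sketch of the two data structures. All field names are illustrative assumptions, not any vendor's schema: the point is that a decision trace carries provenance, policy, and authority fields that an operational log entry simply has no slot for.

```python
from dataclasses import dataclass

# An operational log entry records WHAT the system did.
@dataclass
class LogEntry:
    timestamp: str
    tool: str
    args: dict
    result: str

# A decision trace additionally records WHY the action was allowed:
# the context the agent reasoned from, the policy version evaluated,
# and the authority chain that authorized the action.
@dataclass
class DecisionTrace:
    timestamp: str
    tool: str
    args: dict
    result: str
    context_provenance: list   # which systems of record supplied context
    policy_id: str             # policy version that was evaluated
    policy_verdict: str        # allow / modify / approve / block
    authority_chain: list      # identities that authorized the action

log = LogEntry("2025-01-01T00:00:00Z", "refund", {"order": "A1"}, "ok")
trace = DecisionTrace(
    "2025-01-01T00:00:00Z", "refund", {"order": "A1"}, "ok",
    context_provenance=["crm:v12", "billing:v7"],
    policy_id="refund-policy@3",
    policy_verdict="allow",
    authority_chain=["agent:support-bot", "approver:jane@corp"],
)
```

An auditor asking "which policy version allowed this refund, and on what context?" can be answered from `trace` but not from `log` — that structural difference is what the FAQ above is pointing at.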
Layer 2: What Do Guardrails and Output Validation Solve — and What Don't They?
What it is: Validating agent outputs against predefined rules before they reach the user or a downstream system.
Vendors: Guardrails AI, NeMo Guardrails, Rebuff, Lakera.
What it solves: You can catch obviously bad outputs — PII leakage, toxic content, format violations, factual inconsistencies. For customer-facing agents, output validation is a necessary safety net.
What it doesn't solve: Output validation is a filter, not a runtime. It runs after the agent has already decided what to do and often after tool calls have already executed. It cannot:
- Prevent the agent from accessing unauthorized data during reasoning
- Enforce tenant isolation at tool execution time
- Apply budget limits or circuit breakers to tool call loops
- Guarantee idempotent tool execution or staged commits
As the guardrails objection analysis in Part 1 explains, post-hoc guardrails fail for three structural reasons: they're reactive, they don't compose across failure modes, and they cannot prove compliance to regulators. Policy enforcement must happen before execution, not after.
FAQ: Can guardrails prevent prompt injection attacks?
Guardrails can detect some injection patterns in outputs. They cannot prevent an injected instruction from causing unauthorized tool calls during agent reasoning. That requires policy and authority enforcement at the execution layer — Layer 5.
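The "enforcement before execution" idea can be sketched as a policy gate that every tool call must pass through before it commits. This is a minimal illustration, not ElixirData's implementation; the rule shape, verdict strings, and function names are assumptions for the sketch.

```python
class PolicyViolation(Exception):
    """Raised when a tool call is blocked before it can execute."""

def policy_gate(identity, tool_name, args, policies):
    """Evaluate every policy rule BEFORE the tool call runs."""
    for rule in policies:
        verdict = rule(identity, tool_name, args)
        if verdict in ("block", "approve"):
            return verdict
    return "allow"

def governed_call(identity, tool_name, args, tools, policies):
    """A tool call that cannot execute unless policy allows it."""
    verdict = policy_gate(identity, tool_name, args, policies)
    if verdict == "block":
        raise PolicyViolation(f"{identity} may not call {tool_name}")
    if verdict == "approve":
        return {"status": "pending_approval"}  # escalate to a human
    return tools[tool_name](**args)

# Example rule: only the billing role may issue refunds. An injected
# instruction reaching a support agent still cannot trigger the tool.
def refund_rule(identity, tool_name, args):
    if tool_name == "refund" and identity != "role:billing":
        return "block"
    return "allow"
```

Contrast this with an output filter: here the unauthorized refund never runs, so there is nothing to catch after the fact.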
Layer 3: What Do Gateways and Routing Solve — and What Don't They?
What it is: Managing API calls between agents and LLM providers with load balancing, fallbacks, caching, and cost tracking.
Vendors: Portkey, LiteLLM, Martian, Not Diamond.
What it solves: You can manage multi-model deployments, reduce costs through semantic caching, route requests to optimal providers based on latency or cost, and maintain fallback chains when a provider is unavailable.
What it doesn't solve: Gateways operate at the LLM API layer — between the agent and the model. They have no visibility into what the agent does with the model's output. They cannot:
- Enforce business policy on tool calls that result from model reasoning
- Provide decision traces that capture context provenance and authority verification
- Apply staged commits, rollback, or idempotency to tool execution
- Evaluate whether an agent action is authorized under enterprise ABAC/ReBAC policies
Gateways are essential infrastructure for LLM cost management and reliability. But they address model access, not action governance. The zero-trust gateway pattern described in our pillar page operates at the tool execution layer, not the LLM API layer — a fundamentally different architectural position.
FAQ: Can Portkey's cost tracking replace runtime budget enforcement?
Portkey tracks LLM API costs. Tool execution costs — external API calls, compute, database operations — are invisible to LLM gateways. Runtime budget enforcement must operate at Layer 5, where all tool calls are brokered.
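Because all tool calls pass through one broker at Layer 5, budget limits and circuit breakers can be enforced at the point of execution rather than inferred from API bills afterward. A minimal sketch, with class and limit names as illustrative assumptions:

```python
class BudgetExceeded(Exception):
    """Circuit breaker: the agent's runtime budget is exhausted."""

class ToolBroker:
    """Brokers every tool call so spend and call counts are enforced
    at runtime, covering external APIs, compute, and database work
    that an LLM gateway never sees."""

    def __init__(self, max_cost, max_calls):
        self.max_cost = max_cost
        self.max_calls = max_calls
        self.spent = 0.0
        self.calls = 0

    def invoke(self, tool, cost, *args, **kwargs):
        # Check limits BEFORE the call commits, so a runaway reasoning
        # loop is halted instead of merely measured.
        if self.calls + 1 > self.max_calls or self.spent + cost > self.max_cost:
            raise BudgetExceeded("budget or call limit reached; halting loop")
        self.calls += 1
        self.spent += cost
        return tool(*args, **kwargs)
```

A gateway's cost dashboard would report the overspend the next morning; the broker refuses the call in the moment.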
Layer 4: What Do Evaluation and Testing Solve — and What Don't They?
What it is: Testing agent behavior against benchmarks, regression suites, and quality metrics — primarily in pre-production environments.
Vendors: Braintrust, Promptfoo, Patronus AI, DeepEval.
What it solves: You can measure agent quality before deployment. You can detect regressions when prompts, models, or configurations change. You can score outputs against ground truth and identify quality degradation.
What it doesn't solve: Evaluation is predominantly a pre-production activity. It cannot:
- Enforce governance at runtime when agents encounter production contexts not covered by test suites
- Provide real-time policy enforcement during execution
- Generate production-grade decision traces that feed back into evaluation automatically
- Detect drift caused by changes in enterprise data, policies, or system state that test environments don't replicate
The feedback loops primitive in a Governed Agent Runtime closes this gap by routing production decision traces into evaluation pipelines automatically — enabling continuous improvement measured against real outcomes, not synthetic benchmarks. This is what ElixirData calls Agentic Context Engineering.
FAQ: Can evaluation tools detect production failures?
Pre-production evaluation can catch some failure patterns. But silent failures caused by stale production context, cost blowups from novel reasoning loops, and systemic risk from prompt injections only manifest in production. Runtime governance is required.
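The feedback loop described above — production decision traces flowing back into evaluation — can be sketched as a simple transform. The trace and case field names are assumptions carried over from the illustration of decision traces, not a vendor schema:

```python
def traces_to_eval_cases(traces):
    """Route production decision traces into an evaluation dataset,
    so regression suites test against real outcomes rather than
    synthetic benchmarks."""
    return [
        {
            "input": t["request"],
            "context": t["context_provenance"],
            "expected_verdict": t["policy_verdict"],
        }
        for t in traces
        if t.get("policy_verdict") is not None  # skip incomplete traces
    ]
```

Each production decision becomes a regression case: if a later prompt, model, or policy change produces a different verdict on the same context, the evaluation suite flags the drift.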
Why Do the Gaps Between Layers Cause Production Failures?
Most enterprises attempt to assemble their agent production stack from point solutions. They add LangSmith for tracing, Guardrails AI for output validation, Portkey for gateway management, and Braintrust for evaluation. Each tool solves its layer well.
But the gaps between layers are where production failures live.
Consider three scenarios that point-solution stacks cannot prevent:
- The guardrail catches a bad output, but the tool call already executed. A payment was processed, a database was modified, or a compliance workflow was triggered — the output filter stopped the response to the user, but the downstream action already committed. The guardrail and the tool execution layer don't share a common commit protocol.
- The trace shows what happened, but can't reconstruct why the policy was violated. The observability tool captured the API calls. But it didn't capture the context bundle the agent reasoned from, the policy version that was (or wasn't) evaluated, or the authority chain that authorized the action. Logs are not evidence.
- The evaluation suite passes in pre-production, but the agent encounters a context configuration in production that wasn't tested. Enterprise data changes constantly. Policies update. Tenant configurations shift. Pre-production evaluation can't cover the combinatorial space of production context. Only runtime governance with deterministic context compilation can ensure decision-grade context at execution time.
These gaps exist because point solutions don't share a common execution model. They don't share context, policy state, or provenance. Each tool operates independently, creating seams where the five failure modes of ungoverned agent execution occur.
FAQ: Can't I integrate these point solutions together via APIs?
API integration connects data flows. It doesn't create a shared execution model with unified context, policy state, and provenance. The composition problem is architectural, not one of integration.
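The first scenario above — an output filter firing only after the tool call has committed — shows why guardrails and tool execution need a shared commit protocol. A minimal sketch of staged commits, with all function names as illustrative assumptions: effects are staged during the agent step, validated, and only then committed (or discarded).

```python
def run_with_staged_commit(agent_step, validators):
    """Stage tool effects, validate the output, then commit or roll back.
    A sketch of a commit protocol shared by guardrails and tool execution."""
    staged = []  # list of (commit_fn, rollback_fn) pairs

    def stage(commit_fn, rollback_fn):
        # The agent registers effects instead of committing them directly.
        staged.append((commit_fn, rollback_fn))

    output = agent_step(stage)
    if all(check(output) for check in validators):
        for commit_fn, _ in staged:
            commit_fn()           # validation passed: effects commit
        return output
    for _, rollback_fn in reversed(staged):
        rollback_fn()             # validation failed: nothing commits
    raise ValueError("output failed validation; staged effects discarded")
```

Under this protocol, a guardrail rejection means the payment was never processed — not that the response to the user was suppressed after the fact.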
What Does Layer 5 — the Governed Execution Runtime — Provide?
Layer 5 is the Governed Agent Runtime — the control layer that composes the other four layers into a unified execution plane. It is not another point solution. It is the execution substrate that makes observability, guardrails, gateways, and evaluation work together as a governed production system.
Every agent action flows through the canonical six-step runtime loop:
1. Request — with identity and scope attached
2. Compile Context — deterministic context compilation from systems of record
3. Evaluate Policy — ABAC/ReBAC policy enforcement producing allow/modify/approve/block
4. Execute (Controlled) — tool calls through the Tool Broker with staged commits, idempotency, and isolation
5. Decision Trace — evidence-grade record capturing the full provenance chain
6. Improve — trace feeds feedback loops for regression detection and policy tuning
The five execution primitives work together because they share a common execution model — unified context, policy state, and provenance across every step.
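The six steps above can be sketched as a single function where each stage is pluggable. This is an illustration of the loop's shape under assumed signatures, not the runtime's actual API; step 1 (Request) is the identity-bearing request object passed in.

```python
def runtime_loop(request, compile_context, evaluate_policy,
                 broker, record_trace, improve):
    """Minimal sketch of the six-step loop. Every stage sees the same
    request, context, and verdict — the shared execution model."""
    ctx = compile_context(request)                   # 2. Compile Context
    verdict = evaluate_policy(request, ctx)          # 3. Evaluate Policy
    result = (broker(request, ctx)                   # 4. Execute (Controlled)
              if verdict == "allow" else None)
    trace = record_trace(request, ctx, verdict, result)  # 5. Decision Trace
    improve(trace)                                   # 6. Improve
    return result, trace
```

Note that the trace is produced from the same `ctx` and `verdict` the execution saw, which is precisely what independent point solutions cannot guarantee.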
How Does Layer 5 Enhance the Other Four Layers?
| Existing Layer | Without Layer 5 | With Layer 5 (Governed Execution Runtime) |
|---|---|---|
| Observability (LangSmith) | Traces capture LLM calls and tool invocations | Traces enriched with policy evaluations, context provenance, and authority verification |
| Guardrails (Guardrails AI) | Filters run after agent decides and tool calls execute | Policy gates enforce governance before execution — guardrails become a secondary safety net, not the primary control |
| Gateway (Portkey) | Routes LLM API calls based on cost and latency | Routing decisions informed by runtime policy — model selection considers governance requirements, not just cost |
| Evaluation (Braintrust) | Test suites run against synthetic benchmarks in pre-production | Evaluation fed by production decision traces — continuous improvement against real outcomes via Agentic Context Engineering |
FAQ: Does Build Agents replace LangSmith, Portkey, or Guardrails AI?
No. It provides the execution substrate they plug into. Your existing tools become more effective because they operate within a runtime that provides shared context, policy state, and provenance.
Why Is Framework Neutrality an Architectural Requirement?
Build Agents works with any agent framework — LangGraph, CrewAI, AutoGen, Semantic Kernel, Haystack, or custom orchestration logic. Your agents reason in your framework. They execute through the governed runtime.
This is a deliberate architectural choice driven by two principles:
- First, frameworks evolve rapidly. New reasoning approaches, orchestration patterns, and multi-agent architectures emerge monthly. Coupling governance to a specific framework means rebuilding compliance, security, and audit infrastructure every time you change reasoning tools.
- Second, governance requirements are more stable than reasoning patterns. The need for policy enforcement, decision traces, tenant isolation, and budget controls doesn't change when you switch from LangGraph to CrewAI. By separating reasoning from execution governance, enterprises can adopt new frameworks without rebuilding their Decision Infrastructure.
This is the same architectural principle behind Context OS: the operating layer manages context, policy, authority, and evidence independently of the reasoning framework above it and the enterprise systems below it.
FAQ: What if we switch frameworks next year?
That's exactly why framework neutrality matters. Your governance infrastructure, decision traces, policies, and compliance records remain intact. Only the reasoning layer changes.
Conclusion: From Point Solutions to a Governed Execution Plane
The AI agent production stack is not a single-vendor problem. Observability, guardrails, gateways, and evaluation each solve real challenges. But assembled as independent point solutions, they leave structural gaps where the five failure modes of ungoverned agent execution occur.
The missing layer is the Governed Execution Runtime — the substrate that composes the other four layers into a unified execution plane with shared context, policy state, and provenance. This is Layer 5: the Decision Infrastructure that turns an assembly of tools into a governed production system.
For enterprise teams evaluating their agent production stack, the question isn't which point solutions to buy. It's whether the stack has a common execution model — and whether every agent action flows through governed context, enforced Decision Boundaries, and recorded evidence.
Most vendors cover one layer. Context OS composes them into a governed execution plane: deterministic context compilation + policy enforcement + action commit protocol + decision ledger + feedback loops.