If you're evaluating how to run AI agents in production, you've encountered a growing list of vendors. LangSmith for tracing. Portkey for gateway and observability. Guardrails AI for output validation. Braintrust for evaluation. And dozens more offering pieces of the production puzzle.
The challenge isn't finding tools. It's understanding what category each tool belongs to, what architectural layer it addresses, and what gaps remain when you assemble them into a stack.
Running agents in production requires five distinct capabilities. Most vendors cover one. Some cover two. Almost none compose all five into a unified execution plane. Understanding the layers helps enterprise teams build the right stack — and identify the Decision Infrastructure gap that no point solution addresses.
The following table maps the complete production stack, from observability through governed execution. Each layer solves a specific problem — but each also has structural limits that the next layer must address.
| Layer | Function | Representative Vendors | Structural Limitation |
|---|---|---|---|
| 1. Observability & Tracing | Captures logs, traces, and metrics from agent execution | LangSmith, LangFuse, Arize, Helicone, Portkey (partial) | Shows what happened after the fact — cannot prevent bad actions from committing |
| 2. Guardrails & Output Validation | Validates agent outputs against predefined rules | Guardrails AI, NeMo Guardrails, Rebuff, Lakera | Runs after the agent has decided — cannot enforce policy during reasoning or tool execution |
| 3. Gateway & Routing | Manages LLM API calls with load balancing, fallbacks, caching, cost tracking | Portkey, LiteLLM, Martian, Not Diamond | Operates at the LLM API layer — cannot enforce business policy on tool calls or provide decision traces |
| 4. Evaluation & Testing | Tests agent behavior against benchmarks, regressions, and quality metrics | Braintrust, Promptfoo, Patronus AI, DeepEval | Runs in pre-production — cannot enforce governance at runtime or produce production-grade traces |
| 5. Governed Execution Runtime | Turns agent reasoning into deterministic, auditable execution with policy enforcement, tool brokering, and decision traces | ElixirData (Build Agents) | Composes the other four layers into a unified execution plane |
FAQ: Do I need all five layers?
Yes, for production-grade enterprise deployments. The first four are necessary but insufficient. Layer 5 provides the execution substrate that makes them work together as a governed system rather than independent point solutions.
Layer 1: Observability & Tracing
What it is: Capturing logs, traces, and metrics from agent execution for debugging and performance monitoring.
Vendors: LangSmith, LangFuse, Arize, Helicone, Portkey (partial).
What it solves: You can see what happened. You can debug failures. You can measure latency, token usage, and cost per call. For teams moving from experimentation to early deployment, observability is the first layer most adopt — and it's genuinely valuable.
What it doesn't solve: Observability is inherently reactive. It tells you what happened after the action has committed. It cannot block a bad action before it commits, enforce policy during reasoning or tool execution, or explain why a decision was made.
As we documented in the silent failure mode, an agent can show 100% task completion and zero errors in observability dashboards while producing systematically wrong outcomes. Observability captures system health — not decision quality.
FAQ: Can't LangSmith traces serve as audit evidence?
LangSmith traces capture LLM calls and tool invocations. They don't capture policy evaluations, authority verification, context provenance, or the institutional reasoning chain that regulators and auditors require. Decision traces and operational logs are fundamentally different data structures.
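The distinction between operational logs and decision traces can be made concrete. The sketch below contrasts the two record shapes; every field name is illustrative, not LangSmith's or any vendor's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TraceEvent:
    """Operational log entry: records what happened (observability layer)."""
    span_id: str
    operation: str          # e.g. "llm.call" or "tool.invoke"
    latency_ms: float
    tokens: int

@dataclass
class DecisionRecord:
    """Decision trace entry: records why the action was allowed to happen."""
    decision_id: str
    action: str
    policy_evaluations: list    # every policy checked, with its verdict
    authority: str              # who or what authorized the action
    context_provenance: list    # sources the context was compiled from

def is_audit_sufficient(record) -> bool:
    """An auditor needs the reasoning chain, not just timings and token counts."""
    return all(hasattr(record, f) for f in
               ("policy_evaluations", "authority", "context_provenance"))
```

A `TraceEvent` answers "how fast, how much"; a `DecisionRecord` answers "under whose authority, against which policies" — which is why one cannot substitute for the other as audit evidence.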
Layer 2: Guardrails & Output Validation
What it is: Validating agent outputs against predefined rules before they reach the user or a downstream system.
Vendors: Guardrails AI, NeMo Guardrails, Rebuff, Lakera.
What it solves: You can catch obviously bad outputs — PII leakage, toxic content, format violations, factual inconsistencies. For customer-facing agents, output validation is a necessary safety net.
What it doesn't solve: Output validation is a filter, not a runtime. It runs after the agent has already decided what to do and often after tool calls have already executed. It cannot enforce policy during reasoning, block an unauthorized tool call before it runs, or prove compliance to a regulator.
As the guardrails objection analysis in Part 1 explains, post-hoc guardrails fail for three structural reasons: they're reactive, they don't compose across failure modes, and they cannot prove compliance to regulators. Policy enforcement must happen before execution, not after.
FAQ: Can guardrails prevent prompt injection attacks?
Guardrails can detect some injection patterns in outputs. They cannot prevent an injected instruction from causing unauthorized tool calls during agent reasoning. That requires policy and authority enforcement at the execution layer — Layer 5.
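Pre-execution enforcement can be illustrated with a minimal sketch. The policy table, role, and tool names below are hypothetical; a real runtime would also verify authority and context provenance.

```python
# Hypothetical allowlist: which tools each agent role may invoke.
ALLOWED_TOOLS = {
    "support-agent": {"lookup_order", "draft_reply"},   # read-only; no refund tool
}

class PolicyViolation(Exception):
    pass

def broker_tool_call(agent_role: str, tool: str, args: dict) -> dict:
    """Gate every tool call BEFORE it executes. An injected instruction that
    asks for an out-of-policy tool fails the same check as any other request,
    regardless of why the model asked for it."""
    if tool not in ALLOWED_TOOLS.get(agent_role, set()):
        raise PolicyViolation(f"{agent_role} may not call {tool}")
    # In a real runtime the call would be executed and its result committed here.
    return {"tool": tool, "args": args, "status": "executed"}
```

The point of the sketch: a post-hoc output filter never sees this decision point, because by the time it runs the tool call has already happened.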
Layer 3: Gateway & Routing
What it is: Managing API calls between agents and LLM providers with load balancing, fallbacks, caching, and cost tracking.
Vendors: Portkey, LiteLLM, Martian, Not Diamond.
What it solves: You can manage multi-model deployments, reduce costs through semantic caching, route requests to optimal providers based on latency or cost, and maintain fallback chains when a provider is unavailable.
What it doesn't solve: Gateways operate at the LLM API layer — between the agent and the model. They have no visibility into what the agent does with the model's output. They cannot enforce business policy on tool calls, meter the cost of tool execution, or produce the decision traces auditors require.
Gateways are essential infrastructure for LLM cost management and reliability. But they address model access, not action governance. The zero-trust gateway pattern described in our pillar page operates at the tool execution layer, not the LLM API layer — a fundamentally different architectural position.
FAQ: Can Portkey's cost tracking replace runtime budget enforcement?
Portkey tracks LLM API costs. Tool execution costs — external API calls, compute, database operations — are invisible to LLM gateways. Runtime budget enforcement must operate at Layer 5, where all tool calls are brokered.
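The budget argument can be sketched in a few lines. This is an illustrative toy, with made-up costs, showing why metering must sit where tool calls are brokered rather than at the LLM gateway.

```python
class BudgetExceeded(Exception):
    pass

class MeteredBroker:
    """Toy broker that meters *tool* costs (external APIs, compute, database
    operations) — spend that an LLM-only gateway never observes."""

    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def execute(self, tool: str, cost_usd: float) -> str:
        # Enforce the budget before the action runs, not after the bill arrives.
        if self.spent + cost_usd > self.budget:
            raise BudgetExceeded(f"{tool} would exceed ${self.budget:.2f} budget")
        self.spent += cost_usd      # the spend commits with the action
        return f"{tool}: ok"
```

Because the check precedes execution, a runaway reasoning loop is halted at the first over-budget tool call instead of being discovered in next month's invoice.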
Layer 4: Evaluation & Testing
What it is: Testing agent behavior against benchmarks, regression suites, and quality metrics — primarily in pre-production environments.
Vendors: Braintrust, Promptfoo, Patronus AI, DeepEval.
What it solves: You can measure agent quality before deployment. You can detect regressions when prompts, models, or configurations change. You can score outputs against ground truth and identify quality degradation.
What it doesn't solve: Evaluation is predominantly a pre-production activity. It cannot enforce governance at runtime, catch failures that only emerge against live context and data, or produce production-grade decision traces.
The feedback loops primitive in a Governed Agent Runtime closes this gap by routing production decision traces into evaluation pipelines automatically — enabling continuous improvement measured against real outcomes, not synthetic benchmarks. This is what ElixirData calls Agentic Context Engineering.
FAQ: Can evaluation tools detect production failures?
Pre-production evaluation can catch some failure patterns. But silent failures caused by stale production context, cost blowups from novel reasoning loops, and systemic risk from prompt injections only manifest in production. Runtime governance is required.
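The feedback-loop idea described above can be sketched simply: committed production decision traces become evaluation cases automatically. The trace fields and function names here are assumptions for illustration, not a real pipeline API.

```python
def trace_to_eval_case(trace: dict) -> dict:
    """Turn a committed decision trace into a regression case that scores
    future agent behavior against the real, measured outcome."""
    return {
        "input": trace["context"],
        "action_taken": trace["action"],
        "expected_outcome": trace["outcome"],   # measured in production
    }

def build_eval_suite(traces: list) -> list:
    # Only traces with an outcome label are usable as ground truth;
    # in-flight or unlabeled decisions are skipped.
    return [trace_to_eval_case(t) for t in traces if t.get("outcome")]
```

This closes the loop the table's Layer 4 limitation describes: evaluation stops being a one-time pre-production gate and becomes a continuous process fed by real outcomes.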
Most enterprises attempt to assemble their agent production stack from point solutions. They add LangSmith for tracing, Guardrails AI for output validation, Portkey for gateway management, and Braintrust for evaluation. Each tool solves its layer well.
But the gaps between layers are where production failures live.
Consider three scenarios that point-solution stacks cannot prevent: a silent failure driven by stale production context, a cost blowup from a novel reasoning loop, and an unauthorized action triggered by prompt injection.
These gaps exist because point solutions don't share a common execution model. They don't share context, policy state, or provenance. Each tool operates independently, creating seams where the five failure modes of ungoverned agent execution occur.
FAQ: Can't I integrate these point solutions together via APIs?
API integration connects data flows. It doesn't create a shared execution model with unified context, policy state, and provenance. The composition problem is architectural, not a matter of integration.
Layer 5 is the Governed Agent Runtime — the control layer that composes the other four layers into a unified execution plane. It is not another point solution. It is the execution substrate that makes observability, guardrails, gateways, and evaluation work together as a governed production system.
Every agent action flows through the canonical six-step runtime loop: compile context deterministically, evaluate policy, broker the tool call, commit the action, record the decision to the ledger, and route the outcome into feedback loops.
The five execution primitives (deterministic context compilation, policy enforcement, the action commit protocol, the decision ledger, and feedback loops) work together because they share a common execution model: unified context, policy state, and provenance across every step.
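The runtime loop described above can be rendered as a toy function. Every step name and data shape below is an assumption for illustration, not ElixirData's actual API; the point is that all six steps share one context, one verdict, and one ledger.

```python
def run_governed_action(request, *, compile_context, check_policy,
                        broker, commit, ledger, feedback):
    ctx = compile_context(request)            # 1. deterministic context compilation
    verdict = check_policy(request, ctx)      # 2. policy gate, before any execution
    if not verdict["allowed"]:
        # Denials are evidence too: recorded, never silently dropped.
        ledger.append({"request": request, "verdict": verdict})
        return None
    result = broker(request, ctx)             # 3. brokered tool execution
    commit(result)                            # 4. action commit protocol
    entry = {"request": request, "verdict": verdict, "result": result}
    ledger.append(entry)                      # 5. decision ledger
    feedback(entry)                           # 6. feedback loop into evaluation
    return result
```

Note what the shape buys you: the policy verdict, the execution result, and the ledger entry are one record, so observability, guardrails, and evaluation all see the same provenance rather than three disconnected event streams.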
| Existing Layer | Without Layer 5 | With Layer 5 (Governed Execution Runtime) |
|---|---|---|
| Observability (LangSmith) | Traces capture LLM calls and tool invocations | Traces enriched with policy evaluations, context provenance, and authority verification |
| Guardrails (Guardrails AI) | Filters run after agent decides and tool calls execute | Policy gates enforce governance before execution — guardrails become a secondary safety net, not the primary control |
| Gateway (Portkey) | Routes LLM API calls based on cost and latency | Routing decisions informed by runtime policy — model selection considers governance requirements, not just cost |
| Evaluation (Braintrust) | Test suites run against synthetic benchmarks in pre-production | Evaluation fed by production decision traces — continuous improvement against real outcomes via Agentic Context Engineering |
FAQ: Does Build Agents replace LangSmith, Portkey, or Guardrails AI?
No. It provides the execution substrate they plug into. Your existing tools become more effective because they operate within a runtime that provides shared context, policy state, and provenance.
Build Agents works with any agent framework — LangGraph, CrewAI, AutoGen, Semantic Kernel, Haystack, or custom orchestration logic. Your agents reason in your framework. They execute through the governed runtime.
This is a deliberate architectural choice driven by two principles:
First, frameworks evolve rapidly. New reasoning approaches, orchestration patterns, and multi-agent architectures emerge monthly. Coupling governance to a specific framework means rebuilding compliance, security, and audit infrastructure every time you change reasoning tools.
Second, governance requirements are more stable than reasoning patterns. The need for policy enforcement, decision traces, tenant isolation, and budget controls doesn't change when you switch from LangGraph to CrewAI. By separating reasoning from execution governance, enterprises can adopt new frameworks without rebuilding their Decision Infrastructure.
This is the same architectural principle behind Context OS: the operating layer manages context, policy, authority, and evidence independently of the reasoning framework above it and the enterprise systems below it.
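The separation argued for above can be sketched in miniature: whatever shape the framework's tool step takes, it delegates to one stable execution substrate. The class and the two framework-style wrappers are hypothetical stand-ins, not real LangGraph or CrewAI interfaces.

```python
class GovernedExecutor:
    """Stand-in for the execution substrate; it outlives framework changes."""

    def __init__(self):
        self.ledger = []                      # decision records persist across frameworks

    def execute(self, tool: str, args: dict) -> dict:
        record = {"tool": tool, "args": args}
        self.ledger.append(record)            # evidence captured framework-agnostically
        return {"status": "executed", **record}

def langgraph_style_step(executor, tool, args):
    # One framework's tool-call shape: direct (tool, args) pairs.
    return executor.execute(tool, args)

def crewai_style_step(executor, task):
    # A different framework's shape: a task dict. Same substrate underneath.
    return executor.execute(task["tool"], task["args"])
```

Swapping the reasoning layer swaps only the thin wrapper; the ledger, policies, and compliance records accumulated in the executor remain intact.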
FAQ: What if we switch frameworks next year?
That's exactly why framework neutrality matters. Your governance infrastructure, decision traces, policies, and compliance records remain intact. Only the reasoning layer changes.
The AI agent production stack is not a single-vendor problem. Observability, guardrails, gateways, and evaluation each solve real challenges. But assembled as independent point solutions, they leave structural gaps where the five failure modes of ungoverned agent execution occur.
The missing layer is the Governed Execution Runtime — the substrate that composes the other four layers into a unified execution plane with shared context, policy state, and provenance. This is Layer 5: the Decision Infrastructure that turns an assembly of tools into a governed production system.
For enterprise teams evaluating their agent production stack, the question isn't which point solutions to buy. It's whether the stack has a common execution model — and whether every agent action flows through governed context, enforced Decision Boundaries, and recorded evidence.
Most vendors cover one layer. Context OS composes them into a governed execution plane: deterministic context compilation + policy enforcement + action commit protocol + decision ledger + feedback loops.