Three-model architecture, Context OS, and the Plan.MD pipeline
By ElixirData Engineering | March 2026
Most teams adopting AI coding assistants in 2026 follow the same playbook: subscribe to a cloud API, plug it into the IDE, and hope for the best. The results are often underwhelming. The AI generates code that does not fit the codebase, calls APIs that do not exist, and applies patterns that were deprecated months ago. Developers spend as much time correcting the output as they would have spent writing it themselves.
This cycle is familiar enough that it has a name: the Frustration Loop. Generate code, review it, find it does not fit, regenerate with corrections, review again, and eventually either accept heavily modified output or abandon the attempt entirely.
ElixirData faced these problems — and then added another constraint: the entire system has to run offline, inside an air-gapped development network, with zero data leaving the perimeter. No cloud fallback. No external API calls. No live documentation retrieval.
This constraint turned out to be clarifying. It forced the team to confront a truth that cloud-connected teams can afford to ignore: the quality of an AI coding assistant is determined not by the model it runs, but by the context infrastructure underneath it — and more specifically, by the quality of the plan the model works from.
The thesis: The model is the engine. Context is the fuel. The plan is the route. Better fuel and a better route produce better output — regardless of the engine.
A typical AI coding assistant operates with two sources of knowledge: training data (static, frozen at a cutoff date) and whatever context fits in the current prompt (dynamic but shallow). This creates two distinct failures that produce the same downstream result: bad code.
Internal context failure: The assistant does not understand the codebase. It does not know the project uses Fastify instead of Express, that files go in lib/services/ instead of utils/, or that the codebase follows a functional style. When it needs to understand the authentication flow, it reads files one by one, burning hundreds of thousands of tokens for what a structured index could answer in a few hundred.
External context failure: The assistant does not have current knowledge of libraries and APIs. It falls back on training data — which may be months or years stale. For fast-moving libraries, this means code that calls methods that no longer exist, passes parameters in the wrong order, or misses simpler approaches added in recent versions.
Project-reported benchmarking from codebase-memory-mcp shows approximately 3,400 tokens via structured indexing versus approximately 412,000 tokens via file-by-file grep for the same five structural queries — roughly a 120x reduction.
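The mechanics behind that gap can be sketched in a few lines. The payloads and the rough four-characters-per-token heuristic below are hypothetical illustrations, not the benchmark's actual data; real numbers depend on the tokenizer and the codebase.

```python
# Rough illustration of why structural queries are cheaper than file dumps.
# The sample payloads and the ~4 chars/token heuristic are hypothetical.

def rough_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for code."""
    return max(1, len(text) // 4)

# A structured index answers "where is the auth flow?" with a compact
# list of symbols and locations...
index_answer = (
    "login() -> src/auth/login.py:12\n"
    "verify_token() -> src/auth/jwt.py:48\n"
    "refresh_session() -> src/auth/session.py:91\n"
)

# ...whereas file-by-file reading pulls entire files into the context
# window. Here: 50 hypothetical files of ~8 KB each.
raw_files = "\n".join("x" * 8_000 for _ in range(50))

print(rough_tokens(index_answer), "tokens via structured index")
print(rough_tokens(raw_files), "tokens via raw file reads")
```

The orders-of-magnitude gap, not the exact figures, is the point: the index returns an answer, while file reads return material from which an answer must still be extracted.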
These failures compound. As Dex Horthy of HumanLayer articulated: a bad line of code is a bad line of code, but a bad line in a plan leads to hundreds of bad lines of code, and bad research leads to thousands. Upstream errors cascade and amplify downstream.
If planning quality is the primary lever — if the accuracy of the plan determines the quality of everything downstream — then the architecture should optimize for planning quality above all else.
Part of the problem lies in how success is measured. Teams often track time to first output or lines of code generated. If an AI generates 200 lines in seconds but a developer spends 30 minutes refactoring them, the net productivity gain is questionable.
| Misleading metric | More useful alternative |
|---|---|
| Time to first output | First-pass acceptance rate |
| Lines of code generated | Iteration cycles per task |
| Tasks completed | Post-merge rework required |
| Generation speed | Review burden vs manual writing |
ElixirData's solution inverts the conventional single-model approach. Three models operate in a pipeline, each optimized for a different cognitive task.
Request → [Planner] → Plan.MD → [Generator] → Code → [Completer] → Edits
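The pipeline above can be sketched as three chained stages. The function names and the `call_model` stub below are illustrative assumptions standing in for real inference calls, not ElixirData's actual code.

```python
# Minimal sketch of the three-stage pipeline shown in the diagram.
# call_model() is a stub for a routed inference call (e.g. via vLLM).

def call_model(tier: str, prompt: str) -> str:
    """Stub: send `prompt` to the model serving this tier."""
    return f"[{tier} output for: {prompt[:40]}...]"

def plan(request: str, codebase_context: str) -> str:
    """Tier 1 (Planner): research and reasoning, emits Plan.MD."""
    return call_model("planner", f"{codebase_context}\n\nREQUEST: {request}")

def generate(plan_md: str) -> str:
    """Tier 2 (Generator): execute the plan, no architecture decisions."""
    return call_model("generator", f"PLAN:\n{plan_md}\n\nWrite the code.")

def handle_request(request: str, codebase_context: str) -> str:
    plan_md = plan(request, codebase_context)
    # A developer can review plan_md at this point, before code exists.
    return generate(plan_md)

print(handle_request("add rate limiting", "index summary: gateway, auth"))
```

The structural point is that the plan is a first-class intermediate value: it can be inspected, edited, or rejected between the two calls.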
| Property | Specification |
|---|---|
| Model | Qwen3-Coder-480B (35B active parameters, MoE) |
| Role | Research codebase, reason about architecture, produce Plan.MD |
| License | Apache 2.0 |
| Context window | 256K–1M tokens |
| Latency target | 30–60 seconds (thoroughness over speed) |
| Token budget | Up to 128K (deep codebase understanding) |
| Cloud alternative | Claude Opus 4.6 for non-sensitive work outside the air gap |
The planner is the most important model in the system. Its job is to research the codebase, understand the request, reason about architectural implications, and produce a detailed Plan.MD that specifies exactly what to build, where to build it, which patterns to follow, and which edge cases to handle.
For teams with cloud access for non-sensitive work, Claude Opus 4.6 is the ideal planner. Its adaptive thinking, 500K–1M token context window, and frontier-level reasoning make it the strongest option available. Use it outside the air gap for architecture research and specification generation, then transfer the outputs via secure media.
| Property | Specification |
|---|---|
| Model | GLM-5 (744B total, 40B active parameters, MoE) |
| Role | Execute Plan.MD — write the actual code |
| License | MIT |
| SWE-bench Verified | 77.8% (best open-source) |
| Latency target | < 15 seconds first token |
| Token budget | Up to 64K (Plan.MD has narrowed the scope) |
| Alternatives | Kimi K2.5 (99% HumanEval, 32B active) or DeepSeek-V3.2 (MIT) |
The generator receives Plan.MD and executes against it. It does not need to reason about architecture — the planner already did that. It translates a precise specification into correct, convention-following code.
When Plan.MD is accurate — when it correctly identifies files, patterns, edge cases, and conventions — even a mid-tier generation model will produce correct code. The plan has already done the hard cognitive work.
| Property | Specification |
|---|---|
| Model | Qwen3-Coder-Next (80B total, 3B active parameters, MoE) |
| Role | Fast inline completions and small edits |
| License | Apache 2.0 |
| SWE-bench | 70.6% (comparable to models 10–20x larger) |
| Latency target | < 500 milliseconds |
| Token budget | Up to 8K (immediate editing context only) |
| What it handles | Ghost text, autocomplete, FIM, single-function refactors |
The completer handles 80% of daily interactions by volume. It does not need Plan.MD or full context retrieval — it needs the current file, the imports, and the path-scoped rules.
Plan.MD is the critical artifact in the pipeline. It serves as an explicit, inspectable, human-reviewable contract between the planner and the generator. A well-formed Plan.MD includes:
Research findings: What the planner learned about the codebase (with file paths and line numbers)
Files to create and modify: Specific locations, not vague module references
Architecture decisions: Chosen approach with rationale
Edge cases to handle: Failure modes, boundary conditions, exceptional flows
Test plan: What to test and how
Conventions to follow: Referenced from CLAUDE.md with specific examples
The specificity matters. "Modify the auth module" is not a plan. "Add rate limiting check in src/gateway/middleware.py after line 47, following the decorator pattern used in auth.py:23-47" is a plan. The more specific the plan, the less the generator needs to guess — and guessing is where models fail.
Critically, Plan.MD is a human-readable artifact. A developer can review it before generation begins and catch errors at the plan stage, where they are cheap to fix, rather than at the code stage, where they cascade.
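As a concrete illustration, a minimal Plan.MD for the rate-limiting example above might look like the following. The file paths, line numbers, and section wording are hypothetical, extending the example in the text, not a real ElixirData artifact.

```markdown
# Plan.MD — Add rate limiting to API gateway

## Research findings
- Request handling enters at src/gateway/middleware.py (middleware chain)
- Existing cross-cutting checks use the decorator pattern in auth.py:23-47

## Files to modify
- src/gateway/middleware.py — insert rate-limit check after line 47

## Architecture decision
- Token-bucket limiter implemented as a decorator, matching the auth
  decorator; no new dependency introduced

## Edge cases
- Burst traffic at bucket boundary; clock skew between gateway instances;
  shape of the 429 response body

## Test plan
- Unit: bucket refill arithmetic
- Integration: 429 returned after N requests per minute

## Conventions
- Follow CLAUDE.md error-response format; decorators live beside middleware.py
```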
The Context OS is the orchestration layer that ensures each model sees the right context at the right time. It manages three resources that behave like operating system primitives.
Token budget as memory: A model's context window is finite. The Context OS allocates space to the most relevant context, evicts stale information, and ensures totals stay within limits. Dumping everything degrades performance the same way memory swapping does.
Retrieval as I/O: Fetching context from the codebase index, documentation packs, and convention rules is analogous to disk I/O. The Context OS caches frequently accessed context, prefetches likely-needed documentation, and batches retrieval calls.
Model routing as process scheduling: Completion requests need speed (sub-500ms). Generation requests need quality (15 seconds). Planning requests need thoroughness (30+ seconds). The Context OS routes each request to the appropriate model.
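The scheduling analogy can be made concrete. The tier names, context budgets, and latency targets below mirror the figures given in this article; the routing logic itself is a simplified illustrative assumption.

```python
# Sketch of model routing as process scheduling. Budgets and latency
# targets follow the tier tables above; the dispatch logic is illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    model: str
    max_context_tokens: int
    latency_target_ms: int

TIERS = {
    "completion": Tier("Qwen3-Coder-Next", 8_000, 500),
    "generation": Tier("GLM-5", 64_000, 15_000),
    "planning":   Tier("Qwen3-Coder-480B", 128_000, 60_000),
}

def route(request_kind: str) -> Tier:
    """Pick the serving tier for a request; unknown kinds fail loudly."""
    try:
        return TIERS[request_kind]
    except KeyError:
        raise ValueError(f"unknown request kind: {request_kind!r}")

print(route("completion").model)             # sub-500 ms tier
print(route("planning").max_context_tokens)  # deep-context tier
```

Keeping the table declarative means adding a tier, or swapping a model behind one, is a one-line change rather than a code change.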
When a developer makes a request, the Context OS assembles context from three sources:
Codebase index: Persistent representation of the entire codebase using Tree-sitter for AST parsing and Qdrant for vector storage. Code split at meaningful boundaries — functions, classes, logical blocks. Updated incrementally via Git hooks on every merge to main.
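Splitting at meaningful boundaries is the key idea. Production uses Tree-sitter precisely because it parses many languages; as a single-language stand-in, the sketch below uses Python's stdlib `ast` module to chunk at function and class boundaries. Embedding and the Qdrant upsert are deliberately stubbed out.

```python
# Boundary-aware chunking sketch. Tree-sitter generalizes this across
# languages; Python's stdlib `ast` stands in here for one language.

import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split source at function/class boundaries, not fixed byte sizes."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks  # each chunk would then be embedded and upserted to Qdrant

SAMPLE = '''\
def login(user):
    return user

class Session:
    def refresh(self):
        pass
'''

for c in chunk_python_source(SAMPLE):
    print(c["symbol"], c["start_line"], c["end_line"])
```

Because each chunk carries a symbol name and line range, the index can answer structural queries ("where is `Session` defined?") without shipping whole files into the context window.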
Documentation packs: Curated, version-pinned documentation for key dependencies as markdown files optimized for LLM consumption. Also includes actual library source code for critical dependencies. Version-pinned to exact lockfile versions.
Convention rules: Structured encoding of coding standards following Claude Code's hierarchy: CLAUDE.md (always loaded), path-scoped rules (loaded per file type), and skills (lazy-loaded bundles of instructions and resources).
Claude Code's system prompts are fully documented in the open-source repository Piebald-AI/claude-code-system-prompts. As of March 2026 (version 2.1.84), this includes the core system prompt (2,896 tokens), 18 built-in tool descriptions, plan subagent prompt (636 tokens), explore subagent prompt (494 tokens), task subagent prompt (294 tokens), approximately 40 system reminders, and tracking across 133 versions.
| Claude Code component | ElixirData mapping | Implementation |
|---|---|---|
| Plan subagent (636 tok) | Tier 1 — Planner | Adapted as system prompt for Qwen3-Coder-480B. Produces Plan.MD as standalone artifact. |
| Explore subagent (494 tok) | Tier 1 — Planner explore mode | Read-only research mode for understanding codebase structure before planning. |
| Task subagent (294 tok) | Tier 2 — Generator | Receives Plan.MD sections as individual tasks. Executes with TodoWrite tracking. |
| Core prompt (2,896 tok) | All tiers — adapted per tier | Conciseness mandate, tool-first approach, file reference format across all models. |
Conciseness mandate: Claude Code instructs the model to "answer concisely with fewer than 4 lines unless the user asks for detail" and to "minimize output tokens." Directly adopted — verbose output wastes developer time and context window space.
TodoWrite pattern: Claude Code forces the agent to maintain a task list. ElixirData implements this at the pipeline level: Plan.MD is converted into a todo list that the generator works through sequentially, with real-time progress visible to the developer.
Hooks system: Claude Code triggers scripts at lifecycle events. ElixirData implements: post-edit hooks (run linter), post-generation hooks (trigger tests), post-session hooks (log to feedback flywheel), and index-update hooks (re-index after merge).
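A lifecycle hook registry of this shape is small enough to sketch. The event names match the four hooks listed above; the register/fire API and the handlers are illustrative assumptions.

```python
# Sketch of a lifecycle hooks registry. Event names follow the four
# hooks described above; the API itself is an illustrative assumption.

from collections import defaultdict
from typing import Callable

HOOKS: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def on(event: str):
    """Decorator: register a callback for a lifecycle event."""
    def register(fn: Callable[[dict], None]):
        HOOKS[event].append(fn)
        return fn
    return register

def fire(event: str, payload: dict) -> None:
    """Run every callback registered for `event`, in order."""
    for fn in HOOKS[event]:
        fn(payload)

@on("post-edit")
def run_linter(payload: dict) -> None:
    print(f"linting {payload['file']}")

@on("post-generation")
def trigger_tests(payload: dict) -> None:
    print(f"running tests for {payload['file']}")

fire("post-edit", {"file": "src/gateway/middleware.py"})
```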
The meta-strategy: Use Claude itself (via API, outside the air gap for non-sensitive work) to write and refine the CLAUDE.md, rules, skills, and system prompts that the local models execute. Using the strongest model to configure the others is the highest-leverage approach.
CLAUDE.md (always loaded): Project-wide conventions — tech stack with versions, directory structure, naming conventions, coding standards. Start with 5–10 conventions and build iteratively.
Path-scoped rules (loaded per file type): Convention guidance triggered by file glob patterns. Example: *.test.ts triggers test-specific rules about Vitest. Keeps always-loaded CLAUDE.md small.
Skills (lazy-loaded on demand): Rich bundles of instructions, documentation, scripts, and examples. Each has a SKILL.md with a description the model uses to decide when to load it.
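The hierarchy above reduces to a simple loading rule: always include CLAUDE.md, then add whatever path-scoped rules match the file being edited. The glob patterns and rule file paths below are hypothetical examples, not ElixirData's actual configuration.

```python
# Sketch of path-scoped rule loading via glob matching. The patterns and
# rule paths are hypothetical; only the loading rule mirrors the text.

from fnmatch import fnmatch

PATH_RULES = {
    "*.test.ts": "rules/testing-vitest.md",
    "src/gateway/*.py": "rules/gateway-python.md",
    "*.sql": "rules/migrations.md",
}

def rules_for(path: str) -> list[str]:
    """CLAUDE.md is always loaded; path-scoped rules join on match."""
    loaded = ["CLAUDE.md"]
    loaded += [rule for pattern, rule in PATH_RULES.items()
               if fnmatch(path, pattern)]
    return loaded

print(rules_for("checkout.test.ts"))
print(rules_for("src/gateway/middleware.py"))
```

The payoff is exactly the one the text describes: the always-loaded file stays small, because everything file-type-specific loads only when a matching file is in play.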
| Component | Specification | Purpose |
|---|---|---|
| GPU nodes (Planner + Gen) | 2x NVIDIA A100 80GB or H100 | One for planner, one for generator. MoE architectures use VRAM efficiently. |
| GPU node (Completer) | 1x A100 80GB or RTX 4090 24GB | Dedicated to inline completions. RTX 4090 is the cost-optimized option. |
| CPU server | 32+ cores, 128GB RAM, 2TB NVMe | Context OS: indexer, Qdrant, API gateway, monitoring, embedding model. |
| Storage | NAS, 5TB minimum | Model weights, index persistence, audit logs. |
| Network | 10GbE between nodes | Low-latency internal communication. |
Total hardware estimate: $50,000–$120,000 depending on GPU selection. The software stack is entirely open-source: vLLM for inference, Tree-sitter + Qdrant for indexing, ripgrep for search, OpenTelemetry for monitoring.
Network isolation: Dedicated VLAN, no route to internet. All processing stays on-premise.
Authentication: LDAP/Active Directory integration. API tokens scoped to team and repository access.
Audit logging: All prompts and responses logged to tamper-evident store.
Model provenance: Cryptographic hash verification at deployment. Chain of custody for all weights.
Code isolation: No execution outside inference sandbox. Tool use mediated with explicit permissions.
The system includes a closed-loop improvement mechanism that compounds quality over time, even without model upgrades.
Capture: Every interaction is logged — the request, Plan.MD, generated code, and the developer's response (accept / modify / reject).
Analyze: Weekly automated analysis identifies recurring failure patterns. Example: "73% of rejections in payments module involve incorrect webhook handling."
Improve: Failure patterns become targeted context improvements: planner failures → better CLAUDE.md or skills; generator failures → specific path-scoped rules; library misuse → updated documentation packs.
Measure: Track first-pass acceptance rate by tier. Planner acceptance dropping? Planner context needs attention. Generation dropping with good plans? Generator rules need tuning.
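The Measure step reduces to one metric per tier. A minimal sketch, assuming a log schema with `tier` and `outcome` fields (the schema and the sample entries are hypothetical):

```python
# Sketch of the Measure step: first-pass acceptance rate per tier,
# computed from logged interactions. The log schema is an assumption.

from collections import Counter

def acceptance_by_tier(log: list[dict]) -> dict[str, float]:
    """Fraction of interactions accepted unmodified, grouped by tier."""
    totals: Counter = Counter()
    accepted: Counter = Counter()
    for entry in log:
        totals[entry["tier"]] += 1
        if entry["outcome"] == "accept":
            accepted[entry["tier"]] += 1
    return {tier: accepted[tier] / n for tier, n in totals.items()}

log = [
    {"tier": "planner", "outcome": "accept"},
    {"tier": "planner", "outcome": "modify"},
    {"tier": "generator", "outcome": "accept"},
    {"tier": "generator", "outcome": "accept"},
    {"tier": "generator", "outcome": "reject"},
]
print(acceptance_by_tier(log))
```

Splitting the metric by tier is what makes the diagnosis in the text possible: a drop isolated to one tier points at that tier's context, not at the whole system.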
Why it compounds: every convention rule added is a failure mode eliminated permanently. Every documentation pack updated is a class of hallucination removed. After 90 days, the rules file will be smaller, more targeted, and more effective than anything written on day one.
The most important property of this architecture is that it compounds. Every Plan.MD template refined produces better plans, and every feedback cycle converts developer frustration into systematic improvement rather than recurring cost.
The open-source ecosystem has delivered models capable of running this pipeline today. Claude Code's published system prompts provide the behavioral blueprint. The three-model architecture decomposes the problem into stages matching how experienced developers actually work. And Plan.MD — a simple markdown file specifying what to build before anything gets built — turns out to be the highest-leverage artifact in the entire system.
The thesis: The model is the engine. Context is the fuel. The plan is the route. Better fuel and a better route produce better output — regardless of the engine.
Request an Executive Briefing to see how Context OS governs AI agents from coding to enterprise execution.
Request Executive Briefing → demo.elixirdata.co
ElixirData | Context OS™ | The governed operating system for enterprise AI agents