Three-model architecture, Context OS, and the Plan.MD pipeline
By ElixirData Engineering | March 2026
Key takeaways
- AI coding assistants fail because of context infrastructure, not model capability — project-reported benchmarks show a ~120x token reduction from structured indexing vs file-by-file grep
- A three-model pipeline (Planner → Generator → Completer) decomposes the problem into stages matching how experienced developers actually work
- Plan.MD — a detailed, human-reviewable specification produced before any code is generated — is the highest-leverage artifact in the system
- Context OS orchestrates context across all three tiers: token budget as memory, retrieval as I/O, model routing as process scheduling
- The entire system runs offline in an air-gapped network using open-source models — zero data leaves the perimeter
Why AI Coding Assistants Fail
Most teams adopting AI coding assistants in 2026 follow the same playbook: subscribe to a cloud API, plug it into the IDE, and hope for the best. The results are often underwhelming. The AI generates code that does not fit the codebase, calls APIs that do not exist, and applies patterns that were deprecated months ago. Developers spend as much time correcting the output as they would have spent writing it themselves.
This cycle is familiar enough that it has a name: the Frustration Loop. Generate code, review it, find it does not fit, regenerate with corrections, review again, eventually accept heavily-modified output or abandon the attempt entirely.
ElixirData faced these problems — and then added another constraint: the entire system has to run offline, inside an air-gapped development network, with zero data leaving the perimeter. No cloud fallback. No external API calls. No live documentation retrieval.
This constraint turned out to be clarifying. It forced the team to confront a truth that cloud-connected teams can afford to ignore: the quality of an AI coding assistant is determined not by the model it runs, but by the context infrastructure underneath it — and more specifically, by the quality of the plan the model works from.
The thesis: The model is the engine. Context is the fuel. The plan is the route. Better fuel and a better route produce better output — regardless of the engine.
Two context gaps, same failure mode
A typical AI coding assistant operates with two sources of knowledge: training data (static, frozen at a cutoff date) and whatever context fits in the current prompt (dynamic but shallow). This creates two distinct failures that produce the same downstream result: bad code.
Internal context failure: The assistant does not understand the codebase. It does not know the project uses Fastify instead of Express, that files go in lib/services/ instead of utils/, or that the codebase follows a functional style. When it needs to understand the authentication flow, it reads files one by one, burning hundreds of thousands of tokens for what a structured index could answer in a few hundred.
External context failure: The assistant does not have current knowledge of libraries and APIs. It falls back on training data — which may be months or years stale. For fast-moving libraries, this means code that calls methods that no longer exist, passes parameters in the wrong order, or misses simpler approaches added in recent versions.
Project-reported benchmarking from codebase-memory-mcp shows approximately 3,400 tokens via structured indexing versus approximately 412,000 tokens via file-by-file grep for the same five structural queries — roughly a 120x reduction.
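The shape of that gap can be illustrated with a toy sketch (all symbols, paths, and numbers below are hypothetical): a structured index answers "where is X defined?" from a small symbol table, while the grep-style approach must push entire files into the prompt.

```python
# Illustrative sketch (hypothetical data): why a symbol index answers
# structural queries in a few hundred tokens instead of whole files.

def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

# A structured index maps symbols to locations up front.
symbol_index = {
    "AuthService.login": "src/services/auth.py:42",
    "AuthService.refresh": "src/services/auth.py:88",
    "RateLimiter.check": "src/gateway/middleware.py:47",
}

def query_index(symbol: str) -> str:
    """Answer 'where is X defined?' from the index alone."""
    return f"{symbol} -> {symbol_index[symbol]}"

# File-by-file reading must feed entire files to the model.
fake_file = "def login(...): ...\n" * 500   # stand-in for one real module

index_cost = approx_tokens(query_index("AuthService.login"))
grep_cost = approx_tokens(fake_file)
print(index_cost, grep_cost)  # the index answer is orders of magnitude cheaper
```

The exact ratio depends on the codebase; the point is that the index pays its cost once, at build time, instead of on every query.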
The cascade problem
These failures compound. As Dex Horthy of HumanLayer articulated: a bad line of code is a bad line of code, but a bad line in a plan leads to hundreds of bad lines of code, and bad research leads to thousands. Upstream errors cascade and amplify downstream.
If planning quality is the primary lever — if the accuracy of the plan determines the quality of everything downstream — then the architecture should optimize for planning quality above all else.
The speed trap
Part of the problem lies in how success is measured. Teams often track time to first output or lines of code generated. If an AI generates 200 lines in seconds but a developer spends 30 minutes refactoring them, the net productivity gain is questionable.
| Misleading metric | More useful alternative |
|---|---|
| Time to first output | First-pass acceptance rate |
| Lines of code generated | Iteration cycles per task |
| Tasks completed | Post-merge rework required |
| Generation speed | Review burden vs manual writing |
The Three-Model Architecture
ElixirData's solution inverts the conventional single-model approach. Three models operate in a pipeline, each optimized for a different cognitive task.
Request → [Planner] → Plan.MD → [Generator] → Code → [Completer] → Edits
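In code, the pipeline is a straightforward composition of three stages. The sketch below uses stub functions in place of the real models (all names and payloads are hypothetical):

```python
# Minimal sketch of the three-stage pipeline with stub models.
# Every model call here is a hypothetical stand-in.
from dataclasses import dataclass

@dataclass
class PlanMD:
    files: list[str]   # files to create or modify
    steps: list[str]   # ordered implementation steps

def planner(request: str) -> PlanMD:
    """Tier 1: research the codebase and produce Plan.MD (stubbed)."""
    return PlanMD(
        files=["src/gateway/middleware.py"],
        steps=[f"Implement: {request}"],
    )

def generator(plan: PlanMD) -> str:
    """Tier 2: execute Plan.MD section by section (stubbed)."""
    return "\n".join(f"# {step}" for step in plan.steps)

def completer(code: str) -> str:
    """Tier 3: fast, small edits on the generated code (stubbed)."""
    return code.rstrip() + "\n"

def handle_request(request: str) -> str:
    plan = planner(request)    # Request -> Plan.MD
    code = generator(plan)     # Plan.MD -> Code
    return completer(code)     # Code -> Edits

print(handle_request("add rate limiting"))
```

Each stage only consumes the previous stage's artifact, which is what makes the tiers independently swappable.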
Tier 1: The planner
| Property | Specification |
|---|---|
| Model | Qwen3-Coder-480B (35B active parameters, MoE) |
| Role | Research codebase, reason about architecture, produce Plan.MD |
| License | Apache 2.0 |
| Context window | 256K–1M tokens |
| Latency target | 30–60 seconds (thoroughness over speed) |
| Token budget | Up to 128K (deep codebase understanding) |
| Cloud alternative | Claude Opus 4.6 for non-sensitive work outside the air gap |
The planner is the most important model in the system. Its job is to research the codebase, understand the request, reason about architectural implications, and produce a detailed Plan.MD that specifies exactly what to build, where to build it, which patterns to follow, and which edge cases to handle.
For teams with cloud access for non-sensitive work, Claude Opus 4.6 is the ideal planner. Its adaptive thinking, 500K–1M token context window, and frontier-level reasoning make it the strongest option available. Use it outside the air gap for architecture research and specification generation, then transfer the outputs via secure media.
Tier 2: The generator
| Property | Specification |
|---|---|
| Model | GLM-5 (744B total, 40B active parameters, MoE) |
| Role | Execute Plan.MD — write the actual code |
| License | MIT |
| SWE-bench Verified | 77.8% (best open-source) |
| Latency target | < 15 seconds first token |
| Token budget | Up to 64K (Plan.MD has narrowed the scope) |
| Alternatives | Kimi K2.5 (99% HumanEval, 32B active) or DeepSeek-V3.2 (MIT) |
The generator receives Plan.MD and executes against it. It does not need to reason about architecture — the planner already did that. It translates a precise specification into correct, convention-following code.
When Plan.MD is accurate — when it correctly identifies files, patterns, edge cases, and conventions — even a mid-tier generation model will produce correct code. The plan has already done the hard cognitive work.
Tier 3: The completer
| Property | Specification |
|---|---|
| Model | Qwen3-Coder-Next (80B total, 3B active parameters, MoE) |
| Role | Fast inline completions and small edits |
| License | Apache 2.0 |
| SWE-bench | 70.6% (comparable to models 10–20x larger) |
| Latency target | < 500 milliseconds |
| Token budget | Up to 8K (immediate editing context only) |
| What it handles | Ghost text, autocomplete, FIM, single-function refactors |
The completer handles 80% of daily interactions by volume. It does not need Plan.MD or full context retrieval — it needs the current file, the imports, and the path-scoped rules.
Plan.MD: The contract between stages
Plan.MD is the critical artifact in the pipeline. It serves as an explicit, inspectable, human-reviewable contract between the planner and the generator. A well-formed Plan.MD includes:
Research findings: What the planner learned about the codebase (with file paths and line numbers)
Files to create and modify: Specific locations, not vague module references
Architecture decisions: Chosen approach with rationale
Edge cases to handle: Failure modes, boundary conditions, exceptional flows
Test plan: What to test and how
Conventions to follow: Referenced from CLAUDE.md with specific examples
The specificity matters. "Modify the auth module" is not a plan. "Add rate limiting check in src/gateway/middleware.py after line 47, following the decorator pattern used in auth.py:23-47" is a plan. The more specific the plan, the less the generator needs to guess — and guessing is where models fail.
Critically, Plan.MD is a human-readable artifact. A developer can review it before generation begins and catch errors at the plan stage, where they are cheap to fix, rather than at the code stage, where they cascade.
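A skeletal Plan.MD following the structure above might look like this (file paths, line numbers, and details are illustrative, built from the rate-limiting example):

```markdown
# Plan: Add rate limiting to API gateway

## Research findings
- Auth decorators defined in src/services/auth.py:23-47
- Gateway middleware chain assembled in src/gateway/middleware.py:30-60

## Files to modify
- src/gateway/middleware.py — insert rate limiting check after line 47

## Architecture decision
- Token-bucket limiter, following the decorator pattern in auth.py:23-47

## Edge cases
- Burst traffic at bucket refill boundary
- Missing client identifier on anonymous routes

## Test plan
- Unit: bucket refill timing. Integration: 429 response on limit breach.

## Conventions
- Functional style per CLAUDE.md; new helpers under lib/services/
```

Every section gives the generator something concrete to execute against, and every section is cheap for a human to veto before generation starts.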
The Context Operating System
The Context OS is the orchestration layer that ensures each model sees the right context at the right time. It manages three resources that behave like operating system primitives.
Three resources, three management strategies
Token budget as memory: A model's context window is finite. The Context OS allocates space to the most relevant context, evicts stale information, and ensures totals stay within limits. Dumping everything degrades performance the same way memory swapping does.
Retrieval as I/O: Fetching context from the codebase index, documentation packs, and convention rules is analogous to disk I/O. The Context OS caches frequently-accessed context, prefetches likely-needed documentation, and batches retrieval calls.
Model routing as process scheduling: Completion requests need speed (sub-500ms). Generation requests need quality (15 seconds). Planning requests need thoroughness (30+ seconds). The Context OS routes each request to the appropriate model.
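The routing layer can be as small as a classifier over request types mapped to tiers. Tier names and latency targets follow the article; the dispatch logic itself is a sketch:

```python
# Illustrative router: map request types to the three model tiers.
from enum import Enum

class Tier(Enum):
    COMPLETER = "completer"   # < 500 ms: ghost text, FIM
    GENERATOR = "generator"   # < 15 s first token: executes Plan.MD
    PLANNER = "planner"       # 30-60 s: produces Plan.MD

def route(request_type: str) -> Tier:
    """Process-scheduling analogue: send each request to the right model."""
    if request_type in ("completion", "fim", "ghost_text"):
        return Tier.COMPLETER
    if request_type in ("generate", "review"):
        return Tier.GENERATOR
    if request_type in ("plan", "architect"):
        return Tier.PLANNER
    raise ValueError(f"unknown request type: {request_type}")

print(route("fim"))        # Tier.COMPLETER
print(route("generate"))   # Tier.GENERATOR
```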
The context pipeline
When a developer makes a request, the Context OS executes six steps:
- Request classification: Determine request type. Completions go directly to Tier 3. Generation and review enter the full pipeline.
- Convention loading: Load CLAUDE.md (always on), path-scoped rules for relevant file types, and matching skills.
- Codebase retrieval: Two stages — lexical search first (ripgrep, milliseconds, resolves 60–70% of queries), then semantic search for the remainder (Qdrant vectors, 30–40%).
- Documentation retrieval: Pull version-pinned library documentation for dependencies involved.
- Context assembly: Merge, rank by relevance, trim to token budget. Priority: conventions > code spans > docs > broader context.
- Enriched inference: Assembled context combined with request and sent to the appropriate model.
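Step 5, context assembly, is essentially a priority-ordered knapsack against the token budget. A minimal sketch, with token counts approximated by character length and the priority order taken from the step above:

```python
# Sketch of context assembly: rank by priority, trim to the token budget.
# Priority order per the pipeline: conventions > code spans > docs > broader context.

PRIORITY = {"convention": 0, "code": 1, "doc": 2, "broad": 3}

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)   # crude ~4 chars/token estimate

def assemble(chunks: list[tuple[str, str]], budget: int) -> list[str]:
    """chunks: (kind, text) pairs. Keep the highest-priority chunks that fit."""
    ranked = sorted(chunks, key=lambda c: PRIORITY[c[0]])
    selected, used = [], 0
    for kind, text in ranked:
        cost = approx_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected

ctx = assemble(
    [
        ("doc", "fastify docs " * 100),                       # too large to fit
        ("convention", "use Fastify, files in lib/services/"),
        ("code", "def handler(req): ..." * 10),
    ],
    budget=100,
)
print(len(ctx))  # conventions and code fit; the oversized doc chunk is dropped
```

A production assembler would also deduplicate overlapping spans and truncate rather than drop borderline chunks, but the budget discipline is the core idea.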
The three context subsystems
Codebase index: Persistent representation of the entire codebase using Tree-sitter for AST parsing and Qdrant for vector storage. Code split at meaningful boundaries — functions, classes, logical blocks. Updated incrementally via Git hooks on every merge to main.
Documentation packs: Curated, version-pinned documentation for key dependencies as markdown files optimized for LLM consumption. Also includes actual library source code for critical dependencies. Version-pinned to exact lockfile versions.
Convention rules: Structured encoding of coding standards following Claude Code's hierarchy: CLAUDE.md (always loaded), path-scoped rules (loaded per file type), and skills (lazy-loaded bundles of instructions and resources).
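Splitting at meaningful boundaries, rather than fixed character windows, is what makes the index answer structural queries cheaply. The real pipeline uses Tree-sitter; as a stand-in, the same idea for Python sources using the standard-library ast module:

```python
# Boundary-aware splitting sketch. The production indexer uses Tree-sitter;
# Python's ast module illustrates the same idea for Python files only.
import ast

def split_at_boundaries(source: str) -> list[tuple[str, int, int]]:
    """Return (name, start_line, end_line) for each top-level function/class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append((node.name, node.lineno, node.end_lineno))
    return chunks

sample = '''\
def login(user):
    return token_for(user)

class RateLimiter:
    def check(self, key):
        return True
'''

print(split_at_boundaries(sample))
# [('login', 1, 2), ('RateLimiter', 4, 6)]
```

Each chunk then becomes one embedding in Qdrant, so a retrieval hit returns a whole function or class instead of an arbitrary window.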
Building from Claude Code's System Prompts
Claude Code's system prompts are fully documented in the open-source repository Piebald-AI/claude-code-system-prompts. As of March 2026 (version 2.1.84), this includes the core system prompt (2,896 tokens), 18 built-in tool descriptions, the plan subagent prompt (636 tokens), the explore subagent prompt (494 tokens), the task subagent prompt (294 tokens), and approximately 40 system reminders, with changes tracked across 133 versions.
Mapping Claude Code subagents to the pipeline
| Claude Code component | ElixirData mapping | Implementation |
|---|---|---|
| Plan subagent (636 tok) | Tier 1 — Planner | Adapted as system prompt for Qwen3-Coder-480B. Produces Plan.MD as standalone artifact. |
| Explore subagent (494 tok) | Tier 1 — Planner explore mode | Read-only research mode for understanding codebase structure before planning. |
| Task subagent (294 tok) | Tier 2 — Generator | Receives Plan.MD sections as individual tasks. Executes with TodoWrite tracking. |
| Core prompt (2,896 tok) | All tiers — adapted per tier | Conciseness mandate, tool-first approach, file reference format across all models. |
Key patterns adapted from Claude Code
Conciseness mandate: Claude Code instructs the model to "answer concisely with fewer than 4 lines unless the user asks for detail" and to "minimize output tokens." Directly adopted — verbose output wastes developer time and context window space.
TodoWrite pattern: Claude Code forces the agent to maintain a task list. ElixirData implements this at the pipeline level: Plan.MD is converted into a todo list that the generator works through sequentially, with real-time progress visible to the developer.
Hooks system: Claude Code triggers scripts at lifecycle events. ElixirData implements: post-edit hooks (run linter), post-generation hooks (trigger tests), post-session hooks (log to feedback flywheel), and index-update hooks (re-index after merge).
The meta-strategy: Use Claude itself (via API, outside the air gap for non-sensitive work) to write and refine the CLAUDE.md, rules, skills, and system prompts that the local models execute. Using the strongest model to configure the others is the highest-leverage approach.
The convention hierarchy
CLAUDE.md (always loaded): Project-wide conventions — tech stack with versions, directory structure, naming conventions, coding standards. Start with 5–10 conventions and build iteratively.
Path-scoped rules (loaded per file type): Convention guidance triggered by file glob patterns. Example: *.test.ts triggers test-specific rules about Vitest. Keeps always-loaded CLAUDE.md small.
Skills (lazy-loaded on demand): Rich bundles of instructions, documentation, scripts, and examples. Each has a SKILL.md with a description the model uses to decide when to load it.
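Path-scoped loading reduces to glob matching against the file being edited. A sketch using the standard-library fnmatch module (rule contents are hypothetical examples):

```python
# Sketch: load only the convention rules whose glob matches the current file.
from fnmatch import fnmatch

PATH_RULES = {
    "*.test.ts": "Use Vitest; one describe block per unit under test.",
    "src/services/*.py": "Functional style; no classes holding mutable state.",
    "*.sql": "Migrations must be reversible.",
}

def rules_for(path: str) -> list[str]:
    """Return the path-scoped rules this file triggers (CLAUDE.md is separate,
    since it is always loaded regardless of path)."""
    return [rule for pattern, rule in PATH_RULES.items() if fnmatch(path, pattern)]

print(rules_for("src/api/user.test.ts"))   # the Vitest rule only
print(rules_for("src/services/auth.py"))   # the functional-style rule only
```

Because rules load only on a glob hit, the always-on context stays small no matter how many file-type conventions accumulate.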
Infrastructure
Hardware specification
| Component | Specification | Purpose |
|---|---|---|
| GPU nodes (Planner + Gen) | 2x NVIDIA A100 80GB or H100 | One for planner, one for generator. MoE architectures use VRAM efficiently. |
| GPU node (Completer) | 1x A100 80GB or RTX 4090 24GB | Dedicated to inline completions. RTX 4090 is the cost-optimized option. |
| CPU server | 32+ cores, 128GB RAM, 2TB NVMe | Context OS: indexer, Qdrant, API gateway, monitoring, embedding model. |
| Storage | NAS, 5TB minimum | Model weights, index persistence, audit logs. |
| Network | 10GbE between nodes | Low-latency internal communication. |
Total hardware estimate: $50,000–$120,000 depending on GPU selection. The software stack is entirely open-source: vLLM for inference, Tree-sitter + Qdrant for indexing, ripgrep for search, OpenTelemetry for monitoring.
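One way to wire this stack is one vLLM server per tier, matching the hardware table above. The commands below are a sketch only — the model identifiers, ports, context lengths, and parallelism settings are assumptions to adapt to the actual deployment:

```shell
# Hypothetical vLLM topology: one OpenAI-compatible server per tier.
# Adjust model IDs, ports, GPUs, and context lengths to your environment.

# Tier 1 planner on GPU node 1 (large context window)
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --port 8001 --tensor-parallel-size 2 --max-model-len 262144

# Tier 2 generator on GPU node 2
vllm serve <glm-5-weights-path> \
  --port 8002 --tensor-parallel-size 2 --max-model-len 65536

# Tier 3 completer on the dedicated completion GPU (latency-sensitive)
vllm serve <qwen3-coder-next-weights-path> \
  --port 8003 --max-model-len 8192
```

The Context OS gateway then routes requests to ports 8001–8003 according to request classification.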
Security architecture
Network isolation: Dedicated VLAN, no route to internet. All processing stays on-premise.
Authentication: LDAP/Active Directory integration. API tokens scoped to team and repository access.
Audit logging: All prompts and responses logged to tamper-evident store.
Model provenance: Cryptographic hash verification at deployment. Chain of custody for all weights.
Code isolation: No execution outside inference sandbox. Tool use mediated with explicit permissions.
The Feedback Flywheel
The system includes a closed-loop improvement mechanism that compounds quality over time, even without model upgrades.
Capture: Every interaction is logged — the request, Plan.MD, generated code, and the developer's response (accept / modify / reject).
Analyze: Weekly automated analysis identifies recurring failure patterns. Example: "73% of rejections in payments module involve incorrect webhook handling."
Improve: Failure patterns become targeted context improvements: planner failures → better CLAUDE.md or skills; generator failures → specific path-scoped rules; library misuse → updated documentation packs.
Measure: Track first-pass acceptance rate by tier. Planner acceptance dropping? Planner context needs attention. Generation dropping with good plans? Generator rules need tuning.
Why it compounds: every convention rule added is a failure mode eliminated permanently. Every documentation pack updated is a class of hallucination removed. After 90 days, the rules file will be smaller, more targeted, and more effective than anything written on day one.
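The flywheel's core metric is a grouped ratio over the interaction log. A sketch (the event schema here is hypothetical):

```python
# Sketch: first-pass acceptance rate per tier from the interaction log.
# The event schema is a hypothetical stand-in for the real capture format.
from collections import defaultdict

events = [
    {"tier": "planner", "outcome": "accept"},
    {"tier": "planner", "outcome": "reject"},
    {"tier": "generator", "outcome": "accept"},
    {"tier": "generator", "outcome": "modify"},   # modified counts as not first-pass
    {"tier": "generator", "outcome": "accept"},
]

def acceptance_by_tier(events: list[dict]) -> dict[str, float]:
    """First-pass acceptance = unmodified accepts / total, grouped by tier."""
    totals: dict[str, int] = defaultdict(int)
    accepts: dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["tier"]] += 1
        if e["outcome"] == "accept":
            accepts[e["tier"]] += 1
    return {tier: accepts[tier] / totals[tier] for tier in totals}

print(acceptance_by_tier(events))
# {'planner': 0.5, 'generator': 0.6666666666666666}
```

Tracked weekly per tier, a drop in one ratio points at exactly one subsystem to fix, which is what makes the diagnosis in the Measure step mechanical.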
Nine Best Practices
- Invest in planning quality above all else. If Plan.MD is accurate, the choice of generation model barely matters. A detailed, correct plan with specific file references, pattern examples, and edge case coverage will produce good code from any competent model.
- Make Plan.MD a reviewable artifact. Plan.MD should be visible to the developer before generation begins. Correcting a wrong plan takes seconds; correcting wrong code from a wrong plan takes minutes or hours.
- Use a three-model strategy. Completion, generation, and planning are fundamentally different cognitive tasks. Running all three on a single model forces compromises in every direction. MoE architectures make multi-model strategies practical.
- Build Claude Code patterns into local models. Claude Code's system prompts represent thousands of hours of prompt engineering. Do not reinvent them — adapt them. Use Claude itself to write the prompts your local models execute.
- Build the rules file iteratively. Start with 5–10 conventions. Deploy. Observe rejections. Add targeted rules. Remove rules that never trigger. After 90 days the file will be organically tuned to actual failure modes.
- Version-pin everything. Every documentation pack must be version-pinned to the exact library version in the lockfile. Stale documentation — even by one minor version — can be worse than none.
- Measure first-pass acceptance rate. Lines generated and tokens processed are operational metrics. First-pass acceptance rate captures whether the system produces useful code. Target: above 60% after 6 months.
- Deploy in three phases. Phase 1 (weeks 1–4): Completer only with 5–8 pilot developers. Phase 2 (weeks 5–10): Full pipeline with Plan.MD. Phase 3 (weeks 11–16): Feedback flywheel operational, skills library, performance tuning.
- Plan for quarterly model refreshes. The open-source landscape evolves rapidly. Each refresh: validate on a subset, compare acceptance rates, cut over only when the new model matches or exceeds production performance.
Conclusion
The most important property of this architecture is that it compounds over time, even without model upgrades. Every convention rule added eliminates a failure mode permanently. Every Plan.MD template refined produces better plans. Every feedback cycle converts developer frustration into systematic improvement.
The open-source ecosystem has delivered models capable of running this pipeline today. Claude Code's published system prompts provide the behavioral blueprint. The three-model architecture decomposes the problem into stages matching how experienced developers actually work. And Plan.MD — a simple markdown file specifying what to build before anything gets built — turns out to be the highest-leverage artifact in the entire system.
The thesis holds: the model is the engine, context is the fuel, and the plan is the route. Better fuel and a better route produce better output, whatever the engine.
References
- Claude Code System Prompts: github.com/Piebald-AI/claude-code-system-prompts (v2.1.84, March 2026)
- Qwen3-Coder: github.com/QwenLM/Qwen3-Coder (Apache 2.0, 480B / 35B active)
- GLM-5: Zhipu AI (MIT License, 744B / 40B active, 77.8% SWE-bench)
- Qwen3-Coder-Next: Alibaba (Apache 2.0, 80B / 3B active)
- Kimi K2.5: Moonshot AI (MIT License, 1T / 32B active, 99% HumanEval)
- DeepSeek-V3.2: DeepSeek (MIT License, 685B / 37B active)
- Advanced Context Engineering: Dex Horthy, HumanLayer, August 2025
- Codified Context: Aristidis Vasilopoulos, arXiv, February 2026
- Coding Agents in Feb 2026: Calvin French-Owen, February 2026
Related Resources
- Defining Context OS — The Definitive Article
- Context Engineering Is Necessary But Not Sufficient
- Your AI Agent Is Failing in Production: 9 Reasons, None Are the LLM
- What Is Context OS? — The Complete Guide
- The Decision Gap: Why Enterprise AI Agents Fail in Production
See Context OS in action
Request an Executive Briefing to see how Context OS governs AI agents from coding to enterprise execution.
Request Executive Briefing → demo.elixirdata.co
ElixirData | Context OS™ | The governed operating system for enterprise AI agents