Three-model architecture, Context OS, and the Plan.MD pipeline
By ElixirData Engineering | March 2026
Most teams adopting AI coding assistants in 2026 follow the same playbook: subscribe to a cloud API, plug it into the IDE, and hope for the best. The results are often underwhelming. The AI generates code that does not fit the codebase, calls APIs that do not exist, and applies patterns that were deprecated months ago. Developers spend as much time correcting the output as they would have spent writing it themselves.
This cycle is familiar enough that it has a name: the Frustration Loop. Generate code, review it, find it does not fit, regenerate with corrections, review again, and eventually either accept heavily modified output or abandon the attempt entirely.
ElixirData faced these problems — and then added another constraint: the entire system has to run offline, inside an air-gapped development network, with zero data leaving the perimeter. No cloud fallback. No external API calls. No live documentation retrieval.
This constraint turned out to be clarifying. It forced the team to confront a truth that cloud-connected teams can afford to ignore: the quality of an AI coding assistant is determined not by the model it runs, but by the context infrastructure underneath it — and more specifically, by the quality of the plan the model works from.
The thesis: The model is the engine. Context is the fuel. The plan is the route. Better fuel and a better route produce better output — regardless of the engine.
A typical AI coding assistant operates with two sources of knowledge: training data (static, frozen at a cutoff date) and whatever context fits in the current prompt (dynamic but shallow). This creates two distinct failures that produce the same downstream result: bad code.
Internal context failure: The assistant does not understand the codebase. It does not know the project uses Fastify instead of Express, that files go in lib/services/ instead of utils/, or that the codebase follows a functional style. When it needs to understand the authentication flow, it reads files one by one, burning hundreds of thousands of tokens for what a structured index could answer in a few hundred.
External context failure: The assistant does not have current knowledge of libraries and APIs. It falls back on training data — which may be months or years stale. For fast-moving libraries, this means code that calls methods that no longer exist, passes parameters in the wrong order, or misses simpler approaches added in recent versions.
Project-reported benchmarking from codebase-memory-mcp shows approximately 3,400 tokens via structured indexing versus approximately 412,000 tokens via file-by-file grep for the same five structural queries — roughly a 120x reduction.
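The mechanics behind that gap can be sketched in a few lines. The payloads and the rough four-characters-per-token heuristic below are hypothetical illustrations, not the benchmark's actual data; real numbers depend on the tokenizer and the codebase.

```python
# Rough illustration of why structural queries are cheaper than file dumps.
# The sample payloads and the ~4 chars/token heuristic are hypothetical.

def rough_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for code."""
    return max(1, len(text) // 4)

# A structured index answers "where is the auth flow?" with a compact
# list of symbols and locations...
index_answer = (
    "login() -> src/auth/login.py:12\n"
    "verify_token() -> src/auth/jwt.py:48\n"
    "refresh_session() -> src/auth/session.py:91\n"
)

# ...whereas file-by-file reading pulls entire files into the context
# window. Here: 50 hypothetical files of ~8 KB each.
raw_files = "\n".join("x" * 8_000 for _ in range(50))

print(rough_tokens(index_answer), "tokens via structured index")
print(rough_tokens(raw_files), "tokens via raw file reads")
```

The orders-of-magnitude gap, not the exact figures, is the point: the index returns an answer, while file reads return material from which an answer must still be extracted.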
These failures compound. As Dex Horthy of HumanLayer articulated: a bad line of code is a bad line of code, but a bad line in a plan leads to hundreds of bad lines of code, and bad research leads to thousands. Upstream errors cascade and amplify downstream.
If planning quality is the primary lever — if the accuracy of the plan determines the quality of everything downstream — then the architecture should optimize for planning quality above all else.
Part of the problem lies in how success is measured. Teams often track time to first output or lines of code generated. If an AI generates 200 lines in seconds but a developer spends 30 minutes refactoring them, the net productivity gain is questionable.
| Misleading metric | More useful alternative |
|---|---|
| Time to first output | First-pass acceptance rate |
| Lines of code generated | Iteration cycles per task |
| Tasks completed | Post-merge rework required |
| Generation speed | Review burden vs manual writing |
ElixirData's solution inverts the conventional single-model approach. Three models operate in a pipeline, each optimized for a different cognitive task.
Request → [Planner] → Plan.MD → [Generator] → Code → [Completer] → Edits
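The pipeline above can be sketched as three chained stages. The function names and the `call_model` stub below are illustrative assumptions standing in for real inference calls, not ElixirData's actual code.

```python
# Minimal sketch of the three-stage pipeline shown in the diagram.
# call_model() is a stub for a routed inference call (e.g. via vLLM).

def call_model(tier: str, prompt: str) -> str:
    """Stub: send `prompt` to the model serving this tier."""
    return f"[{tier} output for: {prompt[:40]}...]"

def plan(request: str, codebase_context: str) -> str:
    """Tier 1 (Planner): research and reasoning, emits Plan.MD."""
    return call_model("planner", f"{codebase_context}\n\nREQUEST: {request}")

def generate(plan_md: str) -> str:
    """Tier 2 (Generator): execute the plan, no architecture decisions."""
    return call_model("generator", f"PLAN:\n{plan_md}\n\nWrite the code.")

def handle_request(request: str, codebase_context: str) -> str:
    plan_md = plan(request, codebase_context)
    # A developer can review plan_md at this point, before code exists.
    return generate(plan_md)

print(handle_request("add rate limiting", "index summary: gateway, auth"))
```

The structural point is that the plan is a first-class intermediate value: it can be inspected, edited, or rejected between the two calls.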
| Property | Specification |
|---|---|
| Model | Qwen3-Coder-480B (35B active parameters, MoE) |
| Role | Research codebase, reason about architecture, produce Plan.MD |
| License | Apache 2.0 |
| Context window | 256K–1M tokens |
| Latency target | 30–60 seconds (thoroughness over speed) |
| Token budget | Up to 128K (deep codebase understanding) |
| Cloud alternative | Claude Opus 4.6 for non-sensitive work outside the air gap |
The planner is the most important model in the system. Its job is to research the codebase, understand the request, reason about architectural implications, and produce a detailed Plan.MD that specifies exactly what to build, where to build it, which patterns to follow, and which edge cases to handle.
For teams with cloud access for non-sensitive work, Claude Opus 4.6 is the ideal planner. Its adaptive thinking, 500K–1M token context window, and frontier-level reasoning make it the strongest option available. Use it outside the air gap for architecture research and specification generation, then transfer the outputs via secure media.
| Property | Specification |
|---|---|
| Model | GLM-5 (744B total, 40B active parameters, MoE) |
| Role | Execute Plan.MD — write the actual code |
| License | MIT |
| SWE-bench Verified | 77.8% (best open-source) |
| Latency target | < 15 seconds first token |
| Token budget | Up to 64K (Plan.MD has narrowed the scope) |
| Alternatives | Kimi K2.5 (99% HumanEval, 32B active) or DeepSeek-V3.2 (MIT) |
The generator receives Plan.MD and executes against it. It does not need to reason about architecture — the planner already did that. It translates a precise specification into correct, convention-following code.
When Plan.MD is accurate — when it correctly identifies files, patterns, edge cases, and conventions — even a mid-tier generation model will produce correct code. The plan has already done the hard cognitive work.
| Property | Specification |
|---|---|
| Model | Qwen3-Coder-Next (80B total, 3B active parameters, MoE) |
| Role | Fast inline completions and small edits |
| License | Apache 2.0 |
| SWE-bench | 70.6% (comparable to models 10–20x larger) |
| Latency target | < 500 milliseconds |
| Token budget | Up to 8K (immediate editing context only) |
| What it handles | Ghost text, autocomplete, FIM, single-function refactors |
The completer handles 80% of daily interactions by volume. It does not need Plan.MD or full context retrieval — it needs the current file, the imports, and the path-scoped rules.
Plan.MD is the critical artifact in the pipeline. It serves as an explicit, inspectable, human-reviewable contract between the planner and the generator. A well-formed Plan.MD includes:
Research findings: What the planner learned about the codebase (with file paths and line numbers)
Files to create and modify: Specific locations, not vague module references
Architecture decisions: Chosen approach with rationale
Edge cases to handle: Failure modes, boundary conditions, exceptional flows
Test plan: What to test and how
Conventions to follow: Referenced from CLAUDE.md with specific examples
The specificity matters. "Modify the auth module" is not a plan. "Add rate limiting check in src/gateway/middleware.py after line 47, following the decorator pattern used in auth.py:23-47" is a plan. The more specific the plan, the less the generator needs to guess — and guessing is where models fail.
Critically, Plan.MD is a human-readable artifact. A developer can review it before generation begins and catch errors at the plan stage, where they are cheap to fix, rather than at the code stage, where they cascade.
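As a concrete illustration, a minimal Plan.MD for the rate-limiting example above might look like the following. The file paths, line numbers, and section wording are hypothetical, extending the example in the text, not a real ElixirData artifact.

```markdown
# Plan.MD — Add rate limiting to API gateway

## Research findings
- Request handling enters at src/gateway/middleware.py (middleware chain)
- Existing cross-cutting checks use the decorator pattern in auth.py:23-47

## Files to modify
- src/gateway/middleware.py — insert rate-limit check after line 47

## Architecture decision
- Token-bucket limiter implemented as a decorator, matching the auth
  decorator; no new dependency introduced

## Edge cases
- Burst traffic at bucket boundary; clock skew between gateway instances;
  shape of the 429 response body

## Test plan
- Unit: bucket refill arithmetic
- Integration: 429 returned after N requests per minute

## Conventions
- Follow CLAUDE.md error-response format; decorators live beside middleware.py
```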
The Context OS is the orchestration layer that ensures each model sees the right context at the right time. It manages three resources that behave like operating system primitives.
Token budget as memory: A model's context window is finite. The Context OS allocates space to the most relevant context, evicts stale information, and ensures totals stay within limits. Dumping everything degrades performance the same way memory swapping does.
Retrieval as I/O: Fetching context from the codebase index, documentation packs, and convention rules is analogous to disk I/O. The Context OS caches frequently accessed context, prefetches likely-needed documentation, and batches retrieval calls.
Model routing as process scheduling: Completion requests need speed (sub-500ms). Generation requests need quality (15 seconds). Planning requests need thoroughness (30+ seconds). The Context OS routes each request to the appropriate model.
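The scheduling analogy can be made concrete. The tier names, context budgets, and latency targets below mirror the figures given in this article; the routing logic itself is a simplified illustrative assumption.

```python
# Sketch of model routing as process scheduling. Budgets and latency
# targets follow the tier tables above; the dispatch logic is illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    model: str
    max_context_tokens: int
    latency_target_ms: int

TIERS = {
    "completion": Tier("Qwen3-Coder-Next", 8_000, 500),
    "generation": Tier("GLM-5", 64_000, 15_000),
    "planning":   Tier("Qwen3-Coder-480B", 128_000, 60_000),
}

def route(request_kind: str) -> Tier:
    """Pick the serving tier for a request; unknown kinds fail loudly."""
    try:
        return TIERS[request_kind]
    except KeyError:
        raise ValueError(f"unknown request kind: {request_kind!r}")

print(route("completion").model)             # sub-500 ms tier
print(route("planning").max_context_tokens)  # deep-context tier
```

Keeping the table declarative means adding a tier, or swapping a model behind one, is a one-line change rather than a code change.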
When a developer makes a request, the Context OS assembles context from three sources:
Codebase index: Persistent representation of the entire codebase using Tree-sitter for AST parsing and Qdrant for vector storage. Code split at meaningful boundaries — functions, classes, logical blocks. Updated incrementally via Git hooks on every merge to main.
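Splitting at meaningful boundaries is the key idea. Production uses Tree-sitter precisely because it parses many languages; as a single-language stand-in, the sketch below uses Python's stdlib `ast` module to chunk at function and class boundaries. Embedding and the Qdrant upsert are deliberately stubbed out.

```python
# Boundary-aware chunking sketch. Tree-sitter generalizes this across
# languages; Python's stdlib `ast` stands in here for one language.

import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split source at function/class boundaries, not fixed byte sizes."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks  # each chunk would then be embedded and upserted to Qdrant

SAMPLE = '''\
def login(user):
    return user

class Session:
    def refresh(self):
        pass
'''

for c in chunk_python_source(SAMPLE):
    print(c["symbol"], c["start_line"], c["end_line"])
```

Because each chunk carries a symbol name and line range, the index can answer structural queries ("where is `Session` defined?") without shipping whole files into the context window.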
Documentation packs: Curated, version-pinned documentation for key dependencies as markdown files optimized for LLM consumption. Also includes actual library source code for critical dependencies. Version-pinned to exact lockfile versions.
Convention rules: Structured encoding of coding standards following Claude Code's hierarchy: CLAUDE.md (always loaded), path-scoped rules (loaded per file type), and skills (lazy-loaded bundles of instructions and resources).
Claude Code's system prompts are fully documented in the open-source repository Piebald-AI/claude-code-system-prompts. As of March 2026 (version 2.1.84), this includes the core system prompt (2,896 tokens), 18 built-in tool descriptions, plan subagent prompt (636 tokens), explore subagent prompt (494 tokens), task subagent prompt (294 tokens), approximately 40 system reminders, and tracking across 133 versions.
| Claude Code component | ElixirData mapping | Implementation |
|---|---|---|
| Plan subagent (636 tok) | Tier 1 — Planner | Adapted as system prompt for Qwen3-Coder-480B. Produces Plan.MD as standalone artifact. |
| Explore subagent (494 tok) | Tier 1 — Planner explore mode | Read-only research mode for understanding codebase structure before planning. |
| Task subagent (294 tok) | Tier 2 — Generator | Receives Plan.MD sections as individual tasks. Executes with TodoWrite tracking. |
| Core prompt (2,896 tok) | All tiers — adapted per tier | Conciseness mandate, tool-first approach, file reference format across all models. |
Conciseness mandate: Claude Code instructs the model to "answer concisely with fewer than 4 lines unless the user asks for detail" and to "minimize output tokens." Directly adopted — verbose output wastes developer time and context window space.
TodoWrite pattern: Claude Code forces the agent to maintain a task list. ElixirData implements this at the pipeline level: Plan.MD is converted into a todo list that the generator works through sequentially, with real-time progress visible to the developer.
Hooks system: Claude Code triggers scripts at lifecycle events. ElixirData implements: post-edit hooks (run linter), post-generation hooks (trigger tests), post-session hooks (log to feedback flywheel), and index-update hooks (re-index after merge).
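A lifecycle hook registry of this shape is small enough to sketch. The event names match the four hooks listed above; the register/fire API and the handlers are illustrative assumptions.

```python
# Sketch of a lifecycle hooks registry. Event names follow the four
# hooks described above; the API itself is an illustrative assumption.

from collections import defaultdict
from typing import Callable

HOOKS: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def on(event: str):
    """Decorator: register a callback for a lifecycle event."""
    def register(fn: Callable[[dict], None]):
        HOOKS[event].append(fn)
        return fn
    return register

def fire(event: str, payload: dict) -> None:
    """Run every callback registered for `event`, in order."""
    for fn in HOOKS[event]:
        fn(payload)

@on("post-edit")
def run_linter(payload: dict) -> None:
    print(f"linting {payload['file']}")

@on("post-generation")
def trigger_tests(payload: dict) -> None:
    print(f"running tests for {payload['file']}")

fire("post-edit", {"file": "src/gateway/middleware.py"})
```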
The meta-strategy: Use Claude itself (via API, outside the air gap for non-sensitive work) to write and refine the CLAUDE.md, rules, skills, and system prompts that the local models execute. Using the strongest model to configure the others is the highest-leverage approach.
CLAUDE.md (always loaded): Project-wide conventions — tech stack with versions, directory structure, naming conventions, coding standards. Start with 5–10 conventions and build iteratively.
Path-scoped rules (loaded per file type): Convention guidance triggered by file glob patterns. Example: *.test.ts triggers test-specific rules about Vitest. Keeps always-loaded CLAUDE.md small.
Skills (lazy-loaded on demand): Rich bundles of instructions, documentation, scripts, and examples. Each has a SKILL.md with a description the model uses to decide when to load it.
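The hierarchy above reduces to a simple loading rule: always include CLAUDE.md, then add whatever path-scoped rules match the file being edited. The glob patterns and rule file paths below are hypothetical examples, not ElixirData's actual configuration.

```python
# Sketch of path-scoped rule loading via glob matching. The patterns and
# rule paths are hypothetical; only the loading rule mirrors the text.

from fnmatch import fnmatch

PATH_RULES = {
    "*.test.ts": "rules/testing-vitest.md",
    "src/gateway/*.py": "rules/gateway-python.md",
    "*.sql": "rules/migrations.md",
}

def rules_for(path: str) -> list[str]:
    """CLAUDE.md is always loaded; path-scoped rules join on match."""
    loaded = ["CLAUDE.md"]
    loaded += [rule for pattern, rule in PATH_RULES.items()
               if fnmatch(path, pattern)]
    return loaded

print(rules_for("checkout.test.ts"))
print(rules_for("src/gateway/middleware.py"))
```

The payoff is exactly the one the text describes: the always-loaded file stays small, because everything file-type-specific loads only when a matching file is in play.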
| Component | Specification | Purpose |
|---|---|---|
| GPU nodes (Planner + Gen) | 2x NVIDIA A100 80GB or H100 | One for planner, one for generator. MoE architectures use VRAM efficiently. |
| GPU node (Completer) | 1x A100 80GB or RTX 4090 24GB | Dedicated to inline completions. RTX 4090 is the cost-optimized option. |
| CPU server | 32+ cores, 128GB RAM, 2TB NVMe | Context OS: indexer, Qdrant, API gateway, monitoring, embedding model. |
| Storage | NAS, 5TB minimum | Model weights, index persistence, audit logs. |
| Network | 10GbE between nodes | Low-latency internal communication. |
Total hardware estimate: $50,000–$120,000 depending on GPU selection. The software stack is entirely open-source: vLLM for inference, Tree-sitter + Qdrant for indexing, ripgrep for search, OpenTelemetry for monitoring.
Network isolation: Dedicated VLAN, no route to internet. All processing stays on-premise.
Authentication: LDAP/Active Directory integration. API tokens scoped to team and repository access.
Audit logging: All prompts and responses logged to tamper-evident store.
Model provenance: Cryptographic hash verification at deployment. Chain of custody for all weights.
Code isolation: No execution outside inference sandbox. Tool use mediated with explicit permissions.
The system includes a closed-loop improvement mechanism that compounds quality over time, even without model upgrades.
Capture: Every interaction is logged — the request, Plan.MD, generated code, and the developer's response (accept / modify / reject).
Analyze: Weekly automated analysis identifies recurring failure patterns. Example: "73% of rejections in payments module involve incorrect webhook handling."
Improve: Failure patterns become targeted context improvements: planner failures → better CLAUDE.md or skills; generator failures → specific path-scoped rules; library misuse → updated documentation packs.
Measure: Track first-pass acceptance rate by tier. Planner acceptance dropping? Planner context needs attention. Generation dropping with good plans? Generator rules need tuning.
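The Measure step reduces to one metric per tier. A minimal sketch, assuming a log schema with `tier` and `outcome` fields (the schema and the sample entries are hypothetical):

```python
# Sketch of the Measure step: first-pass acceptance rate per tier,
# computed from logged interactions. The log schema is an assumption.

from collections import Counter

def acceptance_by_tier(log: list[dict]) -> dict[str, float]:
    """Fraction of interactions accepted unmodified, grouped by tier."""
    totals: Counter = Counter()
    accepted: Counter = Counter()
    for entry in log:
        totals[entry["tier"]] += 1
        if entry["outcome"] == "accept":
            accepted[entry["tier"]] += 1
    return {tier: accepted[tier] / n for tier, n in totals.items()}

log = [
    {"tier": "planner", "outcome": "accept"},
    {"tier": "planner", "outcome": "modify"},
    {"tier": "generator", "outcome": "accept"},
    {"tier": "generator", "outcome": "accept"},
    {"tier": "generator", "outcome": "reject"},
]
print(acceptance_by_tier(log))
```

Splitting the metric by tier is what makes the diagnosis in the text possible: a drop isolated to one tier points at that tier's context, not at the whole system.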
Why it compounds: every convention rule added is a failure mode eliminated permanently. Every documentation pack updated is a class of hallucination removed. After 90 days, the rules file will be smaller, more targeted, and more effective than anything written on day one.
The most important property of this architecture is that it compounds. Every Plan.MD template refined produces better plans, and every feedback cycle converts developer frustration into systematic improvement rather than recurring cost.
The open-source ecosystem has delivered models capable of running this pipeline today. Claude Code's published system prompts provide the behavioral blueprint. The three-model architecture decomposes the problem into stages matching how experienced developers actually work. And Plan.MD — a simple markdown file specifying what to build before anything gets built — turns out to be the highest-leverage artifact in the entire system.
The thesis: The model is the engine. Context is the fuel. The plan is the route. Better fuel and a better route produce better output — regardless of the engine.
Request an Executive Briefing to see how Context OS governs AI agents from coding to enterprise execution.
Request Executive Briefing → demo.elixirdata.co
ElixirData | Context OS™ | The governed operating system for enterprise AI agents