Spike: warp-drive per-chunk model routing
Status: Complete — recommendation below. Question (#379): Should warp-drive run a cheap pre-pass that routes each chunk to the cheapest sufficient model tier, escalating to a strong model only when warranted? Origin: Capability #376 — applying the
llm-tieringprinciple to BoB’s own token-heavy orchestration. BoB is the POC. Date: 2026-06-21.
Conditional GO — but not the obvious design. The naive idea (“assess chunk complexity, switch the main-loop model”) is both infeasible in the harness and aimed at the wrong cost. Measured data from this capability’s own warp-drive session shows the dominant per-chunk cost is cache-read of the accumulating session context (67–81%), not fresh reasoning (output: 15–25%). Switching the model the main loop runs under — even if the harness allowed it mid-session, which it does not — would only touch the 15–25% slice.
The lever that actually moves the dominant cost is isolated-context subagent delegation: run a mechanical chunk in a fresh, minimal Task subagent at a cheap tier. That resets the cache-read burden and drops the output rate. This is a real, scoped opportunity — worth a piloted follow-up, not a blind build. Follow-up filed: see §7.
1. Where a routing decision could sit
Section titled “1. Where a routing decision could sit”warp-drive runs inside the Claude Code harness: the “model” doing chunk work is the session’s main-loop model, not a raw SDK call the loop controls. That constrains the design space to three options:
| Option | Mechanism | Feasible? |
|---|---|---|
| A. Switch the main-loop model per chunk | Change the session model mid-loop based on a complexity verdict | No — the harness does not expose mid-session model switching to the loop. A session runs under one model. |
| B. Cheap pre-pass gates, main loop still codes | A Haiku assessor labels the chunk; the main loop proceeds anyway | Low value — the assessment can’t change who does the work (see A), so it saves nothing; it only adds a call. |
| C. Delegate the chunk to an isolated subagent at a chosen tier | Main loop becomes an orchestrator: a cheap pre-pass picks a tier, the chunk is implemented by a Task subagent spawned at that tier in fresh context |
Yes — the Agent/Task tool already takes a per-agent model. This is the only option that both works and attacks the dominant cost. |
So “per-chunk model routing” in warp-drive is really “per-chunk execution delegation.” The
routing decision sits at the start of the coding phase: assess → pick tier → either keep
the chunk on the main loop (hard/interdependent) or hand it to a cheap isolated subagent
(mechanical/self-contained).
2. The measured baseline (real chunks, this session)
Section titled “2. The measured baseline (real chunks, this session)”The three requirements that built this capability (#377 skill, #378 runbook, #380 agent
guidance) ran as warp-drive chunks on the session model (Opus 4.8, 1M context). Token
deltas captured by the warp-drive token-snapshotter, costed with the repo’s own pricing
model (scripts/warp-drive/token-report.js: output $75, cache-read $1.50, cache-create
$18.75 per MTok):
| Chunk | Output tok | Cache-read tok | Cost | Cache-read share | Output share |
|---|---|---|---|---|---|
| #377 skill | 55,068 | 7.32 M | $16.36 | 67% | 25% |
| #378 runbook | 13,063 | 2.75 M | $5.47 | 76% | 18% |
| #380 agent guidance | 1,218 | 0.32 M | $0.59 | 81% | 15% |
| Session total | ≈ $22.42 | ~70% | ~22% |
Every one of these was markdown authoring — squarely within Sonnet’s capability, and the mechanical parts (frontmatter, checklists, cross-links) within Haiku’s. They ran on the most expensive tier by default, because warp-drive has no notion of tiering. That is the waste this spike set out to size.
3. The key finding: cache-read dominates
Section titled “3. The key finding: cache-read dominates”The headline isn’t “we used Opus for doc work” — it’s where the money went. Output tokens (the part a cheaper model would bill less for) are only 15–25% of per-chunk cost. The 1M-context session re-reads its entire accumulating context every turn, so cache-read is 67–81% and grows with the conversation.
Consequences:
- Same-session model swaps are nearly pointless. Even if option A were possible, cutting the output rate (e.g. Opus→Sonnet, $75→$15/MTok) on a 20%-of-cost slice yields ~16% savings at best, while the 70% cache-read slice is untouched.
- Isolated context is the real prize. A subagent that starts fresh carries kilobytes of context, not 7 megatokens. Delegating the #377 chunk to a fresh Sonnet subagent would have attacked both levers: a tiny cache-read footprint and a 5× cheaper output rate.
- Difficulty ≠ cost here. Cost tracked context size and turn count, not the intrinsic hardness of the chunk. That’s the opposite of the bodfeed gate’s setting (where the expensive call is the cost). The discipline transfers; the cost model does not.
4. Projected delta (and why it’s a projection, not a controlled A/B)
Section titled “4. Projected delta (and why it’s a projection, not a controlled A/B)”A rigorous answer needs both arms measured: the same chunk run on the main loop and via a cheap isolated subagent, with quality judged equal. This spike measured the gate-off arm for real (§2) but did not run a controlled gate-on arm — the chunks were already complete, and a fair both-arms trial is itself the first task of the pilot. So the delta below is a projection from tier pricing + isolated-context, labelled as such:
- Mechanical chunk on the main loop (measured): the #377-class chunk ≈ $16, ~70% cache-read.
- Same chunk in a fresh Sonnet subagent (projected): small context (≪1 MTok cache-read) + Sonnet output rate → order-of-magnitude lower, plausibly $1–3.
- The projection’s load-bearing assumption — quality holds on mechanical chunks — is exactly what the pilot must verify, not assume.
No silent cap: this spike did not run the gate-on arm. Any figure above is an estimate; treat the controlled A/B as unfinished work owned by the follow-up.
5. Trade-offs and risks of delegation (option C)
Section titled “5. Trade-offs and risks of delegation (option C)”- Context loss is the real risk. A subagent in fresh context can’t see the plan, prior chunks’ decisions, or the conversation. warp-drive chunks are often interdependent (chunk 3 builds on chunk 1). Delegating an interdependent chunk blind invites rework that erases the saving. Mitigation: only delegate self-contained, mechanical chunks, and pass a tight context brief into the subagent prompt.
- Quality verification cost. A cheap subagent’s output must be checked. If the main loop
re-reads and re-reasons over everything the subagent did, some saving is clawed back. The
fail-open discipline from
llm-tieringapplies: on low subagent confidence or a failed check, the main loop takes the chunk back — the strong path stays the backstop. - State-machine complexity. The
codingphase gains an assess→route fork and a delegate-or-keep branch. Worth it only if the saving clears that added surface. - Classifier error. A cheap assessor that mislabels a hard chunk as mechanical is the dangerous failure. Bias the classifier toward “keep on main loop” (escalate-on-doubt), the mirror of the gate’s lenient-on-signal stance.
6. How this maps back to llm-tiering
Section titled “6. How this maps back to llm-tiering”The pattern’s discipline transfers cleanly; its cost model does not. Same: cheap pass routes, expensive path is the backstop, fail-open on doubt, measure before trusting, disableable by flag. Different: bodfeed’s expensive cost is the call; warp-drive’s is the context. So the warp-drive application is “tier where the work runs (isolated vs. main loop),” not “tier which model the call uses.” Capturing that distinction is the main intellectual output of this spike.
7. Recommendation & follow-up
Section titled “7. Recommendation & follow-up”Conditional GO, scoped tightly:
- Pilot option C (isolated cheap-tier subagent delegation) on a whitelist of provably mechanical, self-contained chunk types only — e.g. seed-data generation, doc count/ cross-link fixes, lint/format chunks. Never on interdependent or design chunks.
- Gate it behind a
_workflowflag (default off), with the main loop as the fail-open backstop — identical-end-state when disabled, per thellm-tieringsafety rules. - First task of the pilot is the controlled A/B this spike deliberately left unrun: same mechanical chunk, main loop vs. delegated subagent, token/cost delta recorded with a quality judge confirming parity.
A follow-up requirement carrying this design has been filed under #376. If the pilot’s A/B shows the saving doesn’t survive context-brief overhead and verification cost, that becomes a no-go lesson — recorded then, on real both-arms data, rather than guessed now.
8. A/B results (#384) — GO confirmed
Section titled “8. A/B results (#384) — GO confirmed”The controlled A/B the spike deferred has now been run (req #384). A representative whitelisted-mechanical task — generate an idempotent SQLite seed file covering all lifecycle states — was executed two ways, with both arms measured from real transcripts and an independent judge subagent confirming output parity.
| Arm | Execution mode | Measured tokens (in / out / cache-read / cache-write) | Cost* | Cache-read share |
|---|---|---|---|---|
| B — delegated | Haiku 4.5, isolated subagent | 11 / 500 / 20,361 / 44,126 | $0.060 | 3% |
| comparison | Sonnet 4.6, isolated subagent | 5 / 546 / 43,393 / 23,696 | $0.110 | — |
| A — status quo | Opus 4.8, main loop (measured §2, small chunk) | — / 1,218 / 318,462 / 1,262 | $0.593 | 81% |
| A — status quo | Opus 4.8, main loop (measured §2, mid chunk) | 425 / 13,063 / 2,754,707 / 18,634 | $5.468 | 76% |
* Claude 4.x list pricing per MTok — Opus 15/75/1.50/18.75, Sonnet 3/15/0.30/3.75, Haiku 1/5/0.10/1.25.
Findings:
- 10–92× cheaper. The delegated Haiku arm cost $0.06 vs. $0.59–$5.47 for the equivalent main-loop Opus chunk — a 9.9× saving against the smallest measured mechanical chunk, 91.6× against a mid-size one.
- The lever is context, confirmed. Cache-read is 76–81% of main-loop cost but only 3% of the isolated subagent’s — the subagent reads ~20k tokens of fresh context, not the ~0.3–2.8M the main loop re-reads. This is the spike’s §3 thesis, now measured on both arms.
- Quality holds. Both arms produced valid, idempotent SQL covering all six statuses with
correct conventions; both executed and re-applied cleanly against the real schema; an
independent judge rated them equivalent (
parity: true). Mechanical work does not need the expensive tier. - Honest caveat. Arm A is real historical main-loop measurement (§2), not a fresh isolated re-run — reproducing an isolated main-loop measurement is polluted by orchestration overhead. The direction and magnitude are not in doubt; the exact multiplier varies with chunk size.
Decision: GO. The saving is real, large, and survives quality parity. Implemented behind
the delegate_mechanical_chunks flag (default off) — see commands/warp-drive.md coding
phase and WORKFLOW_DEFAULTS. The no-go lesson path (§7) is therefore not triggered.
Appendix — reproducing the baseline
Section titled “Appendix — reproducing the baseline”# Per-chunk token deltas are captured live in the warp-drive state file:jq '.token_usage.chunk_snapshots' .claude/.warp-drive-state.json# Cost model + aggregate reporting across sessions:node scripts/warp-drive/token-report.js