Routing DSL: compose a panel of models that thinks like Fable 5
For two years the playbook for "more intelligence" has been "wait for the next model." We think that's the wrong unit of progress. The frontier isn't a single checkpoint — it's a panel. Give three good models the same hard problem, let them disagree, and arbitrate between the answers, and the panel beats any one of its members. Often it beats the next model up the price chart.
The Routing DSL is how you build that panel. It's a programmable routing strategy, written in YAML + CEL, that turns your OrcaRouter endpoint into an inference graph: route by difficulty, route by task, fan out to several models at once, judge or vote on their outputs, fall back when confidence is low, and tune the whole thing for cost, latency, or quality. You write rules; the gateway compiles them and evaluates them on every request inside a 5 ms budget.
This post is the engineering tour: the grammar, the variables you can branch on, the four arbiters, the cascade, and a complete production ruleset at the end.
The result first
Two illustrative benchmarks. (Numbers are illustrative — they're meant to show the shape of the effect, not to be quoted as official scores.)
Frontier comparison — a difficulty-routed DSL endpoint vs. the solo frontier:

Fusion panels vs. solo models — scored on 93 of 100 tasks (from OpenRouter):

Three things worth staring at:
Every fusion panel beats every one of its own members. Opus 4.8 + GPT-5.5 (~67.5%) clears Opus solo (~58.5%) by about 9 points and GPT-5.5 solo (~60%) by about 7.5. Disagreement is signal; arbitration harvests it.
Fusion reaches the next tier. Three different panels cross Fable 5 solo (~65.5%) using only the models below it.
You don't need expensive members. Opus + Opus self-fusion (~65.5%) matches Fable 5 with one model and a sampler. A panel of cheap models — Gemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro (~64.5%) — lands a hair under Fable 5 at a fraction of the per-token cost. That's the whole thesis: buy intelligence with topology, not with the next price tier.
The Routing DSL is the control surface that lets you spend that topology only where it pays — cheap models on the easy 80%, a fusion panel on the hard tail.
The grammar in 30 seconds
A ruleset has three parts: a version, a list of rules, and a required default. Rules are evaluated top to bottom, and the first rule whose when: is true wins. A rule with no when: always matches.
version: 1
rules:
  - id: only_rule
    use: { model: "claude-sonnet-4-6" }
default:
  delegate: balanced
The when: is a CEL boolean expression: sandboxed, RE2-only regex, no loops, no I/O, microsecond evaluation, with a single 5 ms deadline shared across the whole ruleset. The use: is the effect: where the request goes and how it's tuned. Limits are deliberately small (≤30 rules, ≤16 KiB of source, ≤200 chars per when:) so a ruleset stays auditable.
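The first-match-wins order described above can be modeled in a few lines of Python. This is a minimal sketch under stated assumptions, not the gateway's implementation: Rule and route are hypothetical names, and a plain Python predicate stands in for the compiled CEL expression.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Features = dict[str, Any]


@dataclass
class Rule:
    """Hypothetical stand-in for one compiled rule."""
    id: str
    use: dict[str, Any]
    # In the DSL this is a CEL expression; None models a rule with no `when:`.
    when: Optional[Callable[[Features], bool]] = None


def route(rules: list[Rule], default: dict[str, Any],
          features: Features) -> tuple[str, dict[str, Any]]:
    """Return (rule id, effect) for the first matching rule, top to bottom."""
    for rule in rules:
        if rule.when is None or rule.when(features):
            return rule.id, rule.use
    # No rule matched, so the required default applies.
    return "default", default


rules = [
    Rule("hard", {"model": "claude-opus-4-8"}, lambda f: f["difficulty"] > 0.8),
    Rule("easy", {"model": "gemini-3-flash"}, lambda f: f["difficulty"] < 0.3),
]
default = {"delegate": "balanced"}

print(route(rules, default, {"difficulty": 0.9}))  # matches "hard"
print(route(rules, default, {"difficulty": 0.5}))  # falls through to default
```

Because evaluation stops at the first match, rule order is part of the ruleset's meaning: a broad rule placed early shadows every narrower rule below it.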
Primitive 1 — route by difficulty and task
The distributor classifies every request before routing and exposes the features to CEL. You branch on them directly:
version: 1
rules:
  - id: hard_reasoning
    when: difficulty > 0.8
    use:
      model: "claude-opus-4-8"
      reasoning_effort: "high"
      thinking_budget_tokens: 32000
  - id: code_path
    when: task_class == "code" && code_keyword_density > 0.5
    use: { model: "gpt-5.5" }
  - id: cheap_chat
    when: difficulty < 0.3
    use: { model: "gemini-3-flash" }
default:
  delegate: balanced
The variables you can read in when: (abbreviated; see the full reference in the docs):
Group | Examples
Request shape | request.input_tokens, request.output_max_tokens, request.stream, request.vision, request.message_count, request.has_tools
Classification | task_class (chat/code/agent/vision/audio/rag/creative), difficulty (0.0–1.0), code_keyword_density, reasoning_cue_count, log_prompt_tokens, tool_count
Session | agent_state.turn, agent_state.tools_used, agent_state.has_edited, agent_state.last_test_failed, agent_state.consecutive_errors, agent_state.models_tried
Context | headers["x-…"], user.group, token.name, time.hour, workspace.id
…plus six macros for the things regex-over-payload is good at: system_prompt_matches(re), user_message_matches(re), tool_definitions_include(name), tool_calls_present_any([…]), tool_results_from_any([…]), header_matches(name, re).
Any destination can carry per-call knobs, translated to each provider's native params by the relay adapter: reasoning_effort (low/medium/high), thinking_budget_tokens (1024–64000), samples (1–16), temperature (0.0–2.0), plus denylist-guarded param_override / header_override. That's already enough to build the difficulty-routed endpoint from Table A: cheap model on the easy tail, Opus with a thinking budget on the hard one.
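The knob ranges listed above are the kind of thing a lint pass can check mechanically. The sketch below is an illustration only: check_knobs and its message format are hypothetical, and the real linter's behavior may differ. Only the ranges themselves come from the text.

```python
from typing import Any

# Ranges as stated in the text; everything else here is an assumption.
KNOB_RANGES = {
    "thinking_budget_tokens": (1024, 64000),
    "samples": (1, 16),
    "temperature": (0.0, 2.0),
}
REASONING_EFFORTS = {"low", "medium", "high"}


def check_knobs(destination: dict[str, Any]) -> list[str]:
    """Return one message per out-of-range knob; an empty list means clean."""
    errors = []
    effort = destination.get("reasoning_effort")
    if effort is not None and effort not in REASONING_EFFORTS:
        errors.append(f"reasoning_effort={effort!r} is not low/medium/high")
    for name, (low, high) in KNOB_RANGES.items():
        value = destination.get(name)
        if value is not None and not (low <= value <= high):
            errors.append(f"{name}={value} is outside {low}..{high}")
    return errors


print(check_knobs({"model": "claude-opus-4-8", "thinking_budget_tokens": 32000}))
print(check_knobs({"model": "gpt-5.5", "samples": 32}))
```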
Primitive 2 — fan out to a panel (fusion)
This is where the benchmark lift comes from. A parallel: effect dispatches the request to 2–5 legs concurrently, then an arbiter decides what the client actually sees:
- id: hard_tail_panel
  when: difficulty > 0.7 && task_class == "agent"
  use:
    parallel:
      - { model: "anthropic/claude-opus-4-8", reasoning_effort: "high" }
      - { model: "openai/gpt-5.5", thinking_budget_tokens: 16000 }
      - { model: "google/gemini-3.1-pro", temperature: 0.3 }
    arbiter:
      strategy: best_of_n
      model: "anthropic/claude-sonnet-4-6" # the judge
      template: judge_code
      max_latency_ms: 120000
      on_disagreement: # majority-only escape hatch
        model: "anthropic/claude-opus-4-8"
        reasoning_effort: "high"
Four arbiter strategies, each a different answer to "whose output wins?":
first — race the legs, serve the first success, cancel the losers. Optimizes latency (you get the fastest of N).
majority — structured vote across the legs' outputs, no extra model call. When the legs split with no strict majority, the optional on_disagreement: branch re-dispatches a fresh, stronger attempt instead of serving a tie-break. Optimizes robustness on tasks with a canonical answer.
best_of_n — an LLM judge reads all candidates and ranks them. This is the Opus + GPT-5.5 → judge configuration from Table B. Optimizes quality on open-ended work; falls back to first-successful if the judge errors.
tests_pass — execution-grounded: serve the candidate whose patch actually makes the test suite pass. No judge guessing — the harness decides. This is the strongest arbiter for code/agent work. The verifier lives outside the gateway (wired via a VerifierProvider); with none wired, it degrades to first-successful.
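To make the majority strategy concrete, here is a minimal sketch of a strict-majority vote with an on_disagreement escape hatch. It rests on assumptions: answers are compared after trivial string normalization (the gateway's "structured vote" may compare parsed fields instead), the tie-break used when no fallback is configured is assumed to be the first leg's output, and majority and arbitrate are hypothetical names.

```python
from collections import Counter
from typing import Callable, Optional


def majority(outputs: list[str]) -> Optional[str]:
    """Return an output held by a strict majority of legs, else None."""
    if not outputs:
        return None
    # Assumption: whitespace and case do not distinguish answers.
    keys = [o.strip().lower() for o in outputs]
    winner, votes = Counter(keys).most_common(1)[0]
    if votes * 2 <= len(keys):
        return None  # split vote: no strict majority
    return outputs[keys.index(winner)]


def arbitrate(outputs: list[str],
              on_disagreement: Optional[Callable[[], str]] = None) -> Optional[str]:
    """Serve the majority answer, or re-dispatch when the legs split."""
    winner = majority(outputs)
    if winner is not None:
        return winner
    if on_disagreement is not None:
        return on_disagreement()  # fresh, stronger attempt
    # Assumed tie-break: fall back to the first leg's output.
    return outputs[0] if outputs else None


print(arbitrate(["42", " 42 ", "41"]))
print(arbitrate(["a", "b", "c"], on_disagreement=lambda: "escalated answer"))
```

Note that a 2-2 split across four legs is not a strict majority, so it also takes the on_disagreement path.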
max_latency_ms (1000–600000, default 120000) caps the fan-out so one slow leg can't stall the response — laggards are dropped. Nesting parallel inside parallel is rejected at lint; the panel is intentionally one level deep.
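The first strategy combined with a latency cap can be sketched with asyncio. This is a rough model, not the gateway's runtime: race_first is a hypothetical name, each leg is a zero-argument coroutine function standing in for a provider call, and the cap is expressed in seconds rather than milliseconds.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional

Leg = Callable[[], Coroutine[Any, Any, str]]


async def race_first(legs: list[Leg], max_latency_s: float) -> Optional[str]:
    """Serve the first leg that succeeds; cancel whatever is still running."""
    pending = {asyncio.create_task(leg()) for leg in legs}
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_latency_s
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                return None  # latency cap hit before any leg succeeded
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()  # first success wins
        return None  # every leg failed
    finally:
        for task in pending:
            task.cancel()  # drop the laggards


async def fast() -> str:
    await asyncio.sleep(0.01)
    return "fast"


async def slow() -> str:
    await asyncio.sleep(0.5)
    return "slow"


async def broken() -> str:
    raise RuntimeError("provider error")


print(asyncio.run(race_first([slow, broken, fast], max_latency_s=1.0)))
```

A failed leg does not end the race; it is skipped and the remaining legs keep running until one succeeds or the cap expires.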
Availability note: the N-way fan-out runtime is gated behind the server flag ROUTING_DSL_ENSEMBLE_RUNTIME while per-leg billing is hardened on staging — that's why fusion is preview, not GA. With the flag off, a parallel: rule cleanly serves its first leg, so you can author and shadow your panels today and flip them on when fusion lands in your region.
Primitive 3 — fallbacks and confidence cascades
Fan-out spends N× up front. A cascade spends extra only when the first answer looks wrong. After the response, on_low_confidence: evaluates signals and, if one fires, re-dispatches to a stronger destination:
- id: agent_with_safety_net
  when: task_class == "agent"
  use:
    pool: "@pool:fast"
    on_low_confidence:
      signals: [patch_invalid, self_doubt, next_turn_test_failed]
      threshold: { low_logprob: -1.5 }
      use:
        model: "claude-opus-4-8"
        reasoning_effort: "high"
The signals: patch_invalid (the diff fails git apply --check), self_doubt (a hedging-phrase regex set), low_logprob (mean token logprob under threshold, where the provider exposes it), and next_turn_test_failed (a cross-turn latch: this turn's prompt carries the shape of last turn's failing tests). Cascades are depth-1 by design. Pair them with agent_state.models_tried to get diversity on retry: never send the repair to the model that just failed.
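Two of these signals are simple enough to sketch directly. The code below is an illustration under assumptions: the hedging phrases are invented examples (the gateway's actual regex set is not given here), and fired_signals and should_escalate are hypothetical names. The -1.5 threshold mirrors the example rule.

```python
import re
from typing import Optional

# Hypothetical hedging phrases, for illustration only.
SELF_DOUBT = re.compile(
    r"\b(?:i'?m not sure|i am not sure|might be wrong|not certain)\b",
    re.IGNORECASE,
)


def fired_signals(text: str, token_logprobs: Optional[list[float]],
                  low_logprob_threshold: float = -1.5) -> list[str]:
    """Return the names of the confidence signals that fire for a response."""
    fired = []
    if SELF_DOUBT.search(text):
        fired.append("self_doubt")
    # Logprobs are only available where the provider exposes them.
    if token_logprobs:
        mean = sum(token_logprobs) / len(token_logprobs)
        if mean < low_logprob_threshold:
            fired.append("low_logprob")
    return fired


def should_escalate(fired: list[str], models_tried: list[str],
                    stronger_model: str) -> bool:
    """Escalate once, and never back to a model that already failed."""
    return bool(fired) and stronger_model not in models_tried


signals = fired_signals("I'm not sure this patch compiles.", [-0.2, -0.4])
print(signals, should_escalate(signals, ["gpt-5.5"], "claude-opus-4-8"))
```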
Tuning the dial: cost, latency, quality
The same DSL expresses all three objectives; you choose per rule:
Cost: use delegate: cheapest, keep the cheap model on the easy tail, and reserve fan-out for difficulty > 0.7. Table B's cheap panel (~64.5%, about a point under Fable 5 solo at ~65.5%) is the existence proof: a fusion of small models can come close to a frontier model at a fraction of the per-token cost. Be clear-eyed, though: fusion uses the "bill every leg" model, so a 3-leg best_of_n panel bills three candidates plus the judge. The economics work because you (a) only fan out on the hard minority of requests and (b) fuse cheaper members than the frontier model you're replacing.
Latency: set arbiter: { strategy: first } with a tight max_latency_ms to get the fastest of N with a hard ceiling.
Quality: use best_of_n for open-ended work and tests_pass when there's a suite to ground on. samples and thinking_budget_tokens buy more within a single leg.
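The "bill every leg" arithmetic can be written down as a blended cost per request. The sketch below uses made-up relative cost units purely to show the shape of the calculation; none of the numbers are real prices, and blended_cost is a hypothetical helper.

```python
import math


def blended_cost(hard_fraction: float, cheap_cost: float,
                 leg_costs: list[float], judge_cost: float = 0.0) -> float:
    """Expected cost per request when only the hard tail fans out.

    Easy requests pay for one cheap model; hard requests pay for every
    leg of the panel plus the judge (if the arbiter uses one).
    """
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be between 0 and 1")
    panel_cost = sum(leg_costs) + judge_cost
    return (1.0 - hard_fraction) * cheap_cost + hard_fraction * panel_cost


# Placeholder units, not prices: cheap model = 1, three legs = 3 + 2 + 2,
# judge = 3, and 20% of traffic routed to the panel.
cost = blended_cost(0.2, cheap_cost=1.0, leg_costs=[3.0, 2.0, 2.0], judge_cost=3.0)
print(round(cost, 2))
```

With these placeholder units the panel costs 10 per hard request, yet the blended figure stays close to the cheap model's cost because fan-out applies to only one request in five.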
Operating it without breaking prod
Routing changes are scary, so the DSL ships with the safety rails an SRE expects:
Lint on every save — schema, CEL type-check (every when: must evaluate to bool), ref resolution, knob ranges, header/param denylists. Errors come back as {line, column, message, rule} and render as gutter chips in the editor.
Dry-run — POST a synthetic request (task_class, difficulty, agent_state, …) and get back the matched rule, the resolved effect, and the eval time before anything ships.
Shadow mode — for 24 h after the first save the DSL is evaluated but not used; a shadow log records would-be picks and the console shows a diff (percent of routes changed, projected daily cost delta, per-rule fire counts).
Canary — a 0–100 traffic slider. Ramp 5 → 25 → 50 → 100 watching per-slice metrics; roll back by sliding to 0.
Audit + rollback — every save/rollback writes an audit row in the same transaction; concurrent edits get a 409 with the current version so you retry against fresh state.
Test cases, trace replay, and an AI "explain this ruleset" view round it out. You find it in the dashboard under routing → strategy → DSL.
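One common way to implement a percentage canary is deterministic hash bucketing, sketched below. This is an assumption about mechanism, not a description of how OrcaRouter assigns traffic: in_canary is a hypothetical name, and the choice of bucketing key (request ID, workspace, token) is left open.

```python
import hashlib


def in_canary(key: str, percent: int) -> bool:
    """Deterministically place `key` inside or outside a 0-100% canary."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0..99
    return bucket < percent


keys = [f"req-{i}" for i in range(1000)]
share = sum(in_canary(k, 25) for k in keys) / len(keys)
print(f"{share:.0%} of sample keys fall in a 25% canary")
```

Because each key maps to a fixed bucket, ramping 5 → 25 → 50 → 100 only ever adds traffic to the canary, and sliding back to 0 removes all of it.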
A complete ruleset
Cheap on easy, mid on medium, a judged fusion panel on the hard agentic tail, with a confidence cascade underneath:
version: 1
rules:
  - id: trivial
    when: difficulty < 0.3 && !request.has_tools
    use: { model: "gemini-3-flash" }
  - id: standard
    when: difficulty < 0.7
    use:
      model: "gpt-5.5"
      on_low_confidence:
        signals: [self_doubt, low_logprob]
        use: { model: "claude-opus-4-8", reasoning_effort: "high" }
  - id: hard_agent_panel
    when: difficulty >= 0.7 && task_class == "agent"
    use:
      parallel:
        - { model: "anthropic/claude-opus-4-8", reasoning_effort: "high" }
        - { model: "openai/gpt-5.5", thinking_budget_tokens: 16000 }
        - { model: "google/gemini-3.1-pro" }
      arbiter:
        strategy: tests_pass # execution-grounded; degrades to first-successful if no verifier is wired
        max_latency_ms: 180000
        on_disagreement: # majority-only escape hatch
          model: "claude-opus-4-8"
          reasoning_effort: "high"
default:
  delegate: balanced
That endpoint is the one that sits at the top of Table A, not because it found a better model, but because it spends the right model on the right request and fuses a panel exactly where the panel wins.
Start composing
The next jump in capability doesn't have to wait for the next checkpoint. It's a graph you can write this afternoon: route by difficulty, fan out on the hard tail, judge or test the outputs, cascade when confidence dips.
Docs: https://docs.orcarouter.ai/routing/routing-dsl
UI: routing → Create router → Routing strategy → DSL (expert)
The frontier is a panel. Go build yours.
