Model Fusion in Production: Inside OrcaRouter Fusion and the Routing DSL

Model Fusion in Production: Inside OrcaRouter Fusion and the Routing DSL

تاريخ النشر

العودة إلى جميع المقالات

Three frontier models in parallel, one answer back. Call it in one line — or compose your own.

TL;DR. Claude Fable 5 was delisted. The answer isn't a bigger model — it's a panel: run several frontier models in parallel and let a judge return the strongest answer. OrcaRouter ships this two ways: built-in orcarouter/fusion routers you call like any model, and a Routing DSL to compose your own. This is the field guide to both — with copy-paste recipes, the five arbiters (including synthesize, the Mixture-of-Agents fuse), and how to roll it out without betting your SLA.


Part 1 — Call it in one line: the built-in Fusion routers

Fable 5 was discontinued and restricted, so it's no longer broadly callable. Fusion rebuilds that level from the models you can still call — a drop-in, OpenAI-compatible router that runs a panel of frontier models in parallel and returns the strongest answer. Three curated tiers ship in every workspace:

The three Fusion tiers (panel composition × context window)


orcarouter/fusion

Claude Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro

Context Window: 1,000,000

Best for: Fable-5 level Max intelligence


orcarouter/fusion-mini

Claude Opus 4.8 + GPT-5.5

Context Window: 1,000,000

Best for: Fable-5 level balanced inference


orcarouter/fusion-flash

Gemini 3.5 Flash + MiniMax M2.7 + GLM 5.1

Context Window: 200,000

Best for: Fable-5 level fast + cheap inference

(Context window = the smallest panel member's — the binding constraint on a fan-out.)


These aren't marketing slugs; they're pre-compiled DSL routers managed centrally. Here's the actual orcarouter/fusion program, verbatim:

version: 1
rules:
  - id: hard_panel
    when: task_class == "code" || task_class == "agent" || code_keyword_density >= 0.3 || has_tools || difficulty >= 0.3
    use:
      parallel:
        - { model: "anthropic/claude-opus-4.8" }
        - { model: "openai/gpt-5.5" }
        - { model: "google/gemini-3.1-pro-preview" }
      arbiter:
        strategy: best_of_n
        model: "anthropic/claude-opus-4.8"
        template: best_answer_v1
      max_latency_ms: 120000
default:
  delegate: balanced


Two design choices worth calling out:

It only fans out on real work. The when: gate fires the panel for code, agent, tool-using, code-dense, or high-difficulty (difficulty >= 0.3) prompts; everything else falls through to the workspace's balanced default. You pay panel price exactly where it helps, not on "hi."

The judge serves a real answer, verbatim. best_of_n runs an LLM judge (here, Opus 4.8 with the best_answer_v1 template) that picks the single strongest candidate and serves it as-is — never a diluted merge. The output is always a real model's answer.


Part 2 — Select vs. Fuse: best_of_n and the synthesize arbiter

The Fusion routers select. But OrcaRouter also ships a fuse strategy — synthesize, the Mixture-of-Agents pattern added in the routing engine (service/dispatch_parallel/synthesize.go). The difference is the whole game:

Exhibit 2 — Select vs. Fuse

best_of_n (SELECT)                         synthesize (FUSE)
 ┌─ Opus 4.8  ─┐                            ┌─ Opus 4.8  ─┐
 ├─ GPT-5.5   ─┼─► judge picks leg k        ├─ GPT-5.5   ─┼─► aggregator LLM writes
 └─ Gemini    ─┘   └─► serve leg k verbatim └─ Gemini    ─┘   ONE new fused answer
   output = a real model's answer             output = a new answer better than any leg

Recipe for true fusion:

use:
  parallel:
    - { model: "anthropic/claude-opus-4.8" }
    - { model: "openai/gpt-5.5" }
    - { model: "google/gemini-3.1-pro-preview" }
  arbiter:
    strategy: synthesize
    model: "anthropic/claude-opus-4.8"   # aggregator: fuses candidates into one new answer
    template: synthesize_v1


Honest caveats:

- Billing is N+1 — every leg bills, plus the aggregator as an extra call.

- OpenAI chat-format only in V1 — the aggregator emits an OpenAI chat completion; Claude/Gemini native clients degrade to serve-first-successful (legs still billed).

The aggregator must be in the router's authorized candidate set, or it degrades.

When to use which: best_of_n when one model's answer is likely fully right (code, factual Q&A) — you want a clean, real answer. synthesize when answers are complementary (research, analysis, long-form) and merging strengths beats any single take.


Part 3 — Build your own: the Routing DSL playbook

Don't want the curated panel? Start from the "Claude Fable 5 Level" templates in the Routing DSL editor (they ship in every workspace and mirror the Fusion routers), then specialize. Six copy-paste patterns:

1 — Ship code that actually runs → fan out, let the tests pick the winner:

- id: hard_code
  when: task_class == "code" && difficulty > 0.6
  use:
    parallel:
      - { model: "anthropic/claude-opus-4.8", thinking_budget_tokens: 16000 }
      - { model: "openai/gpt-5.5", reasoning_effort: high }
      - { model: "google/gemini-3.1-pro-preview" }
    arbiter: { strategy: tests_pass }

tests_pass is execution-grounded — it serves the candidate that passes your harness, no judge LLM needed.

2 — Stop overpaying for easy prompts → difficulty gate (the Fusion pattern, your models):

- id: easy
  when: difficulty < 0.3
  use: { delegate: cheapest }
- id: hard
  when: difficulty >= 0.3
  use:
    parallel:
      - { model: "anthropic/claude-opus-4.8" }
      - { model: "openai/gpt-5.5" }
    arbiter: { strategy: best_of_n, model: "anthropic/claude-opus-4.8", template: best_answer_v1 }

3 — Keep long agent runs on the rails → escalate only when it wobbles:

- id: agent
  when: task_class == "agent" && agent_state.consecutive_errors == 0
  use: { model: "anthropic/claude-sonnet-4.6", affinity_ttl: "5m" }
  on_low_confidence:
    signals: [next_turn_test_failed, self_doubt]
    use: { model: "anthropic/claude-opus-4.8", thinking_budget_tokens: 24000 }

4 — Make flaky outputs deterministic → vote, escalate on a split:

- id: extract
  when: task_class == "rag"
  use:
    parallel:
      - { model: "anthropic/claude-opus-4.8" }
      - { model: "openai/gpt-5.5" }
      - { model: "google/gemini-3.1-pro-preview" }
    arbiter: { strategy: majority }
    on_disagreement: { model: "anthropic/claude-opus-4.8", thinking_budget_tokens: 32000 }

5 — Beat tail latency & provider blips → race, serve the first responder:

- id: race
  when: request.stream == true && difficulty < 0.5
  use:
    parallel:
      - { model: "google/gemini-3.5-flash" }
      - { model: "minimax/minimax-m2.7" }
      - { model: "z-ai/glm-5.1" }
    arbiter: { strategy: first }

6 — Roll out without betting the SLA → shadow (evaluate alongside live traffic, log what it would pick + cost delta, serve the live pick) → canary % (dsl_canary_pct 5 → 25 → 100, crypto-random per request). Migrate on measured divergence, roll back instantly.


The cheat-sheet: five arbiters

Economics & honesty

Difficulty-gated fan-out keeps the bill flat (Illustrative; cost = real token-price math) — blended cost = easy_share × cheap + hard_share × panel:

A 70%-easy workload runs the full panel for a third of the all-panel bill.