Sakana Fugu: When Orchestrating Models Beats Owning One

A Model That Conducts, Instead of Competes

Every story in AI this year has quietly pointed in the same direction: the edge is moving from the model to the layer around it. Sakana AI's newly launched Fugu (and the heavier Fugu Ultra) is the most literal expression of that idea yet: a system that, by Sakana's account, beats frontier models by conducting them, without training a frontier model of its own.

Fugu isn't a bigger LLM. It's a small (~7B) model trained to do one thing well: take a task, decide which strong model in a pool (Gemini 3.1 Pro, Claude Opus 4.8, GPT-5.5) should handle each part, dispatch the work, and synthesize one answer. It can even call itself recursively on long tasks: run, read its own previous output, and revise. Two tiers ship behind one OpenAI-compatible API (subscriptions from ~$20): regular Fugu, which balances speed and quality for everyday work, and Fugu Ultra, which trades latency for a wider expert pool on hard, multi-step work.
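The route, dispatch, synthesize, and revise loop described above can be sketched in a few lines. This is a toy illustration, not Sakana's implementation: the pool entries, the keyword-based `route` rule, and the `orchestrate` function are all hypothetical stand-ins, and the "models" are stubs that simply tag the prompt.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]

def make_stub(name: str) -> Model:
    """Build a fake pool model that just tags the prompt with its name."""
    return lambda prompt: f"[{name}] {prompt}"

# Hypothetical pool; a real one would wrap provider API clients.
POOL: Dict[str, Model] = {
    "coder": make_stub("coder"),
    "reasoner": make_stub("reasoner"),
}

def route(subtask: str) -> str:
    """Toy keyword rule standing in for a learned routing policy."""
    return "coder" if "code" in subtask.lower() else "reasoner"

def orchestrate(subtasks: List[str], rounds: int = 1) -> str:
    """Route each subtask, dispatch it, and join the results.

    With rounds > 1, each pass also sees the previous draft, which
    mimics the run / read own output / revise loop.
    """
    draft = ""
    for _ in range(rounds):
        parts = []
        for sub in subtasks:
            prompt = sub if not draft else f"{sub} (previous draft: {draft})"
            parts.append(POOL[route(sub)](prompt))
        draft = " | ".join(parts)  # naive synthesis: concatenate
    return draft

print(orchestrate(["write code for a parser", "explain the tradeoffs"]))
```

In a real orchestrator the router would be a trained model and the synthesis step would be another model call; the point here is only the shape of the control flow.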

The Claimed Results (and the Asterisks)

On Sakana's own benchmarks the orchestrator is convincing — it reportedly beats Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 on ten of eleven tests:

SWE-Bench Pro: Fugu Ultra 73.7 vs Opus 4.8 69.2, GPT-5.5 58.6, Gemini 3.1 Pro 54.2.
Humanity's Last Exam: Fugu Ultra 50.0, edging Opus 4.8 (49.8).
GPQA-D: 95.5, top of the field (both Fugu tiers).
Regular vs Ultra: curiously, on some tasks (SciCode, banking, long-context) the regular Fugu beats Ultra.
MRCRv2: the only loss, where GPT-5.5 (94.8) stays ahead of Fugu Ultra (93.6).

Two honest caveats matter as much as the numbers. First, these are vendor-reported results with no independent verification yet. Second, Anthropic's strongest models (Fable 5 and Mythos) are not in Fugu's pool because they aren't publicly available, so Fugu matches the frontier by orchestrating the models it can reach, not the absolute best ones. Treat the leaderboard as a claim, not a fact.

The Real Insight: Resilience as a Feature

The most interesting part of Sakana's pitch isn't the benchmarks; it's the framing of orchestration as insurance. Because Fugu routes across multiple providers, when one model gets restricted, rate-limited, repriced, or pulled, Fugu can reroute to the rest of the pool. Sakana explicitly markets this against the risk of access disappearing overnight due to export or regulatory changes.

That is a procurement argument, not a research one — and it lands with anyone running AI in production. Single-vendor dependence is a real operational risk: an outage, a price hike, a policy change, or a deprecated model can break a workflow you depend on. A routing layer turns "our AI went down because OpenAI did" into "it failed over."
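A minimal sketch of that failover behavior, assuming each provider is wrapped in a plain function that raises on failure. `ProviderError`, `call_with_failover`, and the two fake providers are hypothetical names for illustration, not part of any vendor SDK.

```python
from typing import Callable, List, Tuple

class ProviderError(Exception):
    """Raised by a provider wrapper on outage, rate limit, or revoked access."""

Provider = Tuple[str, Callable[[str], str]]

def call_with_failover(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, answer)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember why it failed, move on
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Fake providers for demonstration only.
def down(prompt: str) -> str:
    raise ProviderError("503 unavailable")

def healthy(prompt: str) -> str:
    return f"answer to: {prompt}"

print(call_with_failover("summarize this lease", [("primary", down), ("backup", healthy)]))
```

The ordered list doubles as a policy: the first entry is the preferred provider, and everything after it is the insurance.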

What It Means for Real Estate and PropTech

Fugu is a consumer/developer product, but the pattern is exactly what we build into PropTech platforms. The lesson for operators: you don't need to own or train the best model — you need to orchestrate the right ones for each job.

Quality-and-cost routing. Send routine lease-summary or classification work to a cheap model; escalate ambiguous valuation or contract-risk questions to a frontier model. Pay for power only when the task needs it.
Vendor resilience. A multi-provider layer means a single model's outage or price change doesn't take your leasing assistant or maintenance triage offline.
Best-tool-per-task. One model is better at code/structured extraction, another at long-context documents, another at reasoning — routing beats betting the platform on a single provider.
It compounds the rest. This is the same throughline as the agentic orchestration layer, self-improving harnesses, and AI teammates: the durable value is in the orchestration and governance, not any one model.
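The quality-and-cost routing idea from the list above could look roughly like this. Everything here is an assumption made for illustration: the tier names, the relative costs, the task labels, and the 0.5 ambiguity threshold are invented, and a production router would likely use a learned classifier instead of a fixed rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_call: float  # hypothetical relative cost, not a real price

CHEAP = Tier("cheap-model", 1.0)
FRONTIER = Tier("frontier-model", 20.0)

# Hypothetical labels for work that is safe to send to the cheap tier.
ROUTINE_TASKS = {"lease_summary", "classification"}

def pick_tier(task_type: str, ambiguity: float) -> Tier:
    """Escalate when the task is not routine or its ambiguity score is high."""
    if task_type in ROUTINE_TASKS and ambiguity < 0.5:
        return CHEAP
    return FRONTIER

for task, score in [("lease_summary", 0.1), ("valuation", 0.2), ("classification", 0.9)]:
    print(task, "->", pick_tier(task, score).name)
```

Defaulting to the frontier tier for anything unrecognized is a deliberate choice in this sketch: a misrouted hard task costs more than an overpaid easy one.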

The Tradeoffs You Inherit

Orchestration is not a free lunch, and the honest version says so:

Latency and cost. Calling several frontier models and synthesizing is slower and can be more expensive per query than a single call. Worth it for hard tasks; wasteful for easy ones — which is exactly why the routing has to be smart.
Data governance multiplies. Every provider you route to is another place your data goes. For sensitive tenant or financial data this is a real concern — and the mirror image of the self-hosted open-weight approach. Orchestration buys capability and resilience; self-hosting buys sovereignty. Most serious platforms end up doing both, deliberately.
Unverified claims. Until third parties reproduce the benchmarks, treat "beats the frontier" as marketing. The architecture is the takeaway, not the scoreboard.
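One way to keep the data-governance tradeoff deliberate is to filter the pool by data sensitivity before any routing happens. The sketch below assumes a hand-maintained allowlist; the sensitivity labels and provider names are hypothetical placeholders.

```python
from typing import Dict, List, Set

# Hypothetical policy: which providers may see which class of data.
ALLOWED: Dict[str, Set[str]] = {
    "public": {"provider_a", "provider_b", "self_hosted"},
    "tenant_pii": {"self_hosted"},
}

def eligible_providers(sensitivity: str, pool: List[str]) -> List[str]:
    """Return the pool members cleared for this sensitivity, in pool order."""
    allowed = ALLOWED.get(sensitivity, set())  # unknown labels get no providers
    return [p for p in pool if p in allowed]

pool = ["provider_a", "provider_b", "self_hosted"]
print(eligible_providers("public", pool))
print(eligible_providers("tenant_pii", pool))
```

Failing closed on unknown labels means a mislabeled record is blocked instead of being sent to every provider in the pool.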

Frequently Asked Questions

What is Sakana Fugu? A multi-model orchestration system from Japan's Sakana AI. A small (~7B) model routes each task across a pool of frontier LLMs (Gemini 3.1 Pro, Claude Opus 4.8, GPT-5.5), can call itself recursively, and synthesizes a single answer — exposed through one OpenAI-compatible API as Fugu and Fugu Ultra.

Does Fugu beat GPT-5.5 and Opus 4.8? Sakana reports it wins on 10 of 11 benchmarks, but the results are vendor-reported and not yet independently verified, and Anthropic's top models (Fable 5, Mythos) aren't in its pool. Read it as a strong claim, not settled fact.

Why would a business use an orchestrator instead of one model? Cost/quality routing, best-tool-per-task, and resilience — a multi-provider layer fails over instead of going down when one model is restricted, repriced, or deprecated. The cost is added latency, complexity, and broader data exposure.

The Takeaway

Whether or not Fugu's exact numbers hold, it makes the year's quiet thesis loud: orchestration is becoming a frontier capability in its own right. For real estate operators, that's liberating — frontier-grade outcomes no longer require owning a frontier model, just the engineering to route, govern, and synthesize across the ones that exist. That orchestration-and-governance layer is what we build — so your platform gets the upside without betting everything on a single model or vendor.