Cutting an LLM bill ~40–60% by moving steady traffic off the cloud API

A mid-size SaaS analytics company

AI/ML · MLOps · On-prem/Edge · Cloud · Hybrid

Context

A mid-size SaaS analytics company had shipped an LLM-powered feature that customers genuinely liked. Success was the problem: as adoption grew, their per-token cloud API bill grew right alongside it — faster than revenue, and with no ceiling in sight. They came to us not to build something new, but because the unit economics of a feature they'd already shipped were quietly turning sour.

Challenge

The feature worked well, so "make it cheaper by making it worse" was off the table: quality had to hold. The bill was a linear per-token meter that scaled with every customer interaction, and the workload underneath it was largely steady-state: high, fairly predictable volume on a task that didn't require frontier-model capability. That is the textbook profile where per-token pricing is the wrong fit, because the metered rate stays roughly flat per token as volume grows while the per-token cost of a well-utilized GPU falls. The team had built on the API out of habit and had never run the comparison. They needed someone to tell them honestly whether self-hosting was actually cheaper for their workload, and then to build it without a quality regression or a risky big-bang cutover.

Approach

We started with the math, not a sales pitch. We profiled their actual traffic — token volumes, the shape of input vs output, how steady it really was, where the peaks landed — and benchmarked open-weights models against their existing API output on their specific tasks, using their own examples as the quality bar. The question we set out to answer was concrete: at this volume and this required quality, where is the crossover point, and are they past it?

They were, comfortably, for steady-state traffic. But not for everything — bursts and a tail of harder requests were better left on the cloud. So the recommendation wasn't "rip out the API." It was a hybrid: serve the predictable baseline on self-hosted open-weights models where the economics are flat, and keep the cloud API for burst overflow and the harder tail where it still earns its cost.
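
The split between baseline and burst can be pictured with a small sketch. It assumes demand is summarized as one number per hour and that self-hosted capacity is a single fixed ceiling; the function name `split_traffic` and the sample demand values are hypothetical, not the client's traffic.

```python
def split_traffic(hourly_demand: list[float], capacity: float) -> dict[str, float]:
    """Split demand into a self-hosted baseline and cloud overflow.

    Each hour, self-hosted serves up to `capacity`; anything above spills to cloud.
    """
    if capacity <= 0 or not hourly_demand or sum(hourly_demand) <= 0:
        raise ValueError("capacity and total demand must be positive")
    served_local = [min(d, capacity) for d in hourly_demand]
    overflow = [max(d - capacity, 0.0) for d in hourly_demand]
    total = sum(hourly_demand)
    return {
        "local_share": sum(served_local) / total,
        "cloud_share": sum(overflow) / total,
        # Fraction of the provisioned self-hosted capacity that is actually used.
        "utilization": sum(served_local) / (capacity * len(hourly_demand)),
    }


# Placeholder profile: mostly steady load with one spike.
demand = [80.0, 90.0, 100.0, 100.0, 250.0, 100.0]
print(split_traffic(demand, capacity=100.0))
print(split_traffic(demand, capacity=250.0))
```

In this made-up profile, sizing for the steady level (100) keeps the self-hosted fleet about 95% utilized and sends about 21% of traffic to the cloud, whereas sizing for the peak (250) drops utilization to 48%. That gap is why the spike is cheaper to rent from the API than to provision for.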

We were candid about where cloud was still the right answer. That honesty is the point — the goal was their lowest total cost at acceptable quality, not the maximum amount of infrastructure for us to build.

Architecture

The result is a hybrid routing architecture that sends each request to the cheapest endpoint that can serve it well.

Self-hosted baseline: an open-weights model served on rented GPUs with continuous batching, sized to keep utilization high on the steady-state traffic. High occupancy is what makes the per-token cost collapse, so capacity was planned against real load tests, not guesses.
Cloud burst and tail: the existing cloud API stays wired in as overflow for spikes (so we never provision self-hosted capacity for the peak) and as the path for the harder request tail.
Routing layer: a router directs traffic by load and request type — baseline to self-hosted, spillover and hard cases to cloud — presenting one stable interface to the application.
Quality gates and monitoring: continuous comparison against the quality bar on their tasks, plus cost and latency monitoring, so any drift in quality or economics is visible.
A cost-vs-latency playbook: we handed over the decision logic — when to add a GPU, when to lean on cloud burst, how to re-run the break-even as volume changes — so the team owns the trade-off going forward.
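
The routing rule described above can be sketched in a few lines. This is a simplified illustration, not the production router: the names `Router`, `Request`, `Endpoint`, the `is_hard` flag, and the in-flight counter used as a load signal are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Endpoint(Enum):
    SELF_HOSTED = "self_hosted"
    CLOUD = "cloud"


@dataclass
class Request:
    prompt: str
    is_hard: bool = False  # hypothetical flag set by an upstream rule or classifier


class Router:
    """Send baseline traffic to self-hosted; spill overflow and hard cases to cloud."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight  # assumed self-hosted concurrency ceiling
        self.in_flight = 0

    def route(self, request: Request) -> Endpoint:
        if request.is_hard:
            return Endpoint.CLOUD          # harder tail stays on the cloud API
        if self.in_flight >= self.max_in_flight:
            return Endpoint.CLOUD          # burst overflow
        self.in_flight += 1
        return Endpoint.SELF_HOSTED

    def release(self) -> None:
        """Call when a self-hosted request finishes."""
        self.in_flight = max(0, self.in_flight - 1)


router = Router(max_in_flight=2)
for req in [Request("a"), Request("b"), Request("c"), Request("d", is_hard=True)]:
    print(req.prompt, router.route(req).value)
```

Because the application only ever calls `route`, the thresholds behind it can change (an extra GPU raises `max_in_flight`, a stricter hardness rule sends more to cloud) without touching the calling code.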

This is the deploy-anywhere thesis made concrete: the right answer was neither pure cloud nor pure self-hosting, but a measured mix — and the value was in knowing where the line sat.

Results

Inference spend down ~40–60% versus the all-cloud-API baseline.
Quality parity on the client's tasks — measured against their own examples, not assumed.
A hybrid architecture that keeps cloud elasticity for spikes while taking the flat economics on steady traffic.
A cost-vs-latency playbook the team owns, so they can re-run the break-even as their volume and traffic shape evolve.
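
A quality-parity check of the kind referred to above can be expressed as a simple gate. The sketch assumes each example has already been scored on a 0 to 1 scale by some task-specific metric; the scoring method, the function name `passes_quality_gate`, and the tolerance value are hypothetical and not taken from the engagement.

```python
from statistics import mean


def passes_quality_gate(
    api_scores: list[float],
    candidate_scores: list[float],
    tolerance: float = 0.02,
) -> bool:
    """Return True if the candidate's mean score is within `tolerance` of the API's.

    Scores must be paired: index i in both lists refers to the same example.
    """
    if not api_scores or len(api_scores) != len(candidate_scores):
        raise ValueError("score lists must be non-empty and the same length")
    return mean(candidate_scores) >= mean(api_scores) - tolerance


# Placeholder scores for three examples.
print(passes_quality_gate([0.90, 0.80, 0.85], [0.88, 0.82, 0.85]))
print(passes_quality_gate([0.90, 0.80, 0.85], [0.70, 0.80, 0.80]))
```

A mean-based gate is the simplest possible choice; a stricter variant might also cap how many individual examples are allowed to regress.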

Stack

Open-weights LLM on rented GPUs · vLLM-style continuous-batched inference · cloud LLM API (burst + tail) · request-routing layer · quality-eval harness · cost and latency monitoring · hybrid (self-hosted + cloud) infrastructure.

This is our core differentiator: we'll tell you honestly when the cloud is the wrong answer, and when it isn't. If your inference bill is outgrowing your traffic, see the deploy work we do, or read the worked break-even math behind a decision like this.
