LLM Inference Optimization: The Line Item That Decides If Your AI Ships

Training Gets the Headlines. Inference Gets the Bill.

If you run LLMs in production, inference is almost certainly your biggest AI line item. Training a frontier model is a one-time spectacle; serving one is a meter that runs 24/7, on every request, forever. The gap between naive and optimized serving is not marginal — it is routinely 5-10x in cost and 3-5x in latency. A single unoptimized 70B model can cost north of $100/hour on high-end GPUs; optimized for equivalent throughput, that can drop to roughly $15-20/hour.

For any company shipping AI features, including a real-estate operator running the same workflow across a portfolio of thousands of units, this is the difference between a feature that's economically viable and one that quietly gets cut. Inference optimization is not a research nicety; it's what moves a model from a notebook demo to production economics.

5-10x lower cost, optimized
80-90% GPU utilization, batched
3-5x lower latency

The Counterintuitive Bottleneck: Memory, Not Compute

The first thing to internalize: during token generation, LLM inference is memory-bandwidth bound, not compute bound. An H100 SXM has ~3.35 TB/s of memory bandwidth but ~989 TFLOPS of dense FP16 Tensor Core compute, so it can perform roughly 300 floating-point operations in the time it takes to read one byte. During autoregressive decoding (generating one token at a time) each weight read from memory is used for only a handful of operations per sequence, so the GPU spends most of its time waiting for model weights and KV-cache data to stream from memory, using only ~10-20% of its compute. Almost every optimization below attacks the same root cause: move less data, and make better use of the data you do move.
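A back-of-envelope version of that argument can be sketched as follows. The hardware figures are the ones quoted above; the 70B parameter count, FP16 weights, and the idea of treating the model as if it sat behind a single GPU's memory bus are illustrative assumptions, and KV-cache traffic is ignored.

```python
# Back-of-envelope roofline for one decode step.
# Hardware figures come from the text; everything else is an assumption.
PEAK_FLOPS = 989e12          # FP16 FLOP/s
MEM_BW = 3.35e12             # bytes/s

PARAMS = 70e9                # assumed: 70B-parameter dense model
BYTES_PER_PARAM = 2          # assumed: FP16 weights


def decode_step_times(batch_size: int) -> tuple[float, float]:
    """Return (memory_time_s, compute_time_s) for one decode step.

    Each step streams every weight once (shared by the whole batch) and
    performs about 2 FLOPs per parameter per sequence in the batch.
    KV-cache traffic is ignored to keep the sketch small.
    """
    memory_time = PARAMS * BYTES_PER_PARAM / MEM_BW
    compute_time = 2 * PARAMS * batch_size / PEAK_FLOPS
    return memory_time, compute_time


# FLOPs the GPU can perform in the time it takes to load one byte.
balance_point = PEAK_FLOPS / MEM_BW

mem_t, comp_t = decode_step_times(batch_size=1)
print(f"balance point: {balance_point:.0f} FLOPs per byte")
print(f"batch=1: memory {mem_t * 1e3:.1f} ms, compute {comp_t * 1e3:.2f} ms per step")
```

Under these assumptions a single-sequence step spends about 42 ms reading weights and about 0.14 ms on arithmetic. The weight read is paid once per step regardless of batch size, which is why batching is the main way to raise compute utilization.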

1. Tame the KV Cache

The Key-Value cache is the most important concept in inference optimization. Every new token attends to all previous tokens; the KV cache stores those key/value projections so you don't recompute the whole sequence each step. The problem is size — for a 70B-class model, the KV cache for a few thousand tokens across a modest batch can hit tens of gigabytes, often more than the (quantized) weights themselves. Three levers:

PagedAttention (vLLM). Instead of reserving contiguous memory for the max sequence length, it allocates the KV cache in fixed-size blocks mapped through a per-request block table, an idea borrowed from OS virtual-memory paging. The vLLM authors report memory waste falling from 60-80% to under 4%, letting you serve 2-3x more concurrent requests on the same GPU. It's now the default in most serving stacks.
Prefix caching. When many requests share a prefix — a long system prompt, few-shot examples, or the same retrieved documents in a RAG pipeline — compute that prefix's KV cache once and reuse it. For a 2,000-token system prompt served to thousands of users, the savings are enormous.
Grouped-Query Attention (GQA). Sharing each KV head across a group of query heads (e.g. 8 KV heads for 64 query heads, an 8x reduction versus one KV head per query head) shrinks the KV cache proportionally. It's a model-architecture choice rather than a serving setting, but it's central to capacity planning.
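To make the sizes concrete, here is a rough sizing sketch. The model shape (80 layers, head dimension 128, FP16 cache), the block size of 16 tokens, and the list of request lengths are all assumptions chosen for illustration; the functions are hypothetical helpers, not part of any serving framework.

```python
import math


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def contiguous_reservation(seq_lens, max_len, per_token):
    """Reserve max_len token slots per request up front, regardless of actual length."""
    return len(seq_lens) * max_len * per_token


def paged_reservation(seq_lens, block_size, per_token):
    """Allocate fixed-size blocks on demand; only a sequence's last block is partly empty."""
    blocks = sum(math.ceil(n / block_size) for n in seq_lens)
    return blocks * block_size * per_token


# Assumed shape for a 70B-class model: 80 layers, head_dim 128, FP16 cache.
mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)   # one KV head per query head
gqa = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)    # 8 shared KV heads

seq_lens = [300, 1200, 650, 2048, 90, 512, 1800, 400]   # hypothetical live requests
used = sum(seq_lens) * gqa
contig = contiguous_reservation(seq_lens, max_len=4096, per_token=gqa)
paged = paged_reservation(seq_lens, block_size=16, per_token=gqa)

GIB = 1024 ** 3
print(f"per token: MHA {mha / 1024:.0f} KiB, GQA {gqa / 1024:.0f} KiB")
print(f"contiguous: {contig / GIB:.2f} GiB reserved, {1 - used / contig:.0%} wasted")
print(f"paged:      {paged / GIB:.2f} GiB reserved, {1 - used / paged:.0%} wasted")
```

With these numbers the contiguous scheme reserves space for 32,768 tokens while only 7,000 are in use, and the paged scheme rounds each request up to the next multiple of 16 tokens. Real allocators also track block sharing and eviction, which this sketch leaves out.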

2. Continuous Batching

Static batching wastes the GPU: the whole batch waits for the slowest sequence to finish. Continuous (in-flight) batching swaps finished sequences out and new ones in at each step, keeping the GPU saturated. This is the single biggest throughput win — it's what takes utilization from a sad 20-30% to 80-90% and is why frameworks like vLLM, SGLang, and TensorRT-LLM exist.
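The scheduling difference can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: every decode step costs the same, prefill and memory limits are ignored, and the output lengths are made up. It is not how any particular framework implements its scheduler.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes.

    Returns (total_steps, busy_slot_steps).
    """
    steps = busy = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        steps += max(batch)      # short sequences idle until the longest is done
        busy += sum(batch)
    return steps, busy


def continuous_batching_steps(lengths, batch_size):
    """Refill a slot as soon as its sequence finishes.

    Returns (total_steps, busy_slot_steps).
    """
    queue = deque(lengths)
    running = []                 # remaining tokens for each in-flight sequence
    steps = busy = 0
    while queue or running:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        busy += len(running)
        # One decode step: every running sequence emits one token;
        # sequences that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps, busy


lengths = [20, 400, 35, 50, 380, 25, 60, 45]   # hypothetical output lengths in tokens
for name, fn in (("static", static_batching_steps), ("continuous", continuous_batching_steps)):
    steps, busy = fn(lengths, batch_size=4)
    print(f"{name}: {steps} steps, slot utilization {busy / (steps * 4):.0%}")
```

On this toy workload the static schedule needs 780 steps and keeps about a third of its slots busy, while the continuous schedule finishes in 400 steps with nearly two thirds busy. The gap widens as output lengths become more uneven.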

3. Quantization

Quantization shrinks weights (and sometimes the KV cache) from FP16 to FP8, INT8, or INT4. Less data to move through memory directly lowers latency and cost, and a smaller memory footprint means bigger batches or cheaper GPUs. Weight-only methods such as AWQ and GPTQ, and hardware-native FP8, preserve most quality, but there is a tradeoff: measure quality on your own tasks rather than trusting a benchmark. For many production workloads, 8-bit is nearly free and 4-bit is an acceptable trade for the cost savings.
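The basic trade can be seen in a minimal sketch. Note that this uses plain round-to-nearest symmetric quantization with a single scale per group, which is simpler than AWQ or GPTQ; the random weight group and the 70B parameter count are assumptions for illustration.

```python
import random


def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale for the whole group."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0   # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(128)]   # hypothetical weight group

for bits in (8, 4):
    q, scale = quantize_symmetric(group, bits)
    err = mean_abs_error(group, dequantize(q, scale))
    print(f"INT{bits}: mean abs error {err:.6f}")

# Weight footprint for an assumed 70B-parameter model, ignoring scale overhead.
for name, bits in (("FP16", 16), ("INT8/FP8", 8), ("INT4", 4)):
    print(f"{name}: {70e9 * bits / 8 / 1e9:.0f} GB")
```

The reconstruction error here is only a proxy. Whether a given error level matters depends on the task, which is why evaluating on your own data is the step that counts.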

4. Speculative Decoding

A small, fast "draft" model proposes several tokens ahead; the large model verifies them in a single forward pass, keeping the longest run of proposals it agrees with and substituting its own token at the first disagreement. When the draft is good, you get multiple tokens per big-model step: meaningful latency reduction with no quality loss (the big model still has final say). It adds complexity and works best on predictable text, but it's a strong lever for latency-sensitive apps.
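The accept-and-correct loop can be sketched for the greedy case. The two "models" below are toy functions invented for this example, and the per-position calls to the target stand in for what a real system computes in one batched forward pass. Sampling-based decoding needs a probabilistic acceptance rule that is not shown here.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]   # greedy "model": context -> next token id


def speculative_step(target: NextToken, draft: NextToken, context: List[int], k: int) -> List[int]:
    """One round of greedy speculative decoding; returns the tokens accepted this round."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target verifies: keep the longest prefix it agrees with and
    #    replace the first disagreement with its own token.
    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Everything was accepted: the verification pass also yields one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 3):
    """Generate max_new tokens; returns (tokens, number_of_target_rounds)."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + max_new], rounds


def toy_target(ctx: List[int]) -> int:
    """Hypothetical large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft model: same rule, but guesses 0 at every fifth position."""
    return 0 if len(ctx) % 5 == 0 else (ctx[-1] + 1) % 10


tokens, rounds = generate(toy_target, toy_draft, prompt=[0], max_new=12)
print(tokens, rounds)
```

Because every emitted token is either confirmed or produced by the target, the output matches what the target would generate on its own under greedy decoding; the only thing that changes is how many target rounds it takes.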

5. Right-Size the Model

The cheapest token is the one you never compute on an oversized model. Route easy requests to a smaller or distilled model and reserve the frontier model for genuinely hard ones — the orchestration argument we made about model routing. And if data sovereignty or volume justifies it, a well-optimized self-hosted open-weight model can undercut per-token API pricing dramatically — but only once you've done the optimization work above; an unoptimized self-host is just an expensive way to lose money.
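As a minimal illustration of the routing idea, the sketch below picks a model tier with a crude keyword and length heuristic. The tier names, prices, cue words, and thresholds are all hypothetical placeholders; a production router would more likely use a trained classifier or a confidence signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                  # hypothetical deployment name
    usd_per_1k_tokens: float   # illustrative price, not a real quote


SMALL = ModelTier("small-8b", 0.0002)
LARGE = ModelTier("large-70b", 0.0020)

# Crude cue words that suggest a request needs the larger model (assumption).
HARD_HINTS = ("step by step", "analyze", "compare", "contract")


def route(prompt: str, max_easy_words: int = 200) -> ModelTier:
    """Send long prompts, or prompts containing a 'hard' cue, to the large model."""
    text = prompt.lower()
    if len(text.split()) > max_easy_words or any(h in text for h in HARD_HINTS):
        return LARGE
    return SMALL


def blended_cost(prompts, tokens_per_request: int = 500) -> float:
    """Total USD if every request consumes tokens_per_request tokens on its routed tier."""
    return sum(route(p).usd_per_1k_tokens * tokens_per_request / 1000 for p in prompts)


prompts = [
    "Summarize this listing in one sentence.",
    "Compare these two leases and analyze the risk.",
]
for p in prompts:
    print(f"{route(p).name}: {p}")
print(f"blended cost: ${blended_cost(prompts):.4f}")
```

The useful number to watch is the blended cost against the all-large baseline, together with a quality check on the requests that were sent to the small tier.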

What Actually Matters in Practice

Use a real serving framework. vLLM, SGLang, or TensorRT-LLM give you PagedAttention, continuous batching, prefix caching, and quantization out of the box. Hand-rolling inference in 2026 is almost never the right call.
Measure before optimizing. Profile your actual prompt/response shapes — long shared prefixes favour prefix caching; high concurrency favours batching; long outputs favour KV-cache and quantization work.
Optimize the system, not just the model. Throughput, tail latency, and cost-per-1k-tokens are the numbers the business feels — track those, not just tokens/sec on a single request.
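Those three numbers can be computed from ordinary request logs. The sketch below assumes a hypothetical log record with end-to-end latency and output token count, a fixed hourly GPU price, and a nearest-rank percentile; adapt the fields to whatever your serving stack actually records.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestLog:
    latency_s: float      # end-to-end latency of the request
    output_tokens: int    # tokens generated for the request


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile, with pct given as an integer from 0 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(logs, window_s: float, gpu_usd_per_hour: float, num_gpus: int = 1) -> dict:
    """System-level metrics over a measurement window of window_s seconds."""
    total_tokens = sum(r.output_tokens for r in logs)
    gpu_cost = gpu_usd_per_hour * num_gpus * window_s / 3600
    latencies = [r.latency_s for r in logs]
    return {
        "throughput_tok_per_s": total_tokens / window_s,
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "usd_per_1k_tokens": gpu_cost / total_tokens * 1000,
    }


# Hypothetical window: 20 requests in 60 seconds on one GPU billed at $4/hour.
logs = [RequestLog(latency_s=i * 0.1, output_tokens=100) for i in range(1, 21)]
stats = summarize(logs, window_s=60, gpu_usd_per_hour=4.0)
for key, value in stats.items():
    print(f"{key}: {value:.4f}")
```

Tracking these per window before and after each change makes it easier to tell whether an optimization helped the system as a whole or only a single-request benchmark.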

Frequently Asked Questions

Why is inference more expensive than training over time? Training is a one-time cost; inference runs on every user request, continuously, at scale. Within months of launch, serving compute typically dominates total cost of ownership.

What single change gives the biggest win? For most workloads, moving to a serving framework with continuous batching and PagedAttention (e.g. vLLM) — it can lift GPU utilization from ~20-30% to 80-90% and serve several times more concurrent requests on the same hardware.

Does quantization hurt quality? It can, but modern 8-bit methods are nearly lossless for most tasks and 4-bit is often an acceptable trade. Always validate on your own evaluation set rather than assuming a published benchmark transfers.

The Takeaway

Inference optimization is where AI economics are won or lost. The techniques are well understood — KV-cache management, continuous batching, quantization, speculative decoding, right-sizing — and together they routinely cut serving costs 5-10x. That's often the deciding factor in whether an AI feature ships at all. VSBD builds and optimizes the serving and orchestration layer behind production AI systems for PropTech and real estate platforms. If your AI roadmap is being held back by inference cost, we can help you make it economical.