
Routing in LLM inference is the difference between scaling and stalling
TLDR
When running more than one vLLM replica, routing strategy has a significant impact on performance.
On a real agent-style workload, using cache-aware routing instead of default Kubernetes routing:
- Reduced average inference step time by ~50 percent
- Reduced total runtime by ~36 percent
- Reduced P95 latency from more than a minute to under 20 seconds
The model, hardware and workload were identical. The difference was how requests were routed across replicas.
This post describes the workload, the routing strategies compared and why cache awareness matters for stateful inference workloads.
Routing once inference becomes distributed
Many inference discussions focus on model architecture, kernels or hardware efficiency. These factors remain important at all scales.
However, once inference is distributed across multiple replicas, routing strategy becomes an additional determinant of performance. While requests remain independent at the protocol level, their content becomes relevant for scheduling decisions. When successive requests share context, preserving execution locality across replicas affects how much prior work can be reused.
This post examines one concrete workload where routing strategy materially changed system behavior.
Workload and experiment setup
This is not a synthetic throughput benchmark. It is an agent-style workload.
We ran SWE-style agent trajectories using devstral-2-small-2512, with tool calls, growing context windows and sustained concurrency. Each trajectory consists of many sequential inference steps that reuse prior context rather than issuing independent requests.
This type of workload stresses inference systems differently from stateless completion traffic. Requests are correlated, context reuse is uneven and performance depends heavily on how state is handled across steps.
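To make the request pattern concrete, here is a minimal sketch of what one trajectory's inference loop looks like against an OpenAI-compatible vLLM endpoint. It is illustrative only: the endpoint URL, step cap, tool schema and run_tool helper are placeholders, not the harness used in the experiment.

```python
# Illustrative sketch of one agent trajectory (not the harness used in the experiment).
# Each step appends to the same message history, so successive requests share a
# growing prefix: exactly the property that makes KV cache locality matter.
from openai import OpenAI

client = OpenAI(base_url="http://vllm-service:8000/v1", api_key="EMPTY")  # placeholder endpoint

SYSTEM_PROMPT = "..."  # large system prompt shared by every trajectory
TOOL_SCHEMAS = [{      # placeholder tool definition
    "type": "function",
    "function": {"name": "run_shell", "parameters": {"type": "object", "properties": {}}},
}]
MAX_STEPS = 40         # illustrative cap on steps per trajectory

def run_tool(call):    # placeholder tool executor
    return "tool output"

messages = [{"role": "system", "content": SYSTEM_PROMPT}]
for _ in range(MAX_STEPS):
    response = client.chat.completions.create(
        model="devstral-2-small-2512",
        messages=messages,      # full history: earlier steps form a shared prefix
        tools=TOOL_SCHEMAS,
    )
    message = response.choices[0].message
    messages.append(message)    # the SDK accepts message objects in the history
    if not message.tool_calls:  # the agent decided it is done
        break
    # Execute tool calls and feed results back as context for the next step.
    for call in message.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool(call),
        })
```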
Configuration:
- 300 trajectories executed concurrently
- 6 independent runs per configuration
- vLLM as the inference backend
- Either 1 or 3 vLLM replicas
- Identical model weights and hardware in all cases
The single-replica setup serves as a baseline where cache locality is implicit. The three-replica setup introduces distributed state, making routing decisions relevant to whether context is reused efficiently or repeatedly reconstructed.
Each run consists of thousands of inference steps with skewed context lengths and non-uniform arrival patterns, which are typical of agent and chat workloads but uncommon in stateless benchmarks.
Routing strategies compared
Every system routes requests in some way, even if only through default Kubernetes services.
In this experiment, we compared two configurations:
- Default Kubernetes ClusterIP routing, which distributes requests without regard to request content or model state. In practice, request placement is influenced by client-side connection reuse and does not attempt to preserve execution locality.
- Cache-aware routing implemented in the inference router. For each request, the router attempts to select the replica with the best KV cache match. If that replica is overloaded, the router falls back to a less-loaded one (a simplified sketch of this policy appears below).
This means the comparison is between a worst-case routing strategy that is unaware of both request content and cache state, and a best-case strategy that combines cache awareness with basic load protection.
No standalone load-aware or least-request-only routing baseline was evaluated in this experiment.
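As a rough illustration of the cache-aware policy, the sketch below approximates it as: score each replica by the longest token prefix it shares with requests it recently served, and fall back to the least-loaded replica when the best match is saturated. The names and thresholds are invented; this is a simplified model of the idea, not the implementation of the router used in the experiment.

```python
# Simplified sketch of cache-aware replica selection (not the production router).
# Each replica is scored by the longest prefix it shares with the incoming request;
# if the best-matching replica is overloaded, fall back to the least-loaded one.
from dataclasses import dataclass, field


@dataclass
class Replica:
    url: str
    in_flight: int = 0                                     # current number of active requests
    recent_prefixes: list = field(default_factory=list)    # token prefixes recently served


def shared_prefix_len(a: list, b: list) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(replicas: list, request_tokens: list, max_in_flight: int = 32) -> Replica:
    # Best KV cache match: the replica whose recent traffic shares the longest prefix.
    def cache_score(r: Replica) -> int:
        return max((shared_prefix_len(p, request_tokens) for p in r.recent_prefixes), default=0)

    best = max(replicas, key=cache_score)
    if best.in_flight < max_in_flight:
        return best
    # Fallback: the best match is overloaded, so pick the least-loaded replica instead.
    return min(replicas, key=lambda r: r.in_flight)
```

In a real router, the cache signal would come from the engine's actual KV cache state rather than a list of recent prefixes, but the selection logic follows the same shape: prefer reuse, protect against overload.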
Three replicas: Observed impact
With three vLLM replicas, the difference between routing strategies is substantial.
With cache-aware routing:
- Average inference step time: 4.8 seconds
- P90 latency: 9.6 seconds
- P95 latency: 17 seconds
- Total inference time across all steps: ~93,000 seconds
With default Kubernetes routing:
- Average inference step time: 9.0 seconds
- P90 latency: 15.2 seconds
- P95 latency: 70.8 seconds
- Total inference time across all steps: ~183,000 seconds
Average inference step time dropped by roughly 49 percent. Total runtime dropped by about 36 percent.
Same model. Same GPUs. Same traffic. Only routing changed.
Interpreting latency distributions
Median latency differs little between configurations.
This is expected. Median values are relatively insensitive to queueing effects and cache misses.
The primary difference appears in the tail.
With default routing, P95 latency increases sharply due to repeated cold prefills and uneven load distribution across replicas. Individual slow steps propagate through agent trajectories and extend overall runtime.
With cache-aware routing, repeated prefills are avoided more often, leading to tighter latency distributions and lower tail latency.
Scaling replicas without cache-aware routing makes things worse
Scaling from one to three replicas without cache-aware routing does not improve performance in this workload.
Average step time increases and tail latency worsens. The primary cause is cache fragmentation: context for a given trajectory is repeatedly reconstructed on different replicas, increasing prefill cost.
This illustrates that horizontal scaling alone is insufficient for stateful inference workloads.
What the router actually changes
Traditional L7 load balancers can preserve affinity at the connection or client level. However, they do not inspect request content or model-specific state.
The router used in this experiment makes routing decisions based on request content and model execution state, allowing it to route requests preferentially to replicas that are more likely to reuse existing KV cache.
Its primary effect is on where related requests execute over time.
For agent-style workloads, multiple sequential requests often share large portions of context. When routing preserves locality, later steps can reuse KV cache produced by earlier ones. When locality is lost, each step pays the full prefill cost again.
In the cache-aware configuration, requests belonging to the same trajectory are more likely to be scheduled on replicas that already hold relevant context. In the default Kubernetes configuration, requests are distributed without regard to prior execution history.
This difference affects how often replicas perform expensive prefills, how decode batches form and how queueing behavior evolves under load.
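A toy model makes the cost of lost locality concrete. The token counts below are invented, and the worst case assumes no reuse at all; in practice the shared system prompt is often still cached, as discussed in the next section.

```python
# Toy model: prefill tokens processed per trajectory with and without locality.
# Numbers are illustrative, not measurements from the experiment.
system_prompt = 4_000   # tokens shared by every request
step_growth = 1_500     # new tokens added per agent step (tool output, reasoning)
steps = 20

# Locality preserved: each step only prefills the tokens added since the previous step,
# because the earlier prefix is already in the replica's KV cache.
with_locality = system_prompt + steps * step_growth

# Locality lost: a step landing on a cold replica must prefill its entire context.
without_locality = sum(system_prompt + i * step_growth for i in range(1, steps + 1))

print(f"prefill tokens with locality:    {with_locality:>9,}")
print(f"prefill tokens without locality: {without_locality:>9,}")
print(f"ratio: {without_locality / with_locality:.1f}x more prefill work")
```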
Cache behavior as the dominant factor
KV cache reuse is the primary driver behind the observed performance differences.
In this experiment, all requests share a large system prompt, so some cache reuse occurs even without cache-aware routing. However, routing strategy significantly affects how much additional context is reused across steps.
Across three replicas:
- With cache-aware routing, approximately 95 percent of tokens were served from cache.
- Without cache-aware routing, approximately 62 percent of tokens were served from cache.
This difference translates directly into reduced prefill work and lower total inference time. When cache locality is preserved, decode dominates execution for many steps. When locality is lost, repeated prefills dominate instead.
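A rough way to read these numbers: the share of prompt tokens that still needs prefill compute is roughly one minus the hit rate, so the two configurations differ by several times in per-prompt prefill work. The back-of-the-envelope calculation below assumes prefill cost scales linearly with uncached tokens and ignores batching and scheduling effects.

```python
# Back-of-the-envelope: relative prefill work implied by the measured cache hit rates.
# Assumes prefill cost is proportional to uncached prompt tokens.
hit_rate_cache_aware = 0.95
hit_rate_default = 0.62

uncached_aware = 1 - hit_rate_cache_aware    # ~5% of prompt tokens need prefill
uncached_default = 1 - hit_rate_default      # ~38% of prompt tokens need prefill

print(f"relative prefill work (default / cache-aware): {uncached_default / uncached_aware:.1f}x")
# -> roughly 7.6x more prefill tokens processed per prompt under default routing
```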
Cache behavior therefore materially affects both latency and total compute consumption for this class of workload.
Why the single-replica case looks different
With a single vLLM replica, routing strategy has little impact.
All requests necessarily execute on the same process, and cache locality is implicit. As a result, the performance characteristics of the single-replica setup are largely determined by model behavior and request mix rather than routing.
The role of routing becomes visible once inference is distributed. At that point, preserving locality across replicas determines whether scaling improves or degrades performance.
The system-level takeaway
Inference performance does not scale linearly with replica count for stateful workloads.
Once requests carry context across steps, replicas are no longer interchangeable. Routing becomes part of the execution path, not just an infrastructure concern.
Different routing strategies form a spectrum, from connection-based and round-robin approaches, to load-aware scheduling, to cache- and memory-aware routing. The results in this post show that ignoring cache locality can negate the benefits of horizontal scaling.
This is why routing decisions disproportionately affect tail latency and total runtime rather than median throughput.
Limitations and next steps
Cache-aware routing alone is not sufficient.
KV cache is constrained by GPU memory, and under sustained load or large contexts, eviction is unavoidable once the working set exceeds device capacity. Routing cannot prevent this on its own.
Handling these scenarios may require taking additional signals into account during routing, such as memory pressure or cache eviction behavior. These aspects were not evaluated in this experiment.
We will cover cache eviction and memory-aware routing in a follow-up post.
If this problem looks familiar
If you are running agent or chat workloads on open models and recognize these failure modes, this is exactly the class of system that Token Factory is built for.
We work with teams to profile real traffic, identify execution-path bottlenecks, design serving stacks that hold under worst-case inputs and tune models and inference pipelines together rather than in isolation.
You can explore Token Factory at tokenfactory.nebius.com
We are not trying to make benchmarks look better; we are trying to make systems behave predictably when users show up.