Cutting multimodal LLM latency with KV-cache offload and prefix-aware routing
An open-source, OpenAI-compatible inference gateway in front of a vLLM fleet, benchmarked on a 7B vision-language model.
Prefix-aware routing vs round-robin
Two Llava-OneVision-7B replicas, one per GPU. Each replica's GPU KV cache is capped so the 12 distinct 1024×1024 images cannot all stay resident. 120 requests, concurrency 8.
Round-robin scatters requests for each image across both replicas, so each replica keeps re-prefilling roughly 6.5k vision tokens. Prefix-aware routing keeps each image's KV resident on a single replica, and a load guard preserves balance so popular images do not pile onto one replica.
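The routing idea can be sketched in a few lines. This is a minimal illustration, not the gateway's actual code: it assumes the prefix key is the raw image bytes, that affinity comes from hashing that key, and that the load guard is a simple in-flight-request threshold. The names `PrefixAwareRouter`, `max_imbalance`, `pick`, and `done` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PrefixAwareRouter:
    """Toy prefix-aware router with a load guard (illustrative only)."""

    replicas: list[str]
    max_imbalance: int = 4  # hypothetical threshold, not a value from the benchmark
    in_flight: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Start every replica with zero in-flight requests.
        self.in_flight = {r: 0 for r in self.replicas}

    def _preferred(self, prefix_key: bytes) -> str:
        # Hash the shared prefix (assumed here: the image bytes) so the same
        # image always maps to the same replica and its KV stays resident there.
        digest = hashlib.sha256(prefix_key).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.replicas)
        return self.replicas[index]

    def pick(self, prefix_key: bytes) -> str:
        preferred = self._preferred(prefix_key)
        least = min(self.replicas, key=lambda r: self.in_flight[r])
        # Load guard: give up cache affinity when the preferred replica is
        # too far ahead of the least-loaded one.
        if self.in_flight[preferred] - self.in_flight[least] > self.max_imbalance:
            chosen = least
        else:
            chosen = preferred
        self.in_flight[chosen] += 1
        return chosen

    def done(self, replica: str) -> None:
        # Call when a request finishes so the load counters stay accurate.
        self.in_flight[replica] -= 1


router = PrefixAwareRouter(["replica-0", "replica-1"], max_imbalance=2)
first = router.pick(b"image-A")
# Repeated requests for the same image stay on one replica...
assert router.pick(b"image-A") == first
assert router.pick(b"image-A") == first
# ...until the load guard trips and spills to the other replica.
spilled = router.pick(b"image-A")
```

The threshold trades cache hits for balance: a larger `max_imbalance` keeps more requests on the warm replica, a smaller one behaves more like least-loaded routing.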
LMCache CPU KV offload under GPU memory pressure
Llava-OneVision-7B, 40 distinct 1024×1024 images, 80 requests. The GPU KV cache is capped to force the working set to overflow GPU memory, the regime CPU offload is built for.
These gains appear only under memory pressure. At the tightest cap, vLLM's own prefix-cache hit rate fell to ~41.0% (thrashing) while CPU offload kept the multimodal KV warm. When the working set fits in GPU memory, the engine's own cache suffices and there is no gain.
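A toy model shows why the benefit depends on memory pressure. It is a sketch under strong assumptions: it tracks whole images rather than KV blocks, uses plain LRU for both tiers, copies each image's KV to the CPU tier when it is first computed, and replays a hypothetical access order (every image once, then every image again). It does not reproduce the measured hit rates; `LRU` and `simulate` are illustrative names, not vLLM or LMCache APIs.

```python
from collections import OrderedDict


class LRU:
    """Minimal LRU set that tracks which image IDs are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return True
        return False

    def put(self, key):
        self.items[key] = True
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


def simulate(requests, gpu_slots, cpu_slots=0):
    """Count where each request's image KV is found (toy model)."""
    gpu, cpu = LRU(gpu_slots), LRU(cpu_slots)
    counts = {"gpu_hit": 0, "cpu_hit": 0, "recompute": 0}
    for image_id in requests:
        if gpu.get(image_id):
            counts["gpu_hit"] += 1
            continue
        if cpu_slots and cpu.get(image_id):
            counts["cpu_hit"] += 1  # reload from CPU RAM instead of re-prefilling
        else:
            counts["recompute"] += 1  # full vision prefill
            if cpu_slots:
                cpu.put(image_id)  # assumed: KV is copied to CPU once computed
        gpu.put(image_id)
    return counts


# Hypothetical access order: 40 images, each requested twice.
requests = list(range(40)) * 2

under_pressure = simulate(requests, gpu_slots=10)
with_offload = simulate(requests, gpu_slots=10, cpu_slots=40)
fits_in_gpu = simulate(requests, gpu_slots=40, cpu_slots=40)
```

In this model the GPU-only run under pressure recomputes every request, the offload run serves the second pass from the CPU tier, and when all 40 images fit on the GPU the CPU tier is never read.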
Where does the offloaded KV live? CPU vs Redis
Llava-OneVision-7B, 40 distinct 1024×1024 images, 80 requests, with the GPU KV cache capped to force overflow. Three LMCache configurations, identical workload.
| Config | TTFT p50 | TTFT p95 | Throughput | Memory used (proof) |
|---|---|---|---|---|
| GPU only (baseline), recompute on evict | 1393 ms | 2583 ms | 1.11 req/s | none |
| + LMCache → CPU RAM (local_cpu: true) | 249 ms | 412 ms | 2.33 req/s | CPU RAM +35 GB |
| + LMCache → Redis, direct (local_cpu: false) | 1424 ms | 2686 ms | 1.08 req/s | Redis +4.7 GB |
CPU offload cut p50 TTFT 5.6× (1393 ms to 249 ms) and roughly doubled throughput (1.11 to 2.33 req/s), consuming about 35 GB of CPU RAM; LMCache logged store and retrieve rates of 13 to 22 GB/s. Redis stored 4.7 GB of KV remotely, confirming the cross-host tier works. On loopback it is not a latency win, since TTFT and throughput stayed at baseline levels; its value is capacity and cross-host sharing.
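The ratios quoted here follow directly from the table. The short script below recomputes them from the published figures; the dictionary keys and the `compare` helper are names chosen for this example only.

```python
# Figures copied from the results table (TTFT in ms, throughput in req/s).
RESULTS = {
    "gpu_only": {"ttft_p50_ms": 1393, "ttft_p95_ms": 2583, "req_per_s": 1.11},
    "lmcache_cpu": {"ttft_p50_ms": 249, "ttft_p95_ms": 412, "req_per_s": 2.33},
    "lmcache_redis": {"ttft_p50_ms": 1424, "ttft_p95_ms": 2686, "req_per_s": 1.08},
}


def compare(config, baseline="gpu_only"):
    """Return speedup ratios of `config` relative to `baseline` (>1 is better)."""
    base, cand = RESULTS[baseline], RESULTS[config]
    return {
        "ttft_p50_speedup": base["ttft_p50_ms"] / cand["ttft_p50_ms"],
        "ttft_p95_speedup": base["ttft_p95_ms"] / cand["ttft_p95_ms"],
        "throughput_gain": cand["req_per_s"] / base["req_per_s"],
    }


for name in ("lmcache_cpu", "lmcache_redis"):
    summary = {k: round(v, 2) for k, v in compare(name).items()}
    print(name, summary)
```

For the CPU tier the p50 speedup rounds to 5.6× and the throughput gain to 2.1×; for Redis on loopback every ratio comes out slightly below 1, which matches the reading that it is a capacity tier rather than a latency win in this setup.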