KVGate
Multimodal Inference Gateway
Open source · Benchmarked on GPUs

Cutting multimodal LLM latency with KV-cache offload and prefix-aware routing

An open-source, OpenAI-compatible inference gateway in front of a vLLM fleet, benchmarked on a 7B vision-language model.

1.84×
Lower tail latency
Tail latency (TTFT p95) versus round-robin routing
2.0×
Lower TTFT
Time-to-first-token with CPU KV offload under memory pressure
98.6%
Routing-affinity hit rate
Requests routed to a replica with the cache already warm
+65%
Peak throughput gain
Throughput under the tightest memory cap
Experiment 1 · 2-GPU fleet

Prefix-aware routing vs round-robin

Two Llava-OneVision-7B replicas, one per GPU. Each replica's GPU KV cache is capped so the 12 distinct 1024×1024 images cannot all stay resident. 120 requests, concurrency 8.

TTFT (ms, lower is better)

                          p50     p95     p99
round_robin (KV-blind)    568     2783    3642
prefix_kv_aware           434     1516    2662
TTFT p95: 2783 → 1516 ms (1.84× lower) · Throughput +14% · 98.6% routing-affinity hit rate · Balanced load: 73 / 71

Round-robin scatters each image across both replicas, so each one keeps re-prefilling roughly 6.5k vision tokens. Prefix-aware keeps each image's KV resident on a single replica, and a load guard preserves balance so popular images do not pile onto one replica.
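The routing idea can be sketched in a few lines of Python. This is a minimal illustration, not KVGate's actual implementation: the class name `PrefixAwareRouter`, the `max_imbalance` threshold, and the choice to count in-flight requests as the load signal are all assumptions made for the example.

```python
from collections import defaultdict


class PrefixAwareRouter:
    """Toy sticky router: send a prefix to the replica that served it before,
    unless that replica is too far ahead of the least-loaded one."""

    def __init__(self, replicas, max_imbalance=4):
        self.replicas = list(replicas)
        self.max_imbalance = max_imbalance  # hypothetical load-guard threshold
        self.owner = {}                     # prefix key -> replica believed warm
        self.in_flight = defaultdict(int)   # replica -> requests currently running

    def pick(self, key):
        # Least-loaded replica; ties resolve to the first replica in the list.
        coolest = min(self.replicas, key=lambda r: self.in_flight[r])
        warm = self.owner.get(key)
        if warm is None:
            # First sighting of this prefix: place it on the least-loaded replica.
            chosen = coolest
            self.owner[key] = chosen
        elif self.in_flight[warm] - self.in_flight[coolest] <= self.max_imbalance:
            # Warm replica is within the guard: keep affinity.
            chosen = warm
        else:
            # Load guard trips: spill this one request, keep the original owner.
            chosen = coolest
        self.in_flight[chosen] += 1
        return chosen

    def done(self, replica):
        # Call when a request finishes so the load signal stays current.
        self.in_flight[replica] -= 1
```

With `max_imbalance=1` and two replicas, three back-to-back requests for the same key go to the first replica twice and then spill once to the second, which is the behavior the load guard is meant to provide.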

Experiment 2 · single GPU

LMCache CPU KV offload under GPU memory pressure

Llava-OneVision-7B, 40 distinct 1024×1024 images, 80 requests. The GPU KV cache is capped to force the working set to overflow GPU memory, the regime CPU offload is built for.

Time-to-first-token (TTFT ms, lower is better)

                   KV cap 3072    KV cap 2560
vLLM cache only    855            1426
+ LMCache          422            902

Throughput (req/s)

                   KV cap 3072    KV cap 2560
vLLM cache only    1.18           0.92
+ LMCache          1.37           1.52
Up to 2.0× lower TTFT · +65% throughput at the tightest cap

These gains appear only under memory pressure. At the tightest cap, vLLM's own prefix-cache hit rate fell to ~41.0% (thrashing) while CPU offload kept the multimodal KV warm. When the working set fits in GPU memory, the engine's own cache suffices and there is no gain.
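Why a cache that is only slightly too small can thrash is easy to see in a toy model. The sketch below is not vLLM's eviction logic: it assumes a plain LRU cache measured in whole images, and it assumes the 40 images are requested twice in the same cyclic order (the request order is not stated above).

```python
from collections import OrderedDict


def lru_hit_rate(capacity, accesses):
    """Fraction of accesses served from an LRU cache holding `capacity` items."""
    cache = OrderedDict()
    hits = 0
    for item in accesses:
        if item in cache:
            hits += 1
            cache.move_to_end(item)        # mark as most recently used
        else:
            cache[item] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used item
    return hits / len(accesses)


# Assumed workload: 40 distinct images, each requested twice, in cyclic order.
accesses = list(range(40)) * 2

fits = lru_hit_rate(capacity=40, accesses=accesses)    # working set fits
thrash = lru_hit_rate(capacity=30, accesses=accesses)  # working set overflows
```

In this toy, `fits` is 0.5 (every repeat is a hit) and `thrash` is 0.0 (each image is evicted just before it is needed again). Real hit rates depend on block-level caching and request order, but the cliff is the same one a second cache tier is meant to remove.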

Experiment 3 · KV offload hierarchy

Where does the offloaded KV live? CPU vs Redis

Llava-OneVision-7B, 40 distinct 1024×1024 images, 80 requests, with the GPU KV cache capped to force overflow. Three LMCache configurations, identical workload.

Config                                          TTFT p50   TTFT p95   Throughput   Memory used (proof)
GPU only (baseline), recompute on evict         1393 ms    2583 ms    1.11 req/s   none
+ LMCache → CPU RAM, local_cpu: true            249 ms     412 ms     2.33 req/s   CPU RAM +35 GB
+ LMCache → Redis (direct), local_cpu: false    1424 ms    2686 ms    1.08 req/s   Redis +4.7 GB
CPU offload TTFT
5.6× lower
CPU offload throughput
2.1× higher
KV stored in Redis
+4.7 GB
LMCache retrieve
22.1 GB/s

CPU offload cut TTFT p50 by 5.6× (1393 ms to 249 ms) and roughly doubled throughput (1.11 to 2.33 req/s), at a cost of about 35 GB of CPU RAM; LMCache logged store and retrieve rates of 13 to 22 GB/s. Redis stored 4.7 GB of KV remotely, confirming that the cross-host tier works. On loopback Redis is not a latency win (TTFT and throughput stay at baseline levels); its value is capacity and cross-host sharing.
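The headline ratios follow directly from the measurements reported for the three configurations. A short check, with the dictionary keys and the `speedup` helper being names invented for this example:

```python
# Measurements from the three-configuration comparison above.
results = {
    "gpu_only":      {"ttft_p50_ms": 1393, "ttft_p95_ms": 2583, "req_per_s": 1.11},
    "lmcache_cpu":   {"ttft_p50_ms": 249,  "ttft_p95_ms": 412,  "req_per_s": 2.33},
    "lmcache_redis": {"ttft_p50_ms": 1424, "ttft_p95_ms": 2686, "req_per_s": 1.08},
}


def speedup(candidate, metric, baseline="gpu_only", lower_is_better=True):
    """Improvement factor of `candidate` over `baseline`; above 1.0 means better."""
    b = results[baseline][metric]
    c = results[candidate][metric]
    return b / c if lower_is_better else c / b


cpu_ttft = speedup("lmcache_cpu", "ttft_p50_ms")                        # about 5.6
cpu_tput = speedup("lmcache_cpu", "req_per_s", lower_is_better=False)   # about 2.1
redis_ttft = speedup("lmcache_redis", "ttft_p50_ms")                    # about 0.98
```

A factor just under 1.0 for Redis is what "not a latency win on loopback" looks like in numbers: p50 TTFT is about 2% worse than recomputing on the GPU.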

How it works

Architecture & stack

Client: OpenAI-compatible request (text + image)
KVGate gateway: hashes the prefix, including image bytes, and picks a warm replica
vLLM replica fleet: GPU prefix cache + LMCache CPU KV offload
FastAPI · vLLM 0.11 · LMCache 0.3.7 · Redis · Prometheus · Grafana · Next.js 14 · Docker