KVGate · Multimodal Inference Gateway benchmarks

KVGate

Multimodal Inference Gateway

Open source · Benchmarked on GPUs

Cutting multimodal LLM latency with KV-cache offload and prefix-aware routing

An open-source, OpenAI-compatible inference gateway in front of a vLLM fleet, benchmarked on a 7B vision-language model.

Read the docs View on GitHub ↗

1.84×

Lower tail latency

Tail latency (TTFT p95) versus round-robin routing

2.0×

Lower TTFT

Time-to-first-token with CPU KV offload under memory pressure

98.6%

Routing-affinity hit rate

Requests routed to a replica with the cache already warm

+65%

Peak throughput gain

Throughput under the tightest memory cap

Experiment 1 · 2-GPU fleet

Prefix-aware routing vs round-robin

Two Llava-OneVision-7B replicas, one per GPU. Each replica's GPU KV cache is capped so the 12 distinct 1024×1024 images cannot all stay resident. 120 requests, concurrency 8.

568

434

2783

1516

3642

2662

p50

p95

p99

round_robin (KV-blind)

prefix_kv_aware

(TTFT ms, lower is better)

TTFT p95: 2783 → 1516 ms (1.84× lower)Throughput +14%98.6% routing-affinity hit rateBalanced load: 73 / 71

Round-robin scatters each image across both replicas, so each one keeps re-prefilling roughly 6.5k vision tokens. Prefix-aware keeps each image's KV resident on a single replica, and a load guard preserves balance so popular images do not pile onto one replica.

Experiment 2 · single GPU

LMCache CPU KV offload under GPU memory pressure

Llava-OneVision-7B, 40 distinct 1024×1024 images, 80 requests. The GPU KV cache is capped to force the working set to overflow GPU memory, the regime CPU offload is built for.

Time-to-first-token

855

422

1426

902

KV cap 3072

KV cap 2560

vLLM cache only

+ LMCache

(TTFT ms, lower is better)

Throughput

1.18

1.37

0.92

1.52

KV cap 3072

KV cap 2560

vLLM cache only

+ LMCache

(req/s)

Up to 2.0× lower TTFT+65% throughput at the tightest cap

These gains appear only under memory pressure. At the tightest cap, vLLM's own prefix-cache hit rate fell to ~41.0% (thrashing) while CPU offload kept the multimodal KV warm. When the working set fits in GPU memory, the engine's own cache suffices and there is no gain.

Experiment 3 · KV offload hierarchy

Where does the offloaded KV live? CPU vs Redis

Llava-OneVision-7B, 40 distinct 1024×1024 images, 80 requests, with the GPU KV cache capped to force overflow. Three LMCache configurations, identical workload.

Config	TTFT p50	TTFT p95	Throughput	Memory used (proof)
GPU only (baseline) recompute on evict	1393 ms	2583 ms	1.11 req/s	none
+ LMCache → CPU RAM local_cpu: true	249 ms	412 ms	2.33 req/s	CPU RAM +35 GB
+ LMCache → Redis (direct) local_cpu: false	1424 ms	2686 ms	1.08 req/s	Redis +4.7 GB

CPU offload TTFT

5.6× lower

CPU offload throughput

2.1× higher

KV stored in Redis

+4.6 GB

LMCache retrieve

22.1 GB/s

CPU offload cut TTFT 5.6× and doubled throughput, consuming about 35 GB of CPU RAM (LMCache logged 13 to 22 GB/s store and retrieve). Redis stored 4.7 GB of KV remotely, confirming the cross-host tier works. On loopback it is not a latency win; its value is capacity and cross-host sharing.

How it works

Architecture & stack

Client

OpenAI-compatible request (text + image)

→

KVGate gateway

hashes prefix incl. image bytes → picks warm replica

→

vLLM replica fleet

GPU prefix cache + LMCache CPU KV offload

FastAPIvLLM 0.11LMCache 0.3.7RedisPrometheusGrafanaNext.js 14Docker