KV cache

Letter K

Stored key-value tensors that let models skip recomputing past tokens. Drives inference cost in long-context workloads.

Updated:

Working definition

The KV cache stores key-value tensors from attention so models can skip recomputing past tokens. This is what fills up in long-context workloads. It’s also what’s driving your inference cost. Long context isn’t free. The KV cache is the receipt.