vLLM KVCache Analysis
vLLM Settings
Local source paths
The source paths below are relative to the local vLLM checkout used for this note.
- vllm/engine/arg_utils.py
- vllm/config/cache.py
- vllm/config/scheduler.py
- vllm/distributed/kv_events.py
- vllm/ray/ray_env.py
- Environment variables
  - PYTHONHASHSEED=<int>
    - Included in the Ray worker environment allowlist.
    - Set it when deterministic Python hash behavior is required across scheduler and worker processes.
- Scheduler
  - --enable-chunked-prefill
- Cache
  - --enable-prefix-caching
  - --block-size <size>
    - If omitted, CacheConfig.DEFAULT_BLOCK_SIZE is used before platform/backend alignment.
  - --prefix-caching-hash-algo <algo>
    - Specifies the hash algorithm.
    - sha256: default; uses Pickle serialization and SHA-256.
    - sha256_cbor: uses canonical CBOR serialization and SHA-256.
    - xxhash: uses Pickle serialization and xxHash.
    - xxhash_cbor: uses canonical CBOR serialization and xxHash.
- vLLM
  - --kv-events-config <configJson>
    - Configuration for the KVEvent publisher.
    - enable_kv_cache_events: true
    - publisher: zmq
    - endpoint: tcp://<host>:<port>
    - topic: <topic>
  - --kv-transfer-config '{"<option>": "<value>", ...}'
    - <option>
      - kv_connector: <KVConnectorBaseV1Impl>
        - Specifies the KVConnector implementation.
        - e.g. NixlConnector, MultiConnector
      - kv_role: <role>
        - Specifies one of kv_producer, kv_consumer, or kv_both.
      - kv_connector_module_path: <module>
        - Specifies the module path when using a KVConnector implementation that is not included in vLLM.
        - e.g. lmcache.integration.vllm.lmcache_connector_v1
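The two JSON-valued flags above are easy to get wrong when quoting by hand. The following is a minimal sketch that builds them with json.dumps. The key names come from the list above; the host, port, topic, and the chosen connector and role are placeholder values, and build_args is a hypothetical helper, not part of vLLM.

```python
import json
import shlex

# Keys mirror the options listed above. The endpoint and topic values
# are placeholders chosen for illustration only.
kv_events_config = {
    "enable_kv_cache_events": True,
    "publisher": "zmq",
    "endpoint": "tcp://127.0.0.1:5557",
    "topic": "kv-events",
}

# NixlConnector and kv_both are example values taken from the list above.
kv_transfer_config = {
    "kv_connector": "NixlConnector",
    "kv_role": "kv_both",
}


def build_args(events: dict, transfer: dict) -> list[str]:
    """Hypothetical helper: return the flag/value pairs as an argv list."""
    return [
        "--enable-prefix-caching",
        "--kv-events-config", json.dumps(events),
        "--kv-transfer-config", json.dumps(transfer),
    ]


args = build_args(kv_events_config, kv_transfer_config)

# shlex.join quotes each JSON value so it survives a POSIX shell.
print(shlex.join(args))
```

Passing the list directly to a process launcher avoids shell quoting entirely; the shlex.join output is only useful when the command has to be pasted into a shell.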
Sliding Window Attention
References
The general KVCache sizing discussion for Sliding Window Attention lives in the science-domain KV Cache documentation. Keep this vLLM page focused on vLLM configuration, connector lifecycle, and prefill scheduling flow.
KVConnectorBase_V1
References
Local source paths
The source paths below are relative to the local vLLM checkout used for this note.
- vllm/distributed/kv_transfer/kv_connector/v1/base.py
- vllm/v1/core/sched/scheduler.py
- vllm/v1/worker/gpu/kv_connector.py
- vllm/v1/worker/kv_connector_model_runner_mixin.py
- vllm/model_executor/layers/attention/kv_transfer_utils.py
- vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py
A single vLLM instance is structured as follows.
Runtime Flow
Worker-Side Hooks
- register_kv_caches()
  - Called during worker connector initialization with the per-layer KV cache tensors.
  - Used by connectors that need to pre-register KV buffers, such as NIXL-style transfers.
- register_cross_layers_kv_cache()
  - Optional setup hook for a single cross-layer KV cache tensor whose first dimension is num_layers.
  - Used only when the connector prefers cross-layer blocks and the model has a uniform layer layout.
- set_host_xfer_buffer_ops()
  - Provides xPU-specific host/device copy operations for connectors that transfer through host buffers.
- bind_connector_metadata()
  - Called before model execution with scheduler-built metadata.
  - The metadata drives runtime KV load and save behavior on the worker.
- handle_preemptions()
  - Called before start_load_kv() in the worker pre-forward path.
  - Lets async-save connectors preserve preempted requests or evicted blocks before they are overwritten.
- start_load_kv()
  - Worker-side method called from the forward context before the forward pass.
  - Starts loading KV from the connector into vLLM's paged KV buffer, potentially asynchronously.
- wait_for_layer_load()
  - Worker-side method called from inside attention layers.
  - Blocks until the KV for a specific layer has been loaded.
  - This enables layer-by-layer pipelining when start_load_kv() is asynchronous.
- save_kv_layer()
  - Worker-side method called from inside attention layers.
  - Starts saving a layer of paged KVCache from vLLM to the connector, potentially asynchronously.
- wait_for_save()
  - Worker-side method called as the forward context exits.
  - Blocks until all save operations are complete, preventing paged KV buffer overwrite before async saves finish.
- get_finished()
  - Called with finished request IDs after model execution.
  - Returns request IDs whose asynchronous send/save or receive/load operations have completed.
- get_block_ids_with_load_errors()
  - Returns block IDs whose external KV loads failed.
  - The scheduler uses this to adjust computed-token state and recompute invalid blocks.
- get_kv_connector_stats()
  - Returns connector statistics collected during the last interval.
- get_kv_connector_kv_cache_events()
  - Returns worker-side KV cache events collected during model execution.
- get_handshake_metadata()
  - Returns optional out-of-band handshake metadata for P/D worker coordination.
- build_connector_worker_meta()
  - Builds worker metadata for the current engine step and sends it back through KVConnectorOutput.
- clear_connector_metadata()
  - Called after model execution to clear the metadata bound before the forward pass.
- no_forward()
  - Worker path that can run connector pre/post processing without a model forward when connector output is available independently.
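The per-step ordering of the worker-side hooks can be sketched with a toy connector that only records which hook was called. RecordingConnector and run_worker_step are hypothetical names; the class does not subclass vLLM's base class, the signatures are simplified, and placing bind_connector_metadata() ahead of handle_preemptions() is an assumption, since the list above only states that both run before the forward pass.

```python
class RecordingConnector:
    """Toy stand-in that records hook calls; not vLLM's KVConnectorBase_V1."""

    def __init__(self):
        self.calls = []

    def bind_connector_metadata(self, metadata):
        self.calls.append("bind_connector_metadata")

    def handle_preemptions(self):
        self.calls.append("handle_preemptions")

    def start_load_kv(self):
        self.calls.append("start_load_kv")

    def wait_for_layer_load(self, layer_name):
        self.calls.append(f"wait_for_layer_load({layer_name})")

    def save_kv_layer(self, layer_name):
        self.calls.append(f"save_kv_layer({layer_name})")

    def wait_for_save(self):
        self.calls.append("wait_for_save")

    def get_finished(self, finished_req_ids):
        self.calls.append("get_finished")
        # Nothing is asynchronous in this toy, so no transfer completes here.
        return set()

    def clear_connector_metadata(self):
        self.calls.append("clear_connector_metadata")


def run_worker_step(connector, metadata, layer_names, finished_req_ids):
    """Hypothetical driver for one engine step on the worker."""
    connector.bind_connector_metadata(metadata)  # ordering vs. next call is assumed
    connector.handle_preemptions()
    connector.start_load_kv()
    for name in layer_names:
        connector.wait_for_layer_load(name)  # block until this layer's KV is loaded
        # The attention computation for this layer would run here.
        connector.save_kv_layer(name)        # may start an async save
    connector.wait_for_save()                # forward context exits
    finished = connector.get_finished(finished_req_ids)
    connector.clear_connector_metadata()
    return finished


connector = RecordingConnector()
done = run_worker_step(connector, {}, ["layer0", "layer1"], {"req-1"})
print("\n".join(connector.calls))
```

The per-layer pair of wait_for_layer_load() and save_kv_layer() is what allows loads and saves to overlap with computation when the connector works asynchronously.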
Scheduler-Side Hooks
- get_num_new_matched_tokens()
  - Scheduler-side method that returns how many additional prefix tokens are available from external KVCache beyond the locally computed tokens.
  - May return None when the connector needs more time and the scheduler should query again later.
  - The returned boolean indicates whether the external KV load will happen asynchronously.
- update_state_after_alloc()
  - Scheduler-side method called after the KV cache manager allocates blocks for a request.
  - Receives the request, allocated blocks, and the number of external tokens that will be loaded.
  - May be called twice for an async-load request: once for connector-token allocation and again after load completion when additional blocks are allocated.
- build_connector_meta()
  - Scheduler-side method that builds metadata for the current scheduler step.
  - The base API requires this method not to mutate scheduler_output; calling it resets connector-side scheduler state.
- on_new_request()
  - Optional scheduler hook called when a new request is added.
- update_connector_output()
  - Called when the scheduler receives KVConnectorOutput from worker-side connectors.
  - Carries finished send/receive sets, invalid block IDs, worker metadata, stats, and KV events back into scheduler-side connector state.
- request_finished()
  - Scheduler-side method called exactly once before finished request blocks are freed.
  - A connector can return True to take responsibility for asynchronous block freeing.
- request_finished_all_groups()
  - HMA-aware variant used when a request has block IDs from multiple KV cache groups.
  - MultiConnector forwards this to connectors that implement SupportsHMA.
- take_events()
  - Returns connector-collected KV events since the last call.
- reset_cache()
  - Optional connector cache-reset hook used when scheduler reset paths need connector state cleanup.
- shutdown()
  - Called during shutdown so connectors can finish async operations and release resources.
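The delayed-free contract of request_finished() can be illustrated with a toy scheduler. ToyScheduler and AsyncSaveConnector are hypothetical, and the hook is reduced to a single boolean return value here; this is a sketch of the idea that blocks stay allocated until the connector's send is reported complete, not the scheduler's actual code.

```python
class AsyncSaveConnector:
    """Toy connector that always asks the scheduler to delay freeing."""

    def request_finished(self, request_id, block_ids):
        # Returning True means: the connector still needs these blocks
        # for an in-flight save/send, so do not free them yet.
        return True


class ToyScheduler:
    """Hypothetical scheduler fragment tracking only block ownership."""

    def __init__(self, connector, blocks):
        self.connector = connector
        self.blocks = dict(blocks)   # request_id -> list of block ids
        self.pending_free = set()    # requests waiting for a finished send
        self.freed = []              # block ids returned to the pool

    def finish(self, request_id):
        # Called once per finished request, before its blocks are freed.
        delay = self.connector.request_finished(
            request_id, self.blocks[request_id]
        )
        if delay:
            self.pending_free.add(request_id)
        else:
            self._free(request_id)

    def on_connector_output(self, finished_sending):
        # Free blocks only for requests whose send has been reported done.
        for request_id in finished_sending:
            if request_id in self.pending_free:
                self.pending_free.discard(request_id)
                self._free(request_id)

    def _free(self, request_id):
        self.freed.extend(self.blocks.pop(request_id))


scheduler = ToyScheduler(AsyncSaveConnector(), {"req-1": [10, 11]})
scheduler.finish("req-1")
freed_before_send = list(scheduler.freed)
scheduler.on_connector_output({"req-1"})
print(freed_before_send, scheduler.freed)
```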
Prefill
Local source paths
The source paths below are relative to the local vLLM checkout used for this note.
- vllm/v1/core/sched/scheduler.py
- vllm/v1/core/kv_cache_manager.py
- vllm/v1/core/kv_cache_coordinator.py
- vllm/v1/core/single_type_kv_cache_manager.py
- vllm/v1/core/block_pool.py
- vllm/v1/worker/gpu/kv_connector.py
- vllm/v1/worker/gpu_model_runner.py
- vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py
- vllm/distributed/kv_transfer/kv_connector/v1/nixl/scheduler.py
Warning
Functions that are called but do not affect the KVCache flow are omitted. The sequence may be slightly inaccurate because unnecessary steps were removed.
Assumptions:
- LMCacheConnector is used.
- NixlConnector is used.
- Prefix cache is enabled.
- The scheduler path includes local prefix-cache lookup first, then external KV lookup through the connector.
- When the connector loads KV asynchronously, the request moves to WAITING_FOR_REMOTE_KVS and its blocks are cached after the transfer result is processed.
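The lookup order and the async transition in the assumptions above can be sketched as a small decision function. Status, lookup_computed_tokens, and both lookup callables are hypothetical; RUNNING is used here only as a label for "can be scheduled for compute in this step", and the real scheduler tracks far more state.

```python
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()
    WAITING_FOR_REMOTE_KVS = auto()
    RUNNING = auto()  # illustrative label only


def lookup_computed_tokens(prompt_len, local_lookup, external_lookup):
    """Return (status, num_local, num_external) for one scheduling attempt.

    local_lookup(prompt_len) -> int
    external_lookup(prompt_len, num_local) -> (int | None, bool)
    """
    # 1. Local prefix cache is consulted first.
    num_local = local_lookup(prompt_len)

    # 2. The connector is then asked what it can add beyond the local hit.
    num_external, load_async = external_lookup(prompt_len, num_local)

    if num_external is None:
        # Connector cannot answer yet: leave the request waiting this step.
        return Status.WAITING, num_local, 0
    if load_async and num_external > 0:
        # KV arrives in the background; no tokens are computed this step.
        return Status.WAITING_FOR_REMOTE_KVS, num_local, num_external
    return Status.RUNNING, num_local, num_external


result = lookup_computed_tokens(
    100,
    lambda n: 32,               # pretend 32 tokens hit the local prefix cache
    lambda n, local: (48, True) # pretend 48 more tokens load asynchronously
)
print(result)
```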
- get_computed_blocks(): finds local prefix-cache hits only when prefix caching is enabled and the request is allowed to read from prefix cache. A full prompt hit still recomputes the last token so logits can be produced.
- get_num_new_matched_tokens(): asks the connector how many additional tokens are externally computed and whether their KV must be loaded asynchronously. If the connector cannot answer yet, the scheduler skips the request for this step.
- num_new_tokens: for normal prefill, starts from request.num_tokens - (num_local_computed_tokens + num_external_computed_tokens) and is then capped by long_prefill_token_threshold, the current token budget, encoder-input scheduling, and Mamba block-aligned splitting when those features apply. For async KV load, vLLM sets num_new_tokens = 0 and allocates slots only for externally computed tokens.
- allocate_slots(): handles local computed blocks, externally computed tokens, new tokens, lookahead slots, encoder tokens, full-sequence admission, skipped-block removal for sliding-window attention, and delayed caching for async KV load.
- update_state_after_alloc(): passes the request, the currently allocated blocks, and num_external_computed_tokens to the connector so it can decide the load or save plan for this scheduler step.
- WAITING_FOR_REMOTE_KVS: async KV load moves the request to this state. After KVConnectorOutput.finished_recving arrives, the scheduler caches successfully received blocks and promotes the request back to WAITING or PREEMPTED.
- update_from_output(): consumes KVConnectorOutput, including invalid_block_ids, finished_recving, and finished_sending. Invalid blocks reduce the computed-token count so the affected portion can be recomputed.
- request_finished(): is called only when a request actually finishes. Before the connector receives the block IDs, vLLM removes skipped blocks. The connector may return delay_free_blocks=True; in that case the scheduler frees blocks later, after finished_sending is reported.
- free_blocks(): adds a block to free_block_queue when block.ref_cnt is 0.
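The num_new_tokens arithmetic described above can be checked with a small worked example. This is a simplified sketch: num_new_prefill_tokens is a hypothetical function, the encoder-input and Mamba block-aligned caps are left out, and treating a threshold of 0 as "disabled" is an assumption made for this example.

```python
def num_new_prefill_tokens(
    num_tokens: int,
    num_local_computed_tokens: int,
    num_external_computed_tokens: int,
    token_budget: int,
    long_prefill_token_threshold: int = 0,
    load_kv_async: bool = False,
) -> int:
    """Hypothetical helper mirroring the num_new_tokens description."""
    if load_kv_async:
        # Async KV load: nothing is computed this step; slots are
        # allocated only for the externally computed tokens.
        return 0

    num_new = num_tokens - (
        num_local_computed_tokens + num_external_computed_tokens
    )

    # Assumption: a threshold of 0 means the long-prefill cap is disabled.
    if 0 < long_prefill_token_threshold < num_new:
        num_new = long_prefill_token_threshold

    # Never schedule more tokens than the step's remaining budget.
    return min(num_new, token_budget)


# 1000-token prompt, 256 tokens cached locally, 512 available externally.
uncapped = num_new_prefill_tokens(1000, 256, 512, token_budget=2048)
by_threshold = num_new_prefill_tokens(
    1000, 256, 512, token_budget=2048, long_prefill_token_threshold=128
)
by_budget = num_new_prefill_tokens(1000, 256, 512, token_budget=100)
async_load = num_new_prefill_tokens(
    1000, 256, 512, token_budget=2048, load_kv_async=True
)
print(uncapped, by_threshold, by_budget, async_load)  # 232 128 100 0
```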