Sliding Window Attention KVCache Size
What Is Sliding Window Attention?
Sliding Window Attention (SWA) is a local-attention pattern where each token
attends only to a bounded window of nearby previous tokens instead of the full
prefix. In autoregressive LLM decoding, a token at position t with
sliding_window = 128 can attend to roughly tokens t - 127 through t, not
every token from the beginning of the prompt.
Full Attention:
token t attends to [0, 1, 2, ..., t]
Sliding Window Attention:
token t attends to [max(0, t - window + 1), ..., t]
This changes the attention cost from growing with the full sequence length to growing with the window size. The tradeoff is that an SWA layer cannot directly read information outside its local window. Models that use SWA often mix local SWA layers with occasional Full Attention layers, or rely on deep layer stacking, to keep longer-range information flowing through the network.
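Here is the window rule above as a small Python sketch; attended_range is a hypothetical helper written for this note, not an API from any library:

def attended_range(t: int, window: int | None) -> range:
    # Token indices that token t attends to during causal decode.
    if window is None:
        # Full Attention: the entire prefix plus the token itself.
        return range(0, t + 1)
    # Sliding Window Attention: at most `window` tokens, ending at t.
    return range(max(0, t - window + 1), t + 1)

# Token 1000 with sliding_window=128 attends to tokens 873..1000.
assert attended_range(1000, 128) == range(873, 1001)
assert len(attended_range(1000, None)) == 1001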
For KVCache, SWA matters because keys and values outside the active window are not needed by that SWA layer during decode. If the inference engine supports sliding-window KV eviction or bounded block allocation, SWA layers can use fewer cached token slots than Full Attention layers.
KVCache Size
KVCache size scales with both the number of attention layers and the number of cached token slots. For a typical attention layer, the KVCache size can be approximated as:
2 (K and V) * cached_token_slots * num_kv_heads * head_dim * dtype_size
If an implementation allows the value head size (head_size_v) to differ from
the key head size (head_size), replace 2 * head_dim with
head_size + head_size_v. For the whole model, sum this value over every layer
that owns KVCache.
total_kv_cache_bytes
~= sum(layer_cached_token_slots * per_token_kv_bytes_per_layer)
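Here is that arithmetic as a minimal Python sketch; the function and parameter names are illustrative, not taken from any inference engine:

def kv_cache_bytes_per_layer(cached_token_slots: int,
                             num_kv_heads: int,
                             head_dim: int,
                             dtype_size: int = 2) -> int:
    # Factor 2 = one K tensor plus one V tensor per cached token slot.
    return 2 * cached_token_slots * num_kv_heads * head_dim * dtype_size

# Example: 1000 cached tokens, 8 KV heads, head_dim 128, fp16 (2 bytes)
# -> 2 * 1000 * 8 * 128 * 2 = 4,096,000 bytes (~3.9 MiB) for one layer.
assert kv_cache_bytes_per_layer(1000, 8, 128) == 4_096_000

# Whole model: sum over every layer that owns KVCache.
layer_slots = [1000] * 32
total_kv_cache_bytes = sum(
    kv_cache_bytes_per_layer(s, 8, 128) for s in layer_slots)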
Full Attention layers can attend to all previous tokens during decode, so the number of cached token slots scales with sequence length.
Sliding Window Attention (SWA) layers only need the last window of tokens for attention computation. In an SWA-only model, every attention layer uses the same bounded context window, so the reduction ratio is roughly:
sliding_window / sequence_length
The ratio is independent of layer count, but the absolute cache size still scales linearly with the number of layers.
For example:
input_length = 1000
sliding_window = 128
num_layers = 32
Full Attention units ~= 32 * 1000 = 32000 layer-token units
SWA units ~= 32 * 128 = 4096 layer-token units
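The same numbers as a quick Python check of the ratio; the min() guard, added here, covers prompts shorter than the window:

input_length = 1000
sliding_window = 128
num_layers = 32

full_units = num_layers * input_length                       # 32000
swa_units = num_layers * min(sliding_window, input_length)   # 4096
assert swa_units / full_units == sliding_window / input_length  # 0.128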
For a hybrid model that mixes Full Attention and SWA layers, calculate the cache by attention type:
total_layer_token_units
~= full_attention_layers * sequence_length
+ sliding_window_layers * sliding_window
For example, if 16 of 32 layers use Full Attention and 16 use SWA, with
input_length=1000 and sliding_window=128:
16 * 1000 + 16 * 128 = 18048 layer-token units
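The hybrid formula as a runnable sketch; the helper name is made up for this note:

def hybrid_layer_token_units(full_layers: int, swa_layers: int,
                             seq_len: int, window: int) -> int:
    # An SWA layer never needs more slots than the sequence actually has.
    return full_layers * seq_len + swa_layers * min(window, seq_len)

assert hybrid_layer_token_units(16, 16, 1000, 128) == 18048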
vLLM Implementation Notes
The source paths below are relative to the local vLLM checkout used for this note.
vllm/model_executor/layers/attention/attention.py
vllm/v1/kv_cache_interface.py
vllm/v1/core/kv_cache_utils.py
vllm/config/vllm.py
In vLLM, Attention.get_kv_cache_spec() returns SlidingWindowSpec when
self.sliding_window is set; otherwise it returns FullAttentionSpec.
FullAttentionSpec computes maximum memory usage from the number of blocks
needed for max_model_len, multiplied by page_size_bytes.
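A simplified reading of that computation in Python; a sketch under the stated assumptions, not vLLM's actual code:

import math

def full_attention_max_bytes(max_model_len: int,
                             block_size: int,
                             page_size_bytes: int) -> int:
    # Blocks needed to hold max_model_len token slots, one page per block.
    num_blocks = math.ceil(max_model_len / block_size)
    return num_blocks * page_size_bytes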
SlidingWindowSpec accounts for chunked prefill and block alignment. Its
admission bound is approximately:
max_blocks_per_request
= ceil(min(sliding_window - 1 + max_num_batched_tokens, max_model_len) / block_size)
+ 1
Therefore, if decode processes one new token at a time with
sliding_window=128 and block_size=16, vLLM may reserve up to 9 blocks, or
roughly 144 token slots, because of block alignment. This is still smaller than
a Full Attention cache for a 1000-token prompt.
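The bound above as a runnable check; the helper is written for this note rather than taken from vLLM, and max_model_len=4096 is an arbitrary example value:

import math

def swa_max_blocks_per_request(sliding_window: int,
                               max_num_batched_tokens: int,
                               max_model_len: int,
                               block_size: int) -> int:
    tokens = min(sliding_window - 1 + max_num_batched_tokens, max_model_len)
    return math.ceil(tokens / block_size) + 1  # +1 for block alignment slack

# Decode adds one token per step, so the new-token term is 1 here:
assert swa_max_blocks_per_request(128, 1, 4096, 16) == 9  # 9 * 16 = 144 slots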
In vLLM, the hybrid KV cache manager must be enabled for SWA layers to drop
KVCache outside the window. When --disable-hybrid-kv-cache-manager is active,
unify_hybrid_kv_cache_specs() converts SlidingWindowSpec to
FullAttentionSpec for hybrid models. In that mode, computation still uses
sliding window attention, but the KVCache memory reduction is not applied.
When --kv-transfer-config is set, vLLM disables the hybrid KV cache manager by
default, so verify whether the actual allocation behaves like Full Attention
when using a KVConnector.