[NPUW] Add block-based KV cache support for HFA and Pyramid attention#35014
Draft
intelgaoxiong wants to merge 1 commit into openvinotoolkit:master from
Conversation
Extend Host Flash Attention (HFA) and Pyramid attention to operate with the block-split KV cache produced by SplitKVCacheIntoBlocks.

Section 1 — Shared infrastructure:
- util.hpp/cpp: rename isPastKeyValuesKey/Value to isPastKeyParam/isPastValueParam; add isPastKeyParamContiguous / isPastValueParamContiguous for non-block contexts
- sdpa_utils.hpp/cpp: new file; extract shared SDPA parameter utilities (previously duplicated between pyramid_attention and host_flash_attention)
- attention.hpp: extend SDPAIndices with past_key_blocks/past_value_blocks vectors; extend the Attention struct with per-variant block indices for Pyramid

Section 2 — Host Flash Attention:
- host_flash_attention.cpp/hpp: loop over all Concat inputs in build_sdpa_param_mapping() to collect _past_key_block_indices / _past_value_block_indices; switch #include from pyramid_attention to sdpa_utils
- base_sync_infer_request.cpp: replace scalar past_key/past_value checks with an is_past_kv() lambda that searches the block-index vectors

Section 3 — Pyramid Attention:
- pyramid_attention.cpp/hpp: add an is_block_split path in process_pyramid_model() that shrinks each pyramid-variant Concat to keep only idx past blocks; add a collect_concat_block_indices() helper; populate past_key/value_block_*_indices
- base_sync_infer_request.cpp: add block_mode and a bind_block_ports() lambda in bind_pyramid_attention_inputs()
- just_sync_infer_request.cpp: add share_kv_block_buffers() for pyramid variants
- partitioning/patterns/sdpa.cpp: relax the Concat input-count check to support multi-block inputs

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
Details:
Extends Host Flash Attention (HFA) and Pyramid attention to operate with the block-split KV cache produced by SplitKVCacheIntoBlocks (Part 1/4).

Section 1 — Shared infrastructure:
- util.hpp/cpp: rename isPastKeyValuesKey/Value to isPastKeyParam/isPastValueParam; add isPastKeyParamContiguous / isPastValueParamContiguous
- sdpa_utils.hpp/cpp: new file extracting shared SDPA parameter utilities (previously duplicated between HFA and Pyramid)
- attention.hpp: extend SDPAIndices with past_key_blocks/past_value_blocks vectors; extend the Attention struct with per-variant block indices

Section 2 — Host Flash Attention:
- host_flash_attention.cpp: loop over all Concat inputs in build_sdpa_param_mapping() to collect _past_key_block_indices / _past_value_block_indices
- base_sync_infer_request.cpp: replace scalar past_key/past_value checks with an is_past_kv() lambda

Section 3 — Pyramid Attention:
- pyramid_attention.cpp: add an is_block_split path in process_pyramid_model() that shrinks each pyramid-variant Concat to idx past blocks; add a collect_concat_block_indices() helper
- base_sync_infer_request.cpp: add block_mode and a bind_block_ports() lambda in bind_pyramid_attention_inputs()
- just_sync_infer_request.cpp: share_kv_block_buffers() for pyramid variants
- partitioning/patterns/sdpa.cpp: relax the Concat input-count check for multi-block inputs

This is part 3/4 of the block-based KV cache feature split.
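The collect_concat_block_indices() idea — looping over every Concat input rather than assuming a single past-KV input — can be sketched abstractly. The ConcatInput stand-in and the parameter-naming convention below are illustrative assumptions only; the real code walks ov::Node input ports, not name strings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one Concat input: just the name of the node
// feeding that port (the real code inspects the producing ov::Node).
struct ConcatInput {
    std::string producer_name;
};

// Assumed naming convention, for illustration only: past-key block
// parameters contain "past_key_values" and ".key.block." in their name.
inline bool is_past_key_block(const std::string& name) {
    return name.find("past_key_values") != std::string::npos &&
           name.find(".key.block.") != std::string::npos;
}

// Sketch of collect_concat_block_indices(): visit all Concat inputs and
// record which positions carry past-key blocks, so a multi-block Concat
// (block_0, block_1, ..., new_key) is handled like the old two-input one.
inline std::vector<std::size_t> collect_concat_block_indices(
        const std::vector<ConcatInput>& inputs) {
    std::vector<std::size_t> block_indices;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        if (is_past_key_block(inputs[i].producer_name)) {
            block_indices.push_back(i);
        }
    }
    return block_indices;
}
```

This is also why the Concat input-count check in partitioning/patterns/sdpa.cpp has to be relaxed: with block splitting, a KV Concat legitimately has more than two inputs.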
Tickets:
AI Assistance: