[NPUW] Add block-based KV cache support for HFA and Pyramid attention#35014

Draft
intelgaoxiong wants to merge 1 commit into openvinotoolkit:master from intelgaoxiong:xiong/block_kv_pr3_hfa_decouple

Conversation

@intelgaoxiong (Contributor)

Details:

Extends Host Flash Attention (HFA) and Pyramid attention to operate with
the block-split KV cache produced by SplitKVCacheIntoBlocks (Part 1/4).

Section 1 — Shared infrastructure:

  • util.hpp/cpp: rename isPastKeyValuesKey/Value to isPastKeyParam/isPastValueParam; add isPastKeyParamContiguous / isPastValueParamContiguous for non-block contexts
  • sdpa_utils.hpp/cpp: new file extracting shared SDPA parameter utilities (previously duplicated between HFA and Pyramid)
  • attention.hpp: extend SDPAIndices with past_key_blocks/past_value_blocks vectors; extend Attention struct with per-variant block indices

Section 2 — Host Flash Attention:

  • host_flash_attention.cpp/hpp: loop over all Concat inputs in build_sdpa_param_mapping() to collect _past_key_block_indices / _past_value_block_indices; switch #include from pyramid_attention to sdpa_utils
  • base_sync_infer_request.cpp: replace scalar past_key/past_value checks with an is_past_kv() lambda that also searches the block-index vectors

Section 3 — Pyramid Attention:

  • pyramid_attention.cpp/hpp: add is_block_split path in process_pyramid_model() that shrinks each pyramid-variant Concat to keep only idx past blocks; add collect_concat_block_indices() helper; populate past_key/value_block_*_indices
  • base_sync_infer_request.cpp: add block_mode + bind_block_ports() lambda in bind_pyramid_attention_inputs()
  • just_sync_infer_request.cpp: share_kv_block_buffers() for pyramid variants
  • partitioning/patterns/sdpa.cpp: relax Concat input-count check for multi-block inputs

This is part 3/4 of the block-based KV cache feature split.

Tickets:

AI Assistance:

  • AI assistance used: no / yes
  • If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks).

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
@github-actions github-actions bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Mar 29, 2026