
[Qwen3-VL] feature-extraction export fails: unordered_map::at crash due to torch.vmap in masking logic #3287

@OntosAI

Description


I am attempting to export the Qwen3-VL-Embedding-2B model to OpenVINO; the failure blocks deployment of the new multimodal embedding capabilities.
I have hit two blockers:
1. Optimum-CLI limitation: the `feature-extraction` task is not yet supported for the `qwen3_vl` architecture.
2. OpenVINO conversion crash: manual conversion via `ov.convert_model` fails with `RuntimeError: unordered_map::at`. The traceback points to tracing of `torch.vmap` operations inside `transformers.masking_utils` (specifically `_vmap_for_bhqkv`), even when `attn_implementation="eager"` is explicitly set.
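For the first blocker, the standard Optimum export invocation rejects the task/architecture combination (shown here only to illustrate the command shape; the output directory name is arbitrary, and the exact error text may differ by version):

```shell
optimum-cli export openvino \
  --model Qwen/Qwen3-VL-Embedding-2B \
  --task feature-extraction \
  --trust-remote-code \
  qwen3_vl_embedding_ov
```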
To Reproduce

Environment:
openvino==2025.3
torch==2.8.0
transformers (Qwen3-VL branch/latest)
optimum-intel (git+https://github.com/openvino-dev-samples/optimum.git@qwen3vl)
Minimal Reproduction Script:
```python
import torch
import openvino as ov
from transformers import AutoModel, AutoProcessor
from PIL import Image

model_id = "Qwen/Qwen3-VL-Embedding-2B"

# Load with eager attention to attempt disabling FlashAttn/SDPA optimization
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    attn_implementation="eager",
    device_map="cpu"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Prepare multimodal dummy input
dummy_image = Image.new('RGB', (28, 28), color='black')
dummy_text = "<|image_pad|>Describe this image."
inputs = processor(text=[dummy_text], images=[dummy_image], return_tensors="pt")

# Wrapper to align with OpenVINO input expectations
class Wrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, pixel_values, image_grid_thw):
        return self.model(
            input_ids=input_ids, attention_mask=attention_mask,
            pixel_values=pixel_values, image_grid_thw=image_grid_thw,
            output_hidden_states=True
        ).last_hidden_state

# Crash happens here
ov_model = ov.convert_model(
    Wrapper(model),
    example_input=(inputs.input_ids, inputs.attention_mask,
                   inputs.pixel_values, inputs.image_grid_thw)
)
```
Relevant Traceback
The error occurs deep within the PyTorch frontend when handling the vectorized masking logic:
```
Traceback (most recent call last):
  ...
  File ".../transformers/masking_utils.py", line 392, in sdpa_mask_recent_torch
    causal_mask = _vmap_for_bhqkv(mask_function)(batch_arange, head_arange, cache_position, kv_arange)
  ...
  File ".../torch/_functorch/vmap.py", line 484, in _flat_vmap
    batched_outputs = func(*batched_inputs, **kwargs)
  ...
  File ".../openvino/frontend/pytorch/ts_decoder.py", line 84, in __init__
    raise RuntimeError(
RuntimeError: Couldn't get TorchScript module by tracing.
Exception:
unordered_map::at
```
Request
1. Please add support for `task="feature-extraction"` for `qwen3_vl` in Optimum Intel.
2. Fix the OpenVINO PyTorch frontend to handle (or correctly bypass) the `torch.vmap` / functorch constructs used in the new Transformers masking implementation, or provide a workaround that strictly disables these paths during tracing.
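For anyone blocked in the meantime, one possible workaround direction (a sketch only, not a tested fix): the `[batch, heads, q, kv]` mask that `_vmap_for_bhqkv` builds via nested `torch.vmap` can equivalently be built with plain broadcasting, which TorchScript tracing handles, assuming the `mask_function` uses only broadcastable tensor ops (true for the standard causal rule). `causal_mask_fn` below is a stand-in for that function, not the actual transformers implementation:

```python
import torch

def causal_mask_fn(batch_idx, head_idx, q_idx, kv_idx):
    # Stand-in with the same signature contract as the mask_function
    # that transformers passes to _vmap_for_bhqkv
    return kv_idx <= q_idx

def mask_via_broadcast(fn, batch, heads, q_pos, kv_pos):
    # Build mask[b, h, q, kv] = fn(batch[b], heads[h], q_pos[q], kv_pos[kv])
    # with plain broadcasting instead of four nested torch.vmap calls
    out = fn(batch[:, None, None, None],
             heads[None, :, None, None],
             q_pos[None, None, :, None],
             kv_pos[None, None, None, :])
    # fn may ignore some indices, so expand up to the full 4-D shape
    return torch.broadcast_to(
        out, (batch.numel(), heads.numel(), q_pos.numel(), kv_pos.numel()))

batch, heads = torch.arange(2), torch.arange(4)
q_pos, kv_pos = torch.arange(5), torch.arange(5)
mask = mask_via_broadcast(causal_mask_fn, batch, heads, q_pos, kv_pos)
print(mask.shape)  # torch.Size([2, 4, 5, 5])
```

In practice this would mean patching `transformers.masking_utils.sdpa_mask_recent_torch` (the frame in the traceback above) to build `causal_mask` this way rather than through `_vmap_for_bhqkv`; the exact signature to patch should be checked against the installed transformers version.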
