## Description

After the CPU Inference Optimization update (commit `112f853`), running inference with `i2_s` quantization on ARM (aarch64) produces completely incoherent output: random tokens with no relation to the prompt. Rolling back to commit `404980e` (the last commit before the optimization merge) restores correct, coherent output.
## Environment

- Hardware: Raspberry Pi 5 (8GB RAM), ARM Cortex-A76 (aarch64)
- OS: Raspberry Pi OS 64-bit (Debian 12 Bookworm)
- Compiler: Debian clang version 18.1.8
- CMake: 3.25.1
- Python: 3.9 (conda)
- Model: microsoft/BitNet-b1.58-2B-4T-gguf (`ggml-model-i2_s.gguf`)
- Quantization: `i2_s`
## Steps to Reproduce

1. Clone the repo at current HEAD (`01eb415`):

   ```sh
   git clone --recursive https://github.com/microsoft/BitNet.git
   cd BitNet
   ```

2. Generate kernels and build (following the Adafruit guide):

   ```sh
   python utils/codegen_tl1.py --model bitnet_b1_58-3B --BM 160,320,320 --BK 64,128,64 --bm 32,64,32
   export CC=clang-18 CXX=clang++-18
   rm -rf build && mkdir build && cd build
   cmake .. -DCMAKE_BUILD_TYPE=Release
   make -j$(nproc)
   cd ..
   ```

3. Download the model:

   ```sh
   huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
   ```

4. Run inference:

   ```sh
   python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -t 4 -cnv
   ```
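If step 4 misbehaves, one cheap sanity check is that the downloaded file really is a valid GGUF container rather than a truncated download. A minimal sketch (the helper name is mine; the path matches the steps above):

```python
import struct
from pathlib import Path

def is_gguf(path: str) -> bool:
    """Check the 4-byte GGUF magic and that a plausible format version follows."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        if f.read(4) != b"GGUF":          # GGUF files start with this magic
            return False
        ver_bytes = f.read(4)             # followed by a little-endian uint32 version
        if len(ver_bytes) < 4:
            return False
        return struct.unpack("<I", ver_bytes)[0] >= 1

# e.g. is_gguf("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf")
```

This only validates the header, not the tensor data, but it rules out the most common download failures.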
## Broken Output (HEAD, `01eb415`)

```
> hi how are you
ri differentorefFly increase Hurtutar run following section underestimateAD Sachs weighedision
cann RICTS Reyn taskfir-ra mark filtr castWATCHB fr ret flatten missionuche purchase parameter
gramhit associatedyuraft runeded take compound sugar contrast unsubedom conveyuffanford...
```
## Working Output (commit `404980e`)

```
> hi
Hello! How can I assist you today?
> what is a raspi 5
The Raspberry Pi 5 is a next-generation model of the Raspberry Pi single-board computer series...
```

Performance on the working commit: 9.68 tokens/second (4 threads, ARM NEON).
## Bisection

The regression was introduced in commit `112f853`:

```
112f853 [feat] I2S kernels for weight & activation parallel on Intel & ARM machine;
[feat] I2S GEMV & GEMM(llama.cpp);
[feat] quantize activation & dequantize embedding(llama.cpp);
[fix] compile bug: cannot define __ARM_FEATURE_DOTPROD(llama.cpp)
```

The last known working commit is `404980e` (one commit before `112f853`).
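For anyone re-running the bisection, the broken output is distinctive enough (mid-word capitals, fused over-long tokens) that the good/bad classification can be automated with a crude text heuristic. A sketch only; the 0.1 threshold is an arbitrary cutoff chosen from the samples in this report, and the `run_inference.py` flags are assumptions rather than project tooling:

```python
import subprocess

def garbled_ratio(text: str) -> float:
    """Fraction of tokens that look like tokenizer garbage:
    very long letter runs, or capitals in the middle of a word."""
    tokens = text.split()
    if not tokens:
        return 1.0
    def garbled(tok: str) -> bool:
        return len(tok) > 12 or any(c.isupper() for c in tok[1:])
    return sum(garbled(t) for t in tokens) / len(tokens)

def check_commit() -> int:
    """Hypothetical `git bisect run` predicate: exit 0 = good, 1 = bad."""
    out = subprocess.run(
        ["python", "run_inference.py",
         "-m", "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
         "-p", "hi how are you", "-t", "4", "-n", "32"],
        capture_output=True, text=True).stdout
    return 0 if garbled_ratio(out) < 0.1 else 1
```

Wrapped in a script that also regenerates the kernels and rebuilds at each step, `check_commit` could drive `git bisect run` (exit 0 marks a commit good, nonzero marks it bad).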
## Notes

- The build completes without errors on both commits; the issue is runtime behavior, not compilation. `ggml-bitnet-mad.cpp` is compiled and linked in both cases.
- NEON is detected and enabled (`NEON = 1` in the system_info output).
- DOTPROD detection: `GGML_COMPILER_SUPPORT_DOTPROD` fails, but `COMPILER_SUPPORTS_ARMV82_DOTPROD` succeeds.
- This issue also appears to affect other ARM64 platforms (Ampere/Hetzner CAX servers), not just Raspberry Pi.
- The Adafruit "BitNet on Raspberry Pi" guide (published Sept 2025, before the optimization commit) confirms working output on Pi 4 and Pi 5 with the older codebase.
Related to #411 — same root cause. Adding Pi 5 (Cortex-A76 with dotprod) as another confirmed affected platform.
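When chasing the `__ARM_FEATURE_DOTPROD` detection mismatch, it can help to dump the macros the compiler actually predefines for a given `-march`, using the standard `-dM -E` preprocessor trick. A small sketch; the compiler name and flags are assumptions and should be adjusted to whatever CMake actually invokes:

```python
import shutil
import subprocess

def arm_feature_macros(cc: str = "clang-18",
                       flags: tuple = ("-march=armv8.2-a+dotprod",)) -> list:
    """Return the __ARM_FEATURE_* macros `cc` predefines for `flags`.

    On a non-ARM host, or when the flag is rejected, the preprocessor
    emits nothing useful and this simply returns an empty list.
    """
    if shutil.which(cc) is None:
        cc = "cc"  # fall back to the default system compiler
    try:
        proc = subprocess.run([cc, *flags, "-dM", "-E", "-x", "c", "-"],
                              input="", capture_output=True, text=True)
    except FileNotFoundError:
        return []
    return sorted(line for line in proc.stdout.splitlines()
                  if "__ARM_FEATURE" in line)
```

On the Pi 5, `__ARM_FEATURE_DOTPROD` should appear in this list whenever the dotprod target feature is genuinely enabled, which makes it easy to compare what the compiler reports against what the CMake checks concluded.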