Post-training Quantization:
- Breaking changes:
  - Renamed `nncf.CompressWeightsMode.CB4_F8E4M3` mode option to `nncf.CompressWeightsMode.CB4`.
  - Renamed
- General:
  - Added `nncf.prune` API function, which provides a unified interface for pruning algorithms. It is currently available for the PyTorch backend and supports Magnitude Pruning. More details about the new API can be found in the documentation.
  - Added `nncf.build_graph` API function for building `NNCFGraph` from a model. This API can be used to inspect and define the ignored scope.
  - Added documentation about using `nncf.IgnoredScope`.
  - Reworked `HWConfig`, which now uses a Python-style definition of the hardware configuration instead of JSON files.
  - Added
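As background on the new `nncf.prune` API above, which currently supports Magnitude Pruning: magnitude pruning zeroes the fraction of weights with the smallest absolute values. A minimal pure-Python sketch of the idea (illustrative only, not NNCF's implementation; `magnitude_prune` is a hypothetical helper):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest |value|.

    Conceptual sketch of magnitude pruning, not the NNCF implementation.
    """
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    # Magnitude at the sparsity cutoff; everything at or below it is pruned.
    threshold = flat[k - 1] if k > 0 else float("-inf")
    return [w if abs(w) > threshold else 0.0 for w in weights]

# Prune half the weights: the three smallest magnitudes become zero.
pruned = magnitude_prune([0.05, -0.8, 0.3, -0.01, 1.2, 0.2], sparsity=0.5)
```

See the NNCF documentation for the actual `nncf.prune` signature and options.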
- Features:
  - Added support for models containing MatMul operations with transposed activation inputs in the data-free Weight Compression and data-aware AWQ algorithms.
  - (OpenVINO) Introduced a new experimental compression data type, ADAPTIVE_CODEBOOK. This compression type calculates a unique codebook for each MatMul or block of identical MatMuls (for example, all down_proj layers could share the same codebook). This approach reduces quality degradation in the case of per-channel weight compression. See example.
  - (TorchFX) Introduced preview support for the new `compress_pt2e` API, enabling quantization of `torch.fx.GraphModule` models with the `OpenVINOQuantizer`. Users can now quantize their models in ExecuTorch for the OpenVINO backend via the nncf `compress_pt2e` API, employing Scale Estimation and AWQ.
  - (PyTorch) Added support for linear functions in the Fast Bias Correction algorithm to improve the accuracy of such models after quantization.
  - (OpenVINO) Added an activation profiler tool to collect and visualize tensor statistics.
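As background on the ADAPTIVE_CODEBOOK feature above: codebook compression stores, for each weight, an index into a small table of representative values, and the decompressed weight is the looked-up table entry. A minimal sketch of nearest-entry codebook quantization (illustrative only; NNCF computes a separate codebook per MatMul, and `quantize_to_codebook` is a hypothetical helper):

```python
def quantize_to_codebook(weights, codebook):
    """Map each weight to the index of its nearest codebook entry.

    Conceptual sketch: ADAPTIVE_CODEBOOK additionally derives the codebook
    itself per MatMul; here the codebook is simply given as input.
    """
    indexes = [
        min(range(len(codebook)), key=lambda i: abs(w - codebook[i]))
        for w in weights
    ]
    # Dequantization is a plain table lookup.
    dequantized = [codebook[i] for i in indexes]
    return indexes, dequantized

idx, deq = quantize_to_codebook([0.11, -0.52, 0.9], codebook=[-0.5, 0.0, 0.5, 1.0])
```

A per-MatMul codebook lets the representative values track that layer's weight distribution, which is why it helps in the per-channel compression case mentioned above.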
- Fixes:
  - (ONNX) Fixed the `compress_quantize_weights_transformation()` method by removing names of deleted initializers from graph inputs.
  - (ONNX) Fixed incorrect insertion of MatMulNBits nodes.
  - (ONNX) Fixed
- Improvements:
  - Added support for the compression of 3D weights in the AWQ, Scale Estimation, and GPTQ algorithms. Models with MoE (Mixture of Experts) layers, such as GPT-OSS-20B and Qwen3-30B-A3B, can now be compressed with data-aware methods.
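For context on the 3D weight support above: in MoE models, per-expert weight matrices are typically stacked along a leading dimension, giving a `[num_experts, out, in]` tensor. A hedged sketch of computing one symmetric quantization scale per expert slice (illustrative only; NNCF's actual grouping and data-aware algorithms are more fine-grained, and `per_expert_scales` is a hypothetical helper):

```python
def per_expert_scales(weights_3d, max_int=127):
    """Compute one symmetric int8 scale per expert for a 3D weight tensor.

    `weights_3d` is a nested list shaped [num_experts][out][in].
    Conceptual sketch only, not NNCF's actual compression scheme.
    """
    scales = []
    for expert in weights_3d:
        # Absolute maximum over this expert's 2D weight matrix.
        amax = max(abs(v) for row in expert for v in row)
        scales.append(amax / max_int)
    return scales

# Two experts, each with a 1x2 weight matrix.
scales = per_expert_scales([[[0.5, -1.27]], [[2.54, 0.1]]])
```

Treating each expert slice as its own 2D matrix is what lets data-aware methods such as AWQ operate on MoE weights without flattening the experts together.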
- Tutorials:
- Post-Training Quantization of YOLO26 OpenVINO Model
- Post-Training Optimization of Wan2.2 Model
- Post-Training Optimization of DeepSeek-OCR Model
- Post-Training Optimization of Z-Image-Turbo Model
- Post-Training Optimization of Qwen-Image Model
- Post-Training Optimization of Qwen3-TTS Model
- Post-Training Optimization of Qwen3-ASR Model
- Post-Training Optimization of Fun-ASR-Nano Model
- Post-Training Optimization of Fun-CosyVoice 3.0 Model
Deprecations/Removals:
- (TensorFlow) Removed support for the TensorFlow backend.
- (PyTorch) Removed the legacy `create_compressed_model` API for the PyTorch backend, which was previously marked as deprecated.
- (PyTorch) Removed legacy algorithms for PyTorch that were based on `NNCFNetwork`: NAS, Structural Pruning, AutoML, Knowledge Distillation, Mixed-Precision Quantization, and Movement Sparsity.
Requirements:
- Dropped `jsonschema`, `natsort`, and `pymoo` from dependencies as they are no longer required.
- Updated `numpy` to `>=1.24.0, <2.5.0`.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@avolkov-intel @Shehrozkashif @ruro @mostafafaheem