ExLlamaV3 quantizations of Qwen3-235B-A22B-Instruct-2507 with tensor-level (L3) optimization and boosted (5 bpw) attention layers. Maximum effort was applied to produce the best quantizations possible, trading time and compute for quality.

Using this measurement.json file and the base quants provided, anyone can produce additional highly optimized quantizations at any reasonable bpw in seconds (a sketch of the allocation step appears under Methodology below). All work was done with ExLlamaV3 v0.0.18.

## Optimized

| Quant | Size | bpw | Target | PPL | vs 2.0 | vs 3.0 |
|---|---|---|---|---|---|---|
| 2.26bpw-h6-opt | 66 GB | 2.26 | 72GB @ 64k | 4.54 | −18% | +12% |
| 2.44bpw-h6-opt | 69 GB | 2.44 | 96GB @ 256k | 4.33 | −22% | +7% |
| 2.89bpw-h6-opt | 81 GB | 2.89 | 96GB @ 128k | 4.02 | −28% | −0.5% |
| 3.06bpw-h6-opt | 86 GB | 3.06 | 96GB @ 64k | 3.94 | −29% | −2.5% |
| 3.93bpw-h6-opt | 109 GB | 3.93 | — | — | — | — |
| 4.68bpw-h6-opt | 129 GB | 4.68 | — | — | — | — |

## Base

| Quant | Size | bpw | PPL |
|---|---|---|---|
| 2.0bpw-h6 | 57 GB | 2.0 | 5.57 |
| 3.0bpw-h6 | 84 GB | 3.0 | 4.04 |
| 4.0bpw-h6 | 112 GB | 4.0 | — |
| 5.0bpw-h6 | 139 GB | 5.0 | — |
| 6.0bpw-h6 | 166 GB | 6.0 | — |

## Methodology

Optimized quants were produced with exl3's measure.py → optimize.py → recompile.py pipeline. After optimization, the attention layers were replaced with 5 bpw tensors.
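
To make the allocation step concrete, here is a minimal greedy sketch of how a measurement file can drive per-tensor bit assignment. This is an illustration only, not exl3's actual optimizer, and the measurement.json schema shown (per-tensor element counts plus a list of measured bpw/error options) is a hypothetical stand-in for the real format:

```python
import json

def allocate_bits(measurement_path: str, target_bpw: float) -> dict:
    """Greedy bit allocation under a total-size budget (illustrative only).
    Assumed schema: {"tensors": {name: {"numel": int,
        "options": [{"bpw": float, "err": float}, ...]}}}"""
    with open(measurement_path) as f:
        tensors = json.load(f)["tensors"]

    # Start every tensor at its cheapest measured option.
    plan = {n: min(t["options"], key=lambda o: o["bpw"]) for n, t in tensors.items()}
    budget = target_bpw * sum(t["numel"] for t in tensors.values())  # total bits

    while True:
        spent = sum(plan[n]["bpw"] * tensors[n]["numel"] for n in plan)
        best = None  # (tensor name, option, error reduction per extra bit)
        for n, t in tensors.items():
            for opt in t["options"]:
                extra_bits = (opt["bpw"] - plan[n]["bpw"]) * t["numel"]
                if extra_bits <= 0 or spent + extra_bits > budget:
                    continue
                gain = (plan[n]["err"] - opt["err"]) / extra_bits
                if best is None or gain > best[2]:
                    best = (n, opt, gain)
        if best is None:
            break  # no further upgrade fits the budget
        plan[best[0]] = best[1]

    return {n: opt["bpw"] for n, opt in plan.items()}
```

Because the expensive part (measuring per-tensor quantization error) is already captured in measurement.json, only the cheap allocation and recompile steps need to be re-run for a new bpw target, which is why fresh quants take seconds rather than days.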

Perplexity by optimization stage:

| Target | Pre-attn bpw | PPL | PPL (+5bpw attn) | Final bpw | Size |
|---|---|---|---|---|---|
| 72GB @ 64k | 2.20 | 4.72 | 4.54 | 2.26 | 66 GB |
| 96GB @ 256k | 2.41 | 4.48 | 4.33 | 2.44 | 69 GB |
| 96GB @ 128k | 2.86 | 4.20 | 4.02 | 2.89 | 81 GB |
| 96GB @ 64k | 3.00 | 4.14 | 3.94 | 3.06 | 86 GB |
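
The "PPL (+5bpw attn)" column is the perplexity after the post-optimization attention replacement. A rough sketch of that override, reusing the hypothetical plan format from the sketch above (the q/k/v/o projection name patterns are assumptions based on standard Qwen3 module naming, not taken from exl3):

```python
ATTN_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")  # assumed name patterns

def boost_attention(plan: dict, attn_bpw: float = 5.0) -> dict:
    """Raise every attention projection tensor to at least attn_bpw."""
    return {
        name: max(bpw, attn_bpw) if any(k in name for k in ATTN_KEYS) else bpw
        for name, bpw in plan.items()
    }
```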

Cost vs gain (relative to 2.0 base @ 57 GB, PPL 5.57):

| Target | Final bpw | Size | +GB | PPL | Δ PPL | Δ PPL/GB |
|---|---|---|---|---|---|---|
| 72GB @ 64k | 2.26 | 66 GB | +9 | 4.54 | −1.03 | −0.11 |
| 96GB @ 256k | 2.44 | 69 GB | +12 | 4.33 | −1.24 | −0.10 |
| 96GB @ 128k | 2.89 | 81 GB | +24 | 4.02 | −1.55 | −0.06 |
| 96GB @ 64k | 3.06 | 86 GB | +29 | 3.94 | −1.63 | −0.06 |
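
The derived columns follow directly from each quant's size and PPL against the 2.0 base; a few lines of Python reproduce them (numbers copied from the tables above):

```python
BASE_SIZE, BASE_PPL = 57, 5.57  # 2.0bpw-h6 base quant

quants = [  # (target, size in GB, PPL), from the tables above
    ("72GB @ 64k",  66, 4.54),
    ("96GB @ 256k", 69, 4.33),
    ("96GB @ 128k", 81, 4.02),
    ("96GB @ 64k",  86, 3.94),
]

for target, size, ppl in quants:
    extra_gb = size - BASE_SIZE
    delta_ppl = ppl - BASE_PPL
    print(f"{target:12} +{extra_gb} GB  ΔPPL {delta_ppl:+.2f}  "
          f"ΔPPL/GB {delta_ppl / extra_gb:+.2f}")
```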

The attention boost adds only 0.03–0.06 bpw (2–3 GB) in exchange for a meaningful PPL reduction. Returns diminish between 4 bpw and 5 bpw attention at higher base bitrates.

Lower PPL is better. The "vs 2.0" and "vs 3.0" columns show the percentage change from the 2.0 base (PPL 5.57) and the 3.0 base (PPL 4.04). All perplexities were measured at 2k context.
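
For reference, perplexity here is the standard exp of the mean per-token negative log-likelihood; a minimal sketch (the 2k context enters through how each token's NLL is scored):

```python
import math

def perplexity(token_nlls: list) -> float:
    """exp of mean per-token negative log-likelihood (natural log);
    each token is scored with at most 2k tokens of preceding context."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```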
