ExLlamaV3 quantizations of Qwen3-235B-A22B-Instruct-2507 with tensor-level (L3) optimization and boosted attention layers (5 bit). Maximum effort was applied toward the best possible quantizations, at the expense of time and compute.
Using the measurement.json file and the base quants provided here, anyone can produce additional highly optimized quantizations at any reasonable bpw in seconds. All work was done with ExLlamaV3 v0.0.18.
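For anyone rolling their own target, a rerun looks roughly like the sketch below. The script names come from the exl3 pipeline described under Methodology; every flag and path here is an assumption for illustration, so check the ExLlamaV3 v0.0.18 sources for the real interface.

```python
# Hypothetical driver for re-optimizing from the shipped measurement.json.
# optimize.py / recompile.py are the exl3 pipeline scripts named below under
# Methodology; all flag names are ASSUMPTIONS -- verify against the exl3 repo.
import subprocess

TARGET_BPW = "2.70"  # any reasonable bpw; measurement.json makes this cheap

# Step 1: solve a per-tensor bit allocation for the new target
subprocess.run([
    "python", "optimize.py",
    "--measurement", "measurement.json",    # shipped with this repo
    "--target_bpw", TARGET_BPW,             # assumed flag name
    "--out", "plan.json",                   # assumed flag name
], check=True)

# Step 2: assemble the quant from the base quants using that plan
subprocess.run([
    "python", "recompile.py",
    "--plan", "plan.json",                  # assumed flag name
    "--in", "./base-quants/",               # assumed flag name
    "--out", f"./{TARGET_BPW}bpw-h6-opt/",  # assumed flag name
], check=True)
```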
## Optimized
| Quant | Size | bpw | Target | PPL | vs 2.0 | vs 3.0 |
|---|---|---|---|---|---|---|
| 2.26bpw-h6-opt | 66 GB | 2.26 | 72GB @ 64k | 4.54 | −18% | +12% |
| 2.44bpw-h6-opt | 69 GB | 2.44 | 96GB @ 256k | 4.33 | −22% | +7% |
| 2.89bpw-h6-opt | 81 GB | 2.89 | 96GB @ 128k | 4.02 | −28% | −0.5% |
| 3.06bpw-h6-opt | 86 GB | 3.06 | 96GB @ 64k | 3.94 | −29% | −2.5% |
| 3.93bpw-h6-opt | 109 GB | 3.93 | – | – | – | – |
| 4.68bpw-h6-opt | 129 GB | 4.68 | – | – | – | – |
## Base
| Quant | Size | bpw | PPL |
|---|---|---|---|
| 2.0bpw-h6 | 57 GB | 2.0 | 5.57 |
| 3.0bpw-h6 | 84 GB | 3.0 | 4.04 |
| 4.0bpw-h6 | 112 GB | 4.0 | – |
| 5.0bpw-h6 | 139 GB | 5.0 | – |
| 6.0bpw-h6 | 166 GB | 6.0 | – |
## Methodology
Optimized quants use exl3's measure.py → optimize.py → recompile.py pipeline. Attention layers are then replaced with 5 bpw tensors after optimization.
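Roughly, measure.py records how much quantization error each tensor incurs at each candidate bitrate, and optimize.py then picks a per-tensor bitrate under a global size budget. The toy greedy allocator below sketches that underlying idea; it is not exl3's actual solver, and the layer names and error numbers in the demo are made up.

```python
# Toy greedy bit allocator illustrating tensor-level (L3) optimization:
# repeatedly spend bits where they buy the largest measured error reduction
# per bit, until the global budget (target bpw) is exhausted.
# NOT exl3's real algorithm -- just the budget-vs-error tradeoff idea.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    numel: int              # number of weights
    err: dict[int, float]   # measured proxy error at each candidate bpw

def allocate(tensors: list[Tensor], target_bpw: float) -> dict[str, int]:
    total = sum(t.numel for t in tensors)
    budget = target_bpw * total                   # total bits to spend
    bits = {t.name: min(t.err) for t in tensors}  # start everyone at lowest bpw
    spent = sum(bits[t.name] * t.numel for t in tensors)
    while True:
        best, best_gain = None, 0.0
        for t in tensors:
            cur = bits[t.name]
            nxt = min((b for b in t.err if b > cur), default=None)
            if nxt is None:
                continue                          # already at max candidate bpw
            cost = (nxt - cur) * t.numel          # extra bits this upgrade costs
            if spent + cost > budget:
                continue                          # would blow the budget
            gain = (t.err[cur] - t.err[nxt]) / cost  # error drop per bit
            if gain > best_gain:
                best, best_nxt, best_cost, best_gain = t, nxt, cost, gain
        if best is None:
            return bits                           # no affordable upgrade left
        bits[best.name] = best_nxt
        spent += best_cost

# Demo with invented numbers: the MLP tensor, being cheaper per bit of
# error reduction, tends to win upgrades first.
layers = [
    Tensor("attn.q_proj", 4096 * 4096,  {2: 9.0, 3: 4.0, 4: 2.5, 5: 2.0}),
    Tensor("mlp.up_proj", 4096 * 14336, {2: 6.0, 3: 3.5, 4: 2.2, 5: 1.9}),
]
print(allocate(layers, target_bpw=3.0))
```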
Perplexity by optimization stage:
| Target | Pre-attn bpw | Pre-attn PPL | PPL w/ 5bpw attn | Final bpw | Size |
|---|---|---|---|---|---|
| 72GB @ 64k | 2.20 | 4.72 | 4.54 | 2.26 | 66 GB |
| 96GB @ 256k | 2.41 | 4.48 | 4.33 | 2.44 | 69 GB |
| 96GB @ 128k | 2.86 | 4.20 | 4.02 | 2.89 | 81 GB |
| 96GB @ 64k | 3.00 | 4.14 | 3.94 | 3.06 | 86 GB |
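Reading the stage table as ratios makes the attention boost's value concrete; a quick recomputation from the numbers above:

```python
# Recompute the cost and gain of the 5 bpw attention boost, per the
# stage table above: (target, pre_bpw, pre_ppl, post_ppl, final_bpw).
stages = [
    ("72GB @ 64k",  2.20, 4.72, 4.54, 2.26),
    ("96GB @ 256k", 2.41, 4.48, 4.33, 2.44),
    ("96GB @ 128k", 2.86, 4.20, 4.02, 2.89),
    ("96GB @ 64k",  3.00, 4.14, 3.94, 3.06),
]
for target, b0, p0, p1, b1 in stages:
    print(f"{target}: +{b1 - b0:.2f} bpw buys {100 * (p0 - p1) / p0:.1f}% lower PPL")
# -> e.g. "72GB @ 64k: +0.06 bpw buys 3.8% lower PPL"
```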
Cost vs gain (relative to 2.0 base @ 57 GB, PPL 5.57):
| Target | Final bpw | Size | +GB | PPL | Δ PPL | Δ PPL/GB |
|---|---|---|---|---|---|---|
| 72GB @ 64k | 2.26 | 66 GB | +9 | 4.54 | −1.03 | −0.11 |
| 96GB @ 256k | 2.44 | 69 GB | +12 | 4.33 | −1.24 | −0.10 |
| 96GB @ 128k | 2.89 | 81 GB | +24 | 4.02 | −1.55 | −0.06 |
| 96GB @ 64k | 3.06 | 86 GB | +29 | 3.94 | −1.63 | −0.06 |
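The Δ PPL/GB column is just the perplexity delta divided by the extra weight over the 2.0 base; a quick sanity check of the table above:

```python
# Recompute the cost-vs-gain columns (2.0 base: 57 GB, PPL 5.57).
base_gb, base_ppl = 57, 5.57
rows = [  # (target, size_gb, ppl)
    ("72GB @ 64k",  66, 4.54),
    ("96GB @ 256k", 69, 4.33),
    ("96GB @ 128k", 81, 4.02),
    ("96GB @ 64k",  86, 3.94),
]
for target, gb, ppl in rows:
    extra = gb - base_gb
    dppl = ppl - base_ppl
    print(f"{target}: +{extra} GB, ΔPPL {dppl:+.2f}, {dppl / extra:+.3f} PPL/GB")
# -> e.g. "72GB @ 64k: +9 GB, ΔPPL -1.03, -0.114 PPL/GB"
```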
The attention boost adds only 0.03–0.06 bpw (2–3 GB) but buys a meaningful PPL drop. Returns diminish between 4 bpw and 5 bpw attention at higher base bitrates.
Lower PPL is better. The "vs" columns show the % change from the 2.0 base (PPL 5.57) and the 3.0 base (PPL 4.04), e.g. 4.54 vs 5.57 ≈ −18%. Targets denote the VRAM budget and context length each quant was sized to fit. All perplexity was measured at 2k context.
Base model: [Qwen/Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)