Experimental global target bits‑per‑weight quantization of allenai/Olmo-3-7B-Think

Using non-standard (forked) LLaMA C++ release b7540 for quantization.

Original model: allenai/Olmo-3-7B-Think

From the original model creators:

Model Details

Model Card for Olmo 3 Think

We introduce Olmo 3, a new family of 7B and 32B models, in both Instruct and Think variants. Long chain-of-thought thinking improves performance on reasoning tasks like math and coding.

Olmo is a series of open language models designed to enable the science of language models. These models are pre-trained on the Dolma 3 dataset and post-trained on the Dolci datasets. We are releasing all code, checkpoints, logs (coming soon), and associated training details.

The core models released in this batch include the following:

⚠️ PLEASE READ THIS BEFORE USING THESE EXPERIMENTAL VERSIONS! ⚠️

An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but my focus has been primarily on quantization and pruning.

The method used to produce these experimental versions involves a custom version of llama-imatrix to generate an imatrix that includes the mean activations, and a custom version of llama-quantize, which computes a per-tensor weighted mean squared quantization error plus a bias/projection term (if the imatrix includes activations) and automatically selects the lowest-error quantization recipe that achieves a global target bits‑per‑weight (bpw). More details on the implementation and test results are available here.
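
To make the error metric concrete, here is a minimal Python sketch of the kind of per-tensor score described above: an activation-weighted squared error plus a bias term that projects the systematic error onto the mean activations. This is an illustration of the idea only, not the actual llama-quantize code; the function name, shapes and the exact form of the bias term are assumptions.

```python
import numpy as np

def weighted_quant_error(w, w_hat, col_weights, mean_act):
    """Illustrative per-tensor error score (not the llama-quantize implementation).
    w, w_hat:    [rows, cols] original and dequantized weights
    col_weights: per-column importance (e.g. derived from imatrix statistics)
    mean_act:    per-column mean activation (needed for the bias/projection term)"""
    diff = w_hat - w
    weighted_sq_err = np.sum(col_weights * diff * diff)   # activation-weighted squared error
    bias_term = np.sum((diff @ mean_act) ** 2)            # systematic error projected onto activations
    return weighted_sq_err + bias_term

# Toy usage: a small fake tensor and a perturbed copy standing in for quantize -> dequantize
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
w_hat = w + rng.normal(scale=0.01, size=w.shape)
print(weighted_quant_error(w, w_hat, np.ones(16), np.zeros(16)))
```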

There are two pull requests (#14891 & #15550) to merge these changes back into the core llama.cpp project. This may or may not happen, so until then the modified versions will be available on GitHub.

For testing and comparison, I use models produced by Bartowski (see credits below) and Unsloth (Daniel and Michael Han do some really interesting stuff!), but when they don't provide a version of the required model, tests and comparisons are against standard quantizations obtained by simply running llama-quantize with no further optimizations.

All experimental versions were generated using an appropriate imatrix created from datasets available at eaddario/imatrix-calibration. In llama.cpp, an imatrix is a calibration file derived from running representative text through the model and collecting activation statistics. It is used to weight quantization error so that error in more “important” directions (as estimated from activations) is penalized more heavily.
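
As a rough illustration of what such a calibration file captures (the statistical idea only, not the GGUF imatrix file format), the per-column statistics could be computed along these lines; the function and variable names are hypothetical:

```python
import numpy as np

def imatrix_stats(activations):
    """activations: [n_tokens, n_cols] inputs that feed one weight matrix
    during the calibration run."""
    a = np.asarray(activations, dtype=np.float64)
    importance = (a * a).mean(axis=0)   # columns with larger activations get larger error weights
    mean_act = a.mean(axis=0)           # mean activations, used by the bias/projection term above
    return importance, mean_act
```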

The process to generate these models is roughly as follows:

  1. Convert the original model's safetensors to GGUF F16*
  2. Estimate the Perplexity score for the F16 model (baseline) using the wikitext-2-raw-v1 dataset, and save the logits
  3. Generate an imatrix from the most appropriate calibration dataset
  4. Quantize the baseline model targeting a bpw average, allocating more bits to tensors estimated to matter more (e.g. llama-quantize --target-bpw 4.5678 --keep-bpw-state --imatrix imatrix.gguf baseline-model-F16.gguf 12)
  5. Quantize the baseline model targeting a bpw average, treating each tensor equally instead of prioritizing some (e.g. llama-quantize --target-bpw 4.5678 --no-importance --keep-bpw-state --imatrix imatrix.gguf baseline-model-F16.gguf 12)
  6. Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), HellaSwag, MMLU, Truthful QA and WinoGrande scores for each quantized model
  7. Keep the version with the best 𝜌PPL score, i.e. the highest Cor(ln(PPL(Q)), ln(PPL(base))) (see the sketch after this list)
  8. Repeat until all desired quants are created

*BF16 would be preferred, but F16 performs better on Apple's GPUs
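
For illustration, the 𝜌PPL selection criterion in step 7 can be read as the Pearson correlation between the per-token log-probabilities of the quantized model and those of the F16 baseline (recovered from the logits saved in step 2). A minimal sketch, assuming the per-token values have already been extracted:

```python
import numpy as np

def rho_ppl(logprob_base, logprob_quant):
    """Both arguments: ln p(correct token) per scored token, in the same order."""
    return float(np.corrcoef(logprob_base, logprob_quant)[0, 1])   # 1.0 = quant tracks baseline exactly
```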

Advantages and disadvantages of the global target bits‑per‑weight quantization process

Advantages

  1. Target arbitrary size models

    • When specifying --target-bpw 4.5678 for instance, the algorithm will produce a model of (nearly) exactly that size, which is very useful for maximizing VRAM usage. In a system with 24GB VRAM and a 70B model, standard quants might produce a 16.8GB file (too small, quality left on the table) or a 24.1GB file (won't fit). This approach can generate a 23.85GB file to utilize the hardware fully (see the worked calculation after this list).
  2. Data-driven mixed precision can often improve quality at a fixed size

    • Instead of using hardcoded heuristics (e.g. make attn_v Q5_K for a 70B model) that may be sub‑optimal for a given architecture or size, the quantization mix is determined by the actual error sensitivity of the specific model's weights. In practice, this often yields a better quality/size trade-off, especially in aggressive quantization scenarios (1.5 to 3.5 bpw) or for unusual architectures.

    • Please note: llama.cpp’s heuristics have been tuned across many models and are highly optimized; although the target bpw method produces better quality more often than not (>75% of cases, based on tests with 130 models from 11 different families), it can also lose in surprising cases.

  3. Allows better like-for-like comparisons between models and families

    • Standard llama.cpp quantization uses hardcoded rules like: "use Q4_K_M, except bump some tensors up/down, except fall back if incompatible, except keep some tensors unquantized..." and for that reason, two different models quantized with the same Q4_K_M type can end up with very different bpw (e.g. 4.75 and 4.30).

    • All things being equal, the performance of a model is usually proportional to its overall bpw: models with a higher bpw tend to perform better than those with a lower bpw. Since the higher-bpw model has simply been given more bits, it will typically perform better (lower perplexity, better eval scores, etc.) even if the underlying quantization method is identical. That makes the comparison not a controlled experiment, because it is between models with different effective compression ratios.

    • --target-bpw tries to address that by making the experiment more controlled: each model gets quantized to land on (approximately) the same global byte budget, so that the models' performance differences are more attributable to architecture/training differences, quantization error behaviour at the same compression ratio, optimizer’s allocation decisions, etc.
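
As a worked example of the sizing arithmetic behind the 24GB scenario in point 1, the target bpw is just the file-size budget in bits divided by the parameter count; this back-of-the-envelope sketch ignores GGUF metadata, KV cache and runtime buffers, which need their own headroom:

```python
def target_bpw(file_budget_gib, params_b):
    """Bits available in the file budget divided by the number of weights."""
    return file_budget_gib * (1024 ** 3) * 8 / (params_b * 1e9)

print(round(target_bpw(23.85, 70), 4))   # ≈ 2.9267, i.e. --target-bpw 2.9267 for a ~23.85 GiB 70B model
```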

Disadvantages

  1. Quantization process is significantly slower than standard

    • This approach can take 5x-10x longer, as it quantizes a sample of most tensors into 15 different formats, dequantizes them back to floats, computes error diffs, and selects the best size/error option that fits the global bpw budget (see the sketch after this list).

    • However, the --keep-bpw-state option saves the above computations to disk so that future quantizations within the permissible bpw range for the same model can be generated at normal speed. It also allows the computation to be interrupted and resumed at a later time.

  2. The optimization target is only a proxy for the model's actual quality

    • The process minimizes a per-tensor estimated error computed from sampled rows, not actual perplexity or divergence of output distributions (a future version may address this). Since errors interact nonlinearly across layers, there are no guarantees it will select the best possible quantization recipe subject to the bpw size constraint.

    • Furthermore, the process can operate in two modes: giving priority to important tensors (the default) or treating each tensor equally (the --no-importance option). To my knowledge, there is no computationally feasible way to determine ahead of time which mode will yield better results, so two runs per model may be needed to obtain the best quality, although the default mode usually wins.

  3. An imatrix with activations data is required for best results

    • Activation data is required to compute the bias factor (i.e. the systematic error projected onto activation directions). If the imatrix file does not contain activation data, the quantization recipe will likely be sub-optimal.
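
To illustrate the budget-constrained selection described under point 1 of the disadvantages, here is one simple way to think about the search: gather (bpw, error) candidates for every tensor, start from the smallest type, and greedily spend the remaining bit budget wherever it buys the largest error reduction per extra bit. This is a simplified sketch with made-up tensor names and error numbers, not the actual llama-quantize algorithm (which evaluates 15 formats on sampled rows and also handles compatibility constraints):

```python
def allocate_bits(tensor_sizes, candidates, target_bpw):
    """tensor_sizes: {name: n_weights}
    candidates:   {name: [(bpw, error), ...]} sorted by strictly ascending bpw
    Returns the chosen bpw per tensor under the global bit budget."""
    total_weights = sum(tensor_sizes.values())
    budget = target_bpw * total_weights                       # global bit budget
    choice = {name: 0 for name in tensor_sizes}               # index into each candidate list
    spent = sum(candidates[n][0][0] * tensor_sizes[n] for n in tensor_sizes)
    while True:
        best_name, best_gain, best_cost = None, 0.0, 0.0
        for name, idx in choice.items():
            if idx + 1 >= len(candidates[name]):
                continue                                      # no higher-precision candidate left
            extra_bits = (candidates[name][idx + 1][0] - candidates[name][idx][0]) * tensor_sizes[name]
            err_drop = candidates[name][idx][1] - candidates[name][idx + 1][1]
            if spent + extra_bits <= budget and err_drop / extra_bits > best_gain:
                best_name, best_gain, best_cost = name, err_drop / extra_bits, extra_bits
        if best_name is None:                                 # budget exhausted or nothing left to upgrade
            break
        choice[best_name] += 1
        spent += best_cost
    return {name: candidates[name][idx][0] for name, idx in choice.items()}

# Toy usage: two tensors, identical (hypothetical) candidate curves, 4.0 bpw target;
# the average lands just under the target, with more precision where it pays off most per bit.
sizes = {"blk.0.attn_v.weight": 4_000_000, "blk.0.ffn_down.weight": 11_000_000}
cands = {n: [(2.56, 9.0), (3.44, 4.0), (4.56, 1.5), (5.50, 0.8)] for n in sizes}
print(allocate_bits(sizes, cands, target_bpw=4.0))
```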

Models

Bits per weight, size, perplexity and KL Divergence scores

| Model | BPW | Size (GB) | μPPL | 𝜌PPL | μKLD | Same Top-P (%) |
|:------|----:|----------:|-----:|-----:|-----:|---------------:|
| Olmo-3-7B-Think-F16 | 16.0012 | 14.6 | 11.162605 ±0.086283 | 100% | N/A | N/A |
| Olmo-3-7B-Think-IQ1_L | 1.7498 | 1.6 | 57.114920 ±0.481183 | 71.71% | 1.932969 ±0.004000 | 42.107 ± 0.130 |
| Olmo-3-7B-Think-IQ2_S | 2.2496 | 2.1 | 17.851117 ±0.138804 | 88.82% | 0.656234 ±0.002024 | 64.906 ± 0.126 |
| Olmo-3-7B-Think-IQ2_XS | 2.1246 | 1.9 | 20.098228 ±0.158277 | 86.41% | 0.801333 ±0.002343 | 61.656 ± 0.128 |
| Olmo-3-7B-Think-IQ2_XXS | 1.9998 | 1.8 | 22.071794 ±0.175606 | 84.92% | 0.904556 ±0.002543 | 58.919 ± 0.130 |
| Olmo-3-7B-Think-IQ3_XXS | 2.9996 | 2.7 | 12.856732 ±0.098102 | 95.66% | 0.248853 ±0.000951 | 77.616 ± 0.110 |
| Olmo-3-7B-Think-Q2_K | 2.4997 | 2.3 | 15.812446 ±0.123183 | 91.36% | 0.512201 ±0.001667 | 67.693 ± 0.123 |
| Olmo-3-7B-Think-Q3_K_L | 3.7498 | 3.4 | 11.734804 ±0.090876 | 98.67% | 0.076071 ±0.000331 | 86.811 ± 0.089 |
| Olmo-3-7B-Think-Q3_K_S | 3.2500 | 3.0 | 12.505825 ±0.099056 | 97.07% | 0.164489 ±0.000662 | 81.113 ± 0.103 |
| Olmo-3-7B-Think-Q3_K | 3.4995 | 3.2 | 11.810130 ±0.091388 | 98.13% | 0.108321 ±0.000461 | 84.766 ± 0.095 |
| Olmo-3-7B-Think-Q4_K_S | 4.2497 | 3.9 | 11.365788 ±0.088115 | 99.30% | 0.038218 ±0.000191 | 90.558 ± 0.077 |
| Olmo-3-7B-Think-Q4_K | 4.4999 | 4.1 | 11.320928 ±0.087842 | 99.57% | 0.022840 ±0.000119 | 92.493 ± 0.069 |
| Olmo-3-7B-Think-Q4_K_M-bartowski | 4.8978 | 4.5 | 11.546397 ±0.088682 | 99.01% | 0.055111 ±0.000236 | 88.594 ± 0.084 |
| Olmo-3-7B-Think-Q4_K_M-unsloth | 4.8978 | 4.5 | 11.552669 ±0.088685 | 98.99% | 0.055783 ±0.000242 | 88.543 ± 0.084 |
| Olmo-3-7B-Think-Q4_K_M-bpw | 4.8974 | 4.5 | 11.271116 ±0.087255 | 99.69% | 0.016039 ±0.000093 | 93.809 ± 0.064 |
| Olmo-3-7B-Think-Q5_K_S | 5.2495 | 4.8 | 11.247949 ±0.087097 | 99.78% | 0.011865 ±0.000061 | 94.565 ± 0.060 |
| Olmo-3-7B-Think-Q5_K | 5.4995 | 5.0 | 11.214191 ±0.086836 | 99.84% | 0.007804 ±0.000044 | 95.578 ± 0.054 |
| Olmo-3-7B-Think-Q6_K | 6.4992 | 5.9 | 11.193815 ±0.086752 | 99.92% | 0.003559 ±0.000025 | 97.031 ± 0.045 |
| Olmo-3-7B-Think-Q8_0 | 8.4990 | 7.8 | 11.173449 ±0.086545 | 99.97% | 0.000384 ±0.000005 | 98.997 ± 0.026 |

ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores

Scores generated using llama-perplexity with 750 tasks per test, and a context size of 768 tokens.

For the test data used in the generation of these scores, follow the appropriate links: HellaSwag, ARC, MMLU, Truthful QA and WinoGrande

| Model | ARC | HellaSwag | MMLU | Truthful QA | WinoGrande | Avg Score |
|:------|----:|----------:|-----:|------------:|-----------:|----------:|
| Olmo-3-7B-Think-IQ1_L | 47.3333 | 49.6000 | 28.1333 | 28.1333 | 58.0000 | 42.2400 |
| Olmo-3-7B-Think-IQ2_S | 58.8000 | 62.6666 | 33.0667 | 29.4667 | 62.6667 | 49.3333 |
| Olmo-3-7B-Think-IQ2_XS | 57.7333 | 59.7333 | 32.1333 | 30.2667 | 61.7333 | 48.3200 |
| Olmo-3-7B-Think-IQ2_XXS | 57.7333 | 57.8666 | 31.8667 | 28.9333 | 62.2667 | 47.7333 |
| Olmo-3-7B-Think-IQ3_XXS | 63.6000 | 68.6666 | 34.0000 | 31.4667 | 65.6000 | 52.6667 |
| Olmo-3-7B-Think-Q2_K | 61.7333 | 66.0000 | 36.5333 | 32.4000 | 64.4000 | 52.2133 |
| Olmo-3-7B-Think-Q3_K_L | 65.7333 | 73.7333 | 35.8667 | 31.4667 | 64.9333 | 54.3467 |
| Olmo-3-7B-Think-Q3_K_S | 65.2000 | 72.2666 | 35.7333 | 31.4667 | 67.6000 | 54.4533 |
| Olmo-3-7B-Think-Q3_K | 64.9333 | 72.4000 | 34.8000 | 31.8667 | 67.3333 | 54.2667 |
| Olmo-3-7B-Think-Q4_K_S | 65.6000 | 73.8666 | 35.6000 | 31.6000 | 68.0000 | 54.9333 |
| Olmo-3-7B-Think-Q4_K | 65.7333 | 73.4666 | 35.7333 | 30.9333 | 68.4000 | 54.8533 |
| Olmo-3-7B-Think-Q4_K_M-bartowski | 66.1333 | 74.2666 | 35.3333 | 31.7333 | 68.0000 | 55.0933 |
| Olmo-3-7B-Think-Q4_K_M-unsloth | 66.8000 | 74.0000 | 35.2000 | 32.6667 | 67.7333 | 55.2800 |
| Olmo-3-7B-Think-Q4_K_M-bpw | 66.4000 | 74.2666 | 36.0000 | 31.7333 | 67.8667 | 55.2533 |
| Olmo-3-7B-Think-Q5_K_S | 66.2667 | 74.5333 | 36.0000 | 31.6000 | 68.1333 | 55.3067 |
| Olmo-3-7B-Think-Q5_K | 66.5333 | 74.4000 | 36.0000 | 32.1333 | 68.4000 | 55.4933 |
| Olmo-3-7B-Think-Q6_K | 66.8000 | 74.4000 | 36.5333 | 32.6667 | 68.5333 | 55.7867 |
| Olmo-3-7B-Think-Q8_0 | 66.5333 | 74.1333 | 36.1333 | 32.5333 | 68.9333 | 55.6533 |

Tokens per second benchmarks

Scores generated using llama-bench. Standard (llama-quantize with no optimization) Q4_K_M quantization included for comparison.

| model | size | params | backend | threads | test | t/s |
|:------|-----:|-------:|:--------|--------:|:-----|----:|
| Olmo-3-7B-Think-Q4_K_M-bpw | 4.16 GiB | 7.30 B | Metal,BLAS | 12 | pp512 | 915.58 ± 4.27 |
| Olmo-3-7B-Think-Q4_K_M-bpw | 4.16 GiB | 7.30 B | Metal,BLAS | 12 | tg128 | 75.81 ± 0.17 |
| Olmo-3-7B-Think-Q4_K_M-bpw | 4.16 GiB | 7.30 B | Metal,BLAS | 12 | pp1024+tg1024 | 114.24 ± 2.12 |
| Olmo-3-7B-Think-Q4_K_M-bartowski | 4.16 GiB | 7.30 B | Metal,BLAS | 12 | pp512 | 896.12 ± 11.08 |
| Olmo-3-7B-Think-Q4_K_M-bartowski | 4.16 GiB | 7.30 B | Metal,BLAS | 12 | tg128 | 85.38 ± 0.41 |
| Olmo-3-7B-Think-Q4_K_M-bartowski | 4.16 GiB | 7.30 B | Metal,BLAS | 12 | pp1024+tg1024 | 129.36 ± 0.74 |
| Olmo-3-7B-Think-Q4_K_M-unsloth | 4.16 GiB | 7.30 B | Metal,BLAS | 12 | pp512 | 930.46 ± 1.38 |
| Olmo-3-7B-Think-Q4_K_M-unsloth | 4.16 GiB | 7.30 B | Metal,BLAS | 12 | tg128 | 85.49 ± 0.82 |
| Olmo-3-7B-Think-Q4_K_M-unsloth | 4.16 GiB | 7.30 B | Metal,BLAS | 12 | pp1024+tg1024 | 129.24 ± 0.90 |

Metrics used

Perplexity: one of the key metrics used in NLP evaluation. It measures the quality of a language model by evaluating how well it predicts the next token given a particular sequence of words. A PPL of 1 indicates an exact match between predicted and actual, whereas values greater than one indicate the degree of "surprise" when the generated token differs from the expected one.
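
A minimal sketch of the calculation (the standard definition, not anything specific to llama-perplexity): perplexity is the exponential of the average negative log-likelihood the model assigns to each actual next token.

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: ln p(token_i | context_i) for every scored token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([math.log(1.0)] * 4))              # 1.0: every token predicted with certainty
print(round(perplexity([math.log(0.25)] * 4), 1))   # 4.0: each token one of four equally likely options
```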

Kullback–Leibler (KL) Divergence: a statistical measure of how much one probability distribution differs from another. When quantizing models (or altering the original tensors in any way, for that matter), the closer the quantized model's output probability distribution stays to the original's, the better; thus, the closer the KL divergence is to 0, the better.
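
A similarly minimal sketch (the textbook definition): the KL divergence between the baseline (P) and quantized (Q) next-token distributions at one position; the μKLD reported above is the mean of this value over all scored tokens.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions over the same vocabulary."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

print(kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))             # 0.0: identical distributions
print(round(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]), 4))   # ≈ 0.0851: distributions diverge
```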

AI2 Reasoning Challenge (ARC): a benchmark to evaluate the ability of AI models to answer complex science questions that require logical reasoning beyond pattern matching.

HellaSwag: the Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations (bit of a mouthful!) is a benchmark designed to test commonsense natural language inference. It requires the model to predict the most likely ending of a sentence.

MMLU: the Massive Multitask Language Understanding benchmark evaluates LLMs’ general knowledge and problem-solving abilities across 57 subjects, including elementary mathematics, US history, computer science, and law.

Truthful QA: evaluates how well LLMs generate truthful responses to questions. It identifies whether AI models can avoid generating false or misleading information, particularly in areas where human knowledge is prone to misconceptions.

WinoGrande: based on the Winograd Schema Challenge, this is a natural language understanding task requiring models to resolve ambiguities in sentences involving pronoun references.

Credits

LLaMA C++ has a large and vibrant community of contributors (~1,200 last time I checked) that actively maintains and extends its functionality, adding new models and architectures almost as fast as they appear. Considering the breakneck speed at which the AI/ML field is advancing, this alone is a remarkable feat!

While I'm grateful to all contributors, I want to recognise three in particular:

  • Colin Kealty, for the many contributions and for being one of the best sources of high quality quantized models available on Hugging Face
  • Georgi Gerganov for his amazing work with llama.cpp and the ggml/gguf libraries
  • Iwan Kawrakow for being one of the key authors behind the many quantization algorithms and the imatrix functionality.