Experimental global target bits‑per‑weight quantization of ServiceNow-AI/Apriel-1.6-15b-Thinker

Using a non-standard (forked) llama.cpp build (release b7520) for quantization.

Original model: ServiceNow-AI/Apriel-1.6-15b-Thinker

From the original model creators:

Summary

Apriel-1.6-15B-Thinker is an updated multimodal reasoning model in ServiceNow’s Apriel SLM series, building on Apriel-1.5-15B-Thinker. With significantly improved text and image reasoning capabilities, Apriel-1.6 achieves competitive performance against models up to 10x its size. Like its predecessor, it benefits from extensive continual pre-training across both text and image domains. We additionally perform post-training that focuses on Supervised Finetuning (SFT) and Reinforcement Learning (RL). Apriel-1.6 obtains frontier performance without sacrificing reasoning token efficiency. The model improves or maintains task performance when compared with Apriel-1.5-15B-Thinker, while reducing reasoning token usage by more than 30%.

Highlights

  • Achieves a score of 57 on the Artificial Analysis index, outperforming models like Gemini 2.5 Flash, Claude Haiku 4.5 and GPT OSS 20b. It obtains a score on par with Qwen3 235B A22B, while being significantly more efficient.
  • Reduces reasoning token usage by more than 30%, delivering significantly better efficiency than Apriel-1.5-15B-Thinker.
  • Scores 69 on Tau2 Bench Telecom and 69 on IFBench, which are key benchmarks for the enterprise domain.
  • At 15B parameters, the model fits on a single GPU, making it highly memory-efficient.
  • Based on community feedback on Apriel-1.5-15B-Thinker, we simplified the chat template by removing redundant tags and introduced four special tokens to the tokenizer (<tool_calls>, </tool_calls>, [BEGIN FINAL RESPONSE], <|end|>) for easier output parsing.

Please see our blog post for more details.

⚠️ PLEASE READ THIS BEFORE USING THESE EXPERIMENTAL VERSIONS! ⚠️

An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware: desktops, laptops, mobile devices, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but my focus has been primarily on quantization and pruning.

The method to produce these experimental versions involves using a custom version of llama-imatrix to generate an imatrix that includes the mean activations, and a custom version of llama-quantize, which computes a per-tensor weighted mean squared quantization error and a bias/projection term (if the imatrix includes activations), to automatically select the lowest-error quantization recipe that achieves a global target bits‑per‑weight (bpw). More details on the implementation and test results are available here.

There are two pull requests (#14891 & #15550) to merge these changes back into the core llama.cpp project. This may or may not ever happen, so until then the modified versions will be available on GitHub.

For testing and comparison, I use models produced by Bartowski (see credits below) and Unsloth (Daniel and Michael Han do some really interesting stuff!), but when they don't provide versions of the required model, tests and comparisons are against standard quantizations obtained by simply running llama-quantize with no further optimizations.

All experimental versions were generated using an appropriate imatrix created from datasets available at eaddario/imatrix-calibration. In llama.cpp, an imatrix is a calibration file derived from running representative text through the model and collecting activation statistics. It is used to weight quantization error so that error in more “important” directions (as estimated from activations) is penalized more heavily.
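
For illustration, generating such an imatrix might look roughly like the sketch below. The tool and flag names are the mainline llama-imatrix ones, the calibration file name is a placeholder for one of the eaddario/imatrix-calibration datasets, and the forked build may expose additional options for capturing the mean activations:

```bash
# Hedged sketch: build an imatrix from a calibration dataset (file names are placeholders)
./llama-imatrix \
    -m Apriel-1.6-15b-Thinker-F16.gguf \
    -f calibration_dataset.txt \
    -o imatrix.gguf \
    --ctx-size 512
```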

The process to generate these models is roughly as follows:

  1. Convert the original model's safetensors to GGUF F16*
  2. Estimate the Perplexity score for the F16 model (baseline) using the wikitext-2-raw-v1 dataset, and save the logits
  3. Generate an imatrix from the most appropriate calibration dataset
  4. Quantize the baseline model targeting a bpw average, allocating more bits to tensors estimated to matter more (e.g. llama-quantize --target-bpw 4.5678 --keep-bpw-state --imatrix imatrix.gguf baseline-model-F16.gguf 12)
  5. Quantize the baseline model targeting a bpw average, treating each tensor equally instead of prioritizing some (e.g. llama-quantize --target-bpw 4.5678 --no-importance --keep-bpw-state --imatrix imatrix.gguf baseline-model-F16.gguf 12)
  6. Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), HellaSwag, MMLU, Truthful QA and WinoGrande scores for each quantized model
  7. Keep the version with the best 𝜌PPL score (i.e. the highest Cor(ln(PPL(Q)), ln(PPL(base))))
  8. Repeat until all desired quants are created

*BF16 would be preferred, but F16 performs better on Apple's GPUs
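
A minimal sketch of steps 1, 2 and 6, assuming the standard llama.cpp tool names and flags (model paths and file names are placeholders):

```bash
# 1. Convert the original safetensors checkpoint (local directory) to GGUF F16
python convert_hf_to_gguf.py ./Apriel-1.6-15b-Thinker \
    --outtype f16 --outfile Apriel-1.6-15b-Thinker-F16.gguf

# 2. Baseline perplexity on wikitext-2-raw-v1, saving the logits for later comparisons
./llama-perplexity -m Apriel-1.6-15b-Thinker-F16.gguf \
    -f wikitext-2-raw/wiki.test.raw --kl-divergence-base baseline-logits.bin

# 6. Perplexity and KL Divergence for a quantized model, scored against the saved logits
./llama-perplexity -m Apriel-1.6-15b-Thinker-Q4_K.gguf \
    --kl-divergence-base baseline-logits.bin --kl-divergence
```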

Advantages and disadvantages of the global target bits‑per‑weight quantization process

Advantages

  1. Target arbitrary size models

    • When specifying --target-bpw 4.5678, for instance, the algorithm will produce a model of (nearly) exactly that size, which is very useful for maximizing VRAM usage. On a system with 24GB of VRAM and a 70B model, standard quants might produce a 16.8GB file (too small, quality left on the table) or a 24.1GB file (won't fit). This approach can generate a 23.85GB file that uses the hardware fully (see the size/bpw arithmetic sketch after this list).
  2. Data-driven mixed precision often can improve quality at fixed size

    • Instead of using hardcoded heuristics (e.g. make attn_v Q5_K for a 70B model) that may be sub‑optimal for a given architecture or size, the quantization mix is determined by the actual error sensitivity of the specific model's weights. In practice, this often yields a better quality/size trade-off, especially in aggressive quantization scenarios (1.5 to 3.5 bpw) or for unusual architectures.

    • Please note: llama.cpp’s heuristics have been tuned across many models and are highly optimized; although the target-bpw method often produces better quality (in >75% of cases, based on tests with 130 models from 11 different families), it can also lose in surprising cases.

  3. Allows better like-for-like comparisons between models and families

    • Standard llama.cpp quantization uses hardcoded rules like: "use Q4_K_M, except bump some tensors up/down, except fall back if incompatible, except keep some tensors unquantized..." and for that reason, two different models quantized with the same Q4_K_M type can end up with very different bpw (e.g. 4.75 and 4.30).

    • All things being equal, a model's performance is usually proportional to its overall bpw: models with a higher bpw tend to perform better than lower-bpw models. The model that has simply been given more bits will typically perform better (lower perplexity, better eval scores, etc.) even if the underlying quantization method is identical. That makes the comparison an uncontrolled experiment, because it is between models with different effective compression ratios.

    • --target-bpw tries to address that by making the experiment more controlled: each model gets quantized to land on (approximately) the same global byte budget, so that performance differences are more attributable to architecture/training differences, quantization-error behaviour at the same compression ratio, the optimizer's allocation decisions, etc.
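
Since file size scales roughly linearly with bpw, a suitable --target-bpw value can be estimated directly from the desired file size via size_bytes ≈ n_params × bpw / 8. A back-of-the-envelope calculation for a 70B model and a freely chosen ~23.5 GiB target (illustrative only; it ignores GGUF metadata overhead and any tensors kept at higher precision):

```bash
# bpw needed for a ~23.5 GiB file from a 70B-parameter model (illustrative only)
python3 -c 'params = 70e9; target_gib = 23.5; print(round(target_gib * 1024**3 * 8 / params, 4))'
# -> 2.8838
```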

Disadvantages

  1. Quantization process is significantly slower than standard

    • This approach can take 5x-10x longer, as it quantizes a sample of most tensors into 15 different formats, dequantizes them back to floats, computes the error differences, and selects the best size/error option that fits the global bpw budget.

    • However, the --keep-bpw-state option saves the above computations to disk so that future quantizations, within the permissible bpw range for the same model, can be generated at normal speed. It also allows the computation to be interrupted and resumed at a later time (see the example after this list).

  2. The optimization target is only a proxy for the model's performance quality

    • The process minimizes a per-tensor estimated error computed from sampled rows, not actual perplexity or the divergence of output distributions (a future version may address this). Since errors interact nonlinearly across layers, there is no guarantee it will select the best possible quantization recipe subject to the bpw size constraint.

    • Furthermore, the process can operate in two modes: giving priority to important tensors (the default) or treating every tensor equally (the --no-importance option). To my knowledge, there is no computationally feasible way to determine ahead of time which mode will yield better results, so two runs per model may be needed to obtain the best quality, although the default mode usually wins.

  3. An imatrix with activations data is required for best results

    • Activation data is required to compute the bias factor (i.e. the systematic error projected onto activation directions). If the imatrix file does not contain activation data, the quantization recipe will likely be sub-optimal.
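
As an example of the state reuse mentioned above, a plausible workflow is to pay the full cost once and then derive additional targets from the cached estimates (flags as described in this card; the exact state-file handling is specific to the forked llama-quantize):

```bash
# First run: slow, computes and caches the per-tensor error/size estimates
./llama-quantize --target-bpw 4.5000 --keep-bpw-state --imatrix imatrix.gguf \
    Apriel-1.6-15b-Thinker-F16.gguf 12

# Subsequent runs with other targets should reuse the saved state and run at normal speed
./llama-quantize --target-bpw 3.2500 --keep-bpw-state --imatrix imatrix.gguf \
    Apriel-1.6-15b-Thinker-F16.gguf 12
```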

Models

Bits per weight, size, perplexity and KL Divergence scores

| Model | BPW | Size (GB) | μPPL | 𝜌PPL | μKLD | Same Top-P (%) |
|-------|-----|-----------|------|------|------|----------------|
| Apriel-1.6-15b-Thinker-F16 | 16.0006 | 28.9 | 10.328827 ±0.082711 | 100% | N/A | N/A |
| Apriel-1.6-15b-Thinker-IQ1_L | 1.7500 | 3.2 | 56.217779 ±0.451822 | 70.15% | 2.134610 ±0.004486 | 43.765 ±0.127 |
| Apriel-1.6-15b-Thinker-IQ2_S | 2.2500 | 4.1 | 17.718732 ±0.143206 | 85.86% | 0.821039 ±0.002826 | 64.366 ±0.123 |
| Apriel-1.6-15b-Thinker-IQ2_XS | 2.1249 | 3.8 | 22.329859 ±0.168811 | 81.92% | 1.145454 ±0.003109 | 59.614 ±0.126 |
| Apriel-1.6-15b-Thinker-IQ2_XXS | 2.0000 | 3.2 | 29.746083 ±0.227941 | 77.87% | 1.478655 ±0.003549 | 54.132 ±0.128 |
| Apriel-1.6-15b-Thinker-IQ3_XXS | 3.0000 | 5.4 | 11.614368 ±0.090793 | 94.25% | 0.316263 ±0.001484 | 77.862 ±0.107 |
| Apriel-1.6-15b-Thinker-Q2_K | 2.5000 | 4.5 | 14.935966 ±0.121114 | 89.08% | 0.613853 ±0.002378 | 69.725 ±0.118 |
| Apriel-1.6-15b-Thinker-Q3_K_L | 3.7499 | 6.8 | 10.764226 ±0.085149 | 97.83% | 0.120406 ±0.000724 | 86.182 ±0.089 |
| Apriel-1.6-15b-Thinker-Q3_K_S | 3.2499 | 5.9 | 11.456100 ±0.093266 | 96.44% | 0.200764 ±0.001105 | 82.493 ±0.098 |
| Apriel-1.6-15b-Thinker-Q3_K | 3.5000 | 6.3 | 10.928427 ±0.088056 | 97.26% | 0.150377 ±0.000903 | 84.977 ±0.092 |
| Apriel-1.6-15b-Thinker-Q4_K_S | 4.2498 | 7.7 | 10.510888 ±0.084111 | 98.95% | 0.055504 ±0.000500 | 90.906 ±0.074 |
| Apriel-1.6-15b-Thinker-Q4_K | 4.5000 | 8.1 | 10.476377 ±0.084179 | 99.25% | 0.038836 ±0.000384 | 92.373 ±0.068 |
| Apriel-1.6-15b-Thinker-Q4_K_M-bartowski | 4.8667 | 8.8 | 10.385404 ±0.083338 | 99.37% | 0.032732 ±0.000359 | 93.084 ±0.065 |
| Apriel-1.6-15b-Thinker-Q4_K_M-bpw | 4.8666 | 8.8 | 10.447756 ±0.083977 | 99.45% | 0.028314 ±0.000313 | 93.513 ±0.063 |
| Apriel-1.6-15b-Thinker-Q5_K_S | 5.2500 | 9.5 | 10.407479 ±0.083697 | 99.64% | 0.018810 ±0.000252 | 94.688 ±0.058 |
| Apriel-1.6-15b-Thinker-Q5_K | 5.5000 | 9.9 | 10.411958 ±0.083741 | 99.74% | 0.012866 ±0.000182 | 95.449 ±0.054 |
| Apriel-1.6-15b-Thinker-Q6_K | 6.4998 | 11.7 | 10.356608 ±0.083190 | 99.89% | 0.004796 ±0.000159 | 97.435 ±0.041 |
| Apriel-1.6-15b-Thinker-Q8_0 | 8.4998 | 15.3 | 10.354961 ±0.083173 | 99.96% | 0.000659 ±0.000027 | 99.072 ±0.025 |

ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores

Scores generated using llama-perplexity with 750 tasks per test, and a context size of 768 tokens.

For the test data used in the generation of these scores, follow the appropriate links: HellaSwag, ARC, MMLU, Truthful QA and WinoGrande
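
For reference, a single score run might look roughly like the sketch below (HellaSwag shown; file names are placeholders, and ARC, MMLU, Truthful QA and WinoGrande use their corresponding llama-perplexity task options and data files):

```bash
# Hedged example: HellaSwag with 750 tasks and a 768-token context
./llama-perplexity -m Apriel-1.6-15b-Thinker-Q4_K.gguf \
    -f hellaswag-validation.txt --hellaswag --hellaswag-tasks 750 -c 768
```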

| Model | ARC | HellaSwag | MMLU | Truthful QA | WinoGrande | Avg Score |
|-------|-----|-----------|------|-------------|------------|-----------|
| Apriel-1.6-15b-Thinker-IQ1_L | 35.6000 | 42.1333 | 26.8000 | 29.4667 | 52.9333 | 37.39 |
| Apriel-1.6-15b-Thinker-IQ2_S | 52.8000 | 60.5333 | 35.6000 | 28.8000 | 56.8000 | 46.91 |
| Apriel-1.6-15b-Thinker-IQ2_XS | 40.9333 | 54.8000 | 30.1333 | 27.6000 | 58.4000 | 42.37 |
| Apriel-1.6-15b-Thinker-IQ2_XXS | 40.0000 | 50.5333 | 30.2667 | 25.7333 | 55.7333 | 40.45 |
| Apriel-1.6-15b-Thinker-IQ3_XXS | 58.1333 | 68.6666 | 33.7333 | 28.2667 | 67.0667 | 51.17 |
| Apriel-1.6-15b-Thinker-Q2_K | 49.8667 | 59.8666 | 34.5333 | 28.2667 | 62.2667 | 46.96 |
| Apriel-1.6-15b-Thinker-Q3_K_L | 58.5333 | 71.0666 | 38.1333 | 31.8667 | 65.3333 | 52.99 |
| Apriel-1.6-15b-Thinker-Q3_K_S | 60.6667 | 69.2000 | 37.0667 | 29.0667 | 66.5333 | 52.51 |
| Apriel-1.6-15b-Thinker-Q3_K | 62.1333 | 69.7333 | 37.6000 | 32.5333 | 65.2000 | 53.44 |
| Apriel-1.6-15b-Thinker-Q4_K_S | 59.3333 | 69.8666 | 37.7333 | 30.5333 | 66.9333 | 52.88 |
| Apriel-1.6-15b-Thinker-Q4_K | 58.2667 | 71.3333 | 37.2000 | 30.8000 | 68.4000 | 53.20 |
| Apriel-1.6-15b-Thinker-Q4_K_M-bartowski | 62.0000 | 71.3333 | 37.3333 | 31.3333 | 67.4667 | 53.89 |
| Apriel-1.6-15b-Thinker-Q4_K_M-bpw | 58.9333 | 71.0666 | 38.2667 | 30.8000 | 67.6000 | 53.33 |
| Apriel-1.6-15b-Thinker-Q5_K_S | 61.0667 | 70.5333 | 38.0000 | 30.9333 | 68.2667 | 53.76 |
| Apriel-1.6-15b-Thinker-Q5_K | 62.5333 | 70.5333 | 37.7333 | 30.6667 | 66.9333 | 53.68 |
| Apriel-1.6-15b-Thinker-Q6_K | 61.6000 | 71.6000 | 37.7333 | 31.0667 | 65.2000 | 53.44 |
| Apriel-1.6-15b-Thinker-Q8_0 | 61.7333 | 70.4000 | 38.0000 | 30.9333 | 66.8000 | 53.57 |

Tokens per second benchmarks

Scores generated using llama-bench. Standard (llama-quantize with no optimization) Q4_K_M quantization included for comparison.
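
The runs behind the table were produced with llama-bench; a hedged sketch of an equivalent invocation (file name is a placeholder, flags are the mainline llama-bench ones):

```bash
# pp512, tg128 and combined pp1024+tg1024 tests on 12 threads
./llama-bench -m Apriel-1.6-15b-Thinker-Q4_K_M-bpw.gguf \
    -t 12 -p 512 -n 128 -pg 1024,1024
```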

| model | size | params | backend | threads | test | t/s |
|-------|------|--------|---------|---------|------|-----|
| Apriel-1.6-15b-Thinker-Q4_K_M-bpw | 8.17 GiB | 14.43 B | Metal,BLAS | 12 | pp512 | 467.49 ±3.38 |
| Apriel-1.6-15b-Thinker-Q4_K_M-bpw | 8.17 GiB | 14.43 B | Metal,BLAS | 12 | tg128 | 38.30 ±1.34 |
| Apriel-1.6-15b-Thinker-Q4_K_M-bpw | 8.17 GiB | 14.43 B | Metal,BLAS | 12 | pp1024+tg1024 | 61.10 ±2.58 |
| Apriel-1.6-15b-Thinker-Q4_K_M-bartowski | 8.17 GiB | 14.43 B | Metal,BLAS | 12 | pp512 | 464.85 ±14.30 |
| Apriel-1.6-15b-Thinker-Q4_K_M-bartowski | 8.17 GiB | 14.43 B | Metal,BLAS | 12 | tg128 | 43.63 ±1.79 |
| Apriel-1.6-15b-Thinker-Q4_K_M-bartowski | 8.17 GiB | 14.43 B | Metal,BLAS | 12 | pp1024+tg1024 | 71.94 ±0.78 |

Metrics used

Perplexity: one of the key metrics used in NLP evaluation. It measures the quality of a language model by evaluating how well it predicts the next token given a particular sequence of words. A PPL of 1 indicates an exact match between predicted and actual, whereas values greater than one indicate the degree of "surprise" when the generated token differs from the expected one.

Kullback–Leibler (KL) Divergence: a statistical measure of how much one probability distribution differs from another. When quantizing models (or altering the original tensors in any way, for that matter), the closer the quantized model's output distributions stay to the original model's, the better; thus, the closer to 0 the better.
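
For reference, the standard definitions, with p the original (F16) model's token distribution and q the quantized model's:

```latex
% Perplexity over a token sequence x_1 .. x_N
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)

% KL Divergence of the quantized model's distribution q from the reference distribution p,
% averaged over evaluation positions to give the reported μKLD
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}
```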

AI2 Reasoning Challenge (ARC): a benchmark to evaluate the ability of AI models to answer complex science questions that require logical reasoning beyond pattern matching.

HellaSwag: the Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations (bit of a mouthful!) is a benchmark designed to test commonsense natural language inference. It requires the model to predict the most likely ending of a sentence.

MMLU: the Massive Multitask Language Understanding evaluates LLMs’ general knowledge and problem-solving abilities across 57 subjects, including elementary mathematics, US history, computer science, and law.

Truthful QA: evaluates how well LLMs generate truthful responses to questions. It identifies whether AI models can avoid generating false or misleading information, particularly in areas where human knowledge is prone to misconceptions.

Winogrande: based on the Winograd Schema Challenge, it is a natural language understanding task requiring models to resolve ambiguities in sentences involving pronoun references.

Credits

llama.cpp has a large and vibrant community of contributors (~1,200 last time I checked) that actively maintains and extends its functionality, adding new models and architectures almost as fast as they appear. Considering the breakneck speed at which the AI/ML field is advancing, this alone is a remarkable feat!

While I'm grateful to all contributors, I want to recognise three in particular:

  • Colin Kealty, for the many contributions and for being one of the best sources of high quality quantized models available on Hugging Face
  • Georgi Gerganov for his amazing work with llama.cpp and the ggml/gguf libraries
  • Iwan Kawrakow for being one of the key authors behind the many quantization algorithms and the imatrix functionality.