# Quantization tradeoffs

`dlm export --quant <Q>` picks how aggressively the base model gets
compressed on the way out. Smaller files run faster; more aggressive
quantization costs quality. Here's the cheat sheet.

## The quant levels

| Quant | Bits/weight (avg) | Size vs F16 | Notes |
|---|---|---|---|
| `F16` | 16 | 100% | No quantization. Baseline for quality comparisons. |
| `Q8_0` | 8.5 | ~55% | Near-lossless. Still noticeably smaller. |
| `Q6_K` | 6.6 | ~42% | Strong quality, middle-ground size. |
| `Q5_K_M` | 5.7 | ~37% | The "willing to spend disk for quality" default. |
| `Q4_K_M` | 4.8 | ~31% | The recommended starting point. Great quality/size. |

For a 1.5B-parameter base:

- `F16` → ~3.0 GB
- `Q8_0` → ~1.6 GB
- `Q5_K_M` → ~1.1 GB
- `Q4_K_M` → ~0.95 GB

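These sizes follow almost directly from the bits-per-weight column:
parameter count × average bits per weight ÷ 8 gives the tensor bytes,
and the remainder is mostly overhead — GGUF metadata, the tokenizer,
and typically a few tensors kept at higher precision. A back-of-envelope
check of the `Q4_K_M` figure:

```sh
# 1.5e9 params × 4.8 bits/weight ÷ 8 bits/byte ≈ 0.9 GB of tensor data;
# metadata, the tokenizer, and higher-precision tensors make up the rest.
$ awk 'BEGIN { printf "%.2f GB\n", 1.5e9 * 4.8 / 8 / 1e9 }'
0.90 GB
```
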
## When to pick which

**`Q4_K_M` (default)**
: Production recommendation for v1.0. Good quality, fits in a
"normal" amount of RAM/VRAM, fast inference. Start here.

**`Q5_K_M`**
: You have disk to spare and want slightly better generations. The
size bump is modest; the quality bump is noticeable.

**`Q6_K`**
: Willing to trade another ~10% disk for near-`Q8_0` quality. Useful
when A/B testing against full-precision behavior.

**`Q8_0`**
: Baseline for "is the quant regression real?" investigations. If
`Q8_0` also regresses, the bug isn't the quant. (A minimal A/B recipe
follows this list.)

**`F16`**
: Debugging a quant-caused regression, or running on a platform where
the kernels for quantized inference are slower than f16 for some
reason (rare on modern CPUs/GPUs).

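A minimal sketch of that A/B check, reusing the `tutor.dlm` document
from the examples below (how per-quant output artifacts are named is
not shown here):

```sh
# Export the same document at the suspect quant and at Q8_0, then compare
# generations from the two artifacts. If Q8_0 shows the same regression,
# the quantization level is not the culprit.
$ uv run dlm export tutor.dlm --quant Q4_K_M
$ uv run dlm export tutor.dlm --quant Q8_0
```
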
## Imatrix-calibrated quantization

Sprint 11.6 added automatic **importance-matrix** calibration when
your store has enough replay-corpus text. The imatrix tells
`llama-quantize` which weight directions matter most for the model's
behavior on YOUR content — so the low-bit quants preserve the
directions that matter and compress the rest more aggressively.

```sh
$ uv run dlm export tutor.dlm --quant Q4_K_M
# imatrix built from replay/corpus.zst, cached, applied automatically
```

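Under the hood this is the standard llama.cpp calibration flow. A rough
sketch of the equivalent manual steps (file names are illustrative, and
the exact flags `dlm` passes are not documented here):

```sh
# Build an importance matrix from the calibration text, then hand it to
# llama-quantize so the low-bit quant preserves the calibrated directions.
$ llama-imatrix -m base-f16.gguf -f corpus.txt -o imatrix.dat
$ llama-quantize --imatrix imatrix.dat base-f16.gguf base-Q4_K_M.gguf Q4_K_M
```
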
Empirically, imatrix-calibrated `Q4_K_M` is close to static `Q5_K_M`
quality at `Q4_K_M` size. The imatrix is cached per-document so
subsequent `dlm export` runs at the same quant reuse it.

Opt out with `--no-imatrix` if you want a static quant for a
regression comparison.

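For example, a static `Q4_K_M` to hold up against the calibrated one:

```sh
# Skip imatrix calibration so the artifact matches a plain static quant
$ uv run dlm export tutor.dlm --quant Q4_K_M --no-imatrix
```
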
## QLoRA + `--merged` is a safety gate

```sh
$ uv run dlm export tutor.dlm --merged --quant Q4_K_M
export: merge refused: adapter was trained with QLoRA (4-bit base);
merging into a quantized base is precision-unsafe. Re-run
with --dequantize to dequantize to fp16 before merge, or drop
--merged to ship base + adapter separately.
```

The default (base + adapter separate) is fine for almost every use
case — Ollama loads them with `FROM` + `ADAPTER` directives and
merges at inference time. Use `--merged --dequantize` only if you
need a single-file deployment and accept the bigger artifact.

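For reference, the split deployment wires up in an Ollama `Modelfile`
roughly as sketched below (filenames are placeholders, not the names
`dlm export` actually writes), and the single-file path is the
`--merged --dequantize` combination the error message points at:

```sh
# Default: ship base + adapter separately; Ollama applies the adapter at load
$ cat Modelfile
FROM ./tutor-base-Q4_K_M.gguf
ADAPTER ./tutor-adapter.gguf

# Single-file alternative: dequantize the QLoRA base to fp16, merge, then quantize
$ uv run dlm export tutor.dlm --merged --dequantize --quant Q4_K_M
```
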
## See also

- [First export walkthrough](../getting-started/first-export.md) for
  the full flow
- [Determinism](../determinism.md) — the quant tuple participates in
  the `dlm.lock` reproducibility record