
# Quantization tradeoffs

`dlm export --quant <Q>` picks how aggressively the base model gets compressed on the way out. Smaller files run faster; more aggressive quantization costs quality. Here's the cheat sheet.

## The quant levels

| Quant | Bits/weight (avg) | Size vs F16 | Notes |
|---|---|---|---|
| `F16` | 16 | 100% | No quantization. Baseline for quality comparisons. |
| `Q8_0` | 8.5 | ~55% | Near-lossless. Still noticeably smaller. |
| `Q6_K` | 6.6 | ~42% | Strong quality, middle-ground size. |
| `Q5_K_M` | 5.7 | ~37% | The "willing to spend disk for quality" default. |
| `Q4_K_M` | 4.8 | ~31% | The recommended starting point. Great quality/size. |

For a 1.5B-parameter base:

- `F16` → ~3.0 GB
- `Q8_0` → ~1.6 GB
- `Q5_K_M` → ~1.1 GB
- `Q4_K_M` → ~0.95 GB
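
These figures are just the bits-per-weight column applied to the parameter count: weight bytes ≈ parameters × bits / 8. A quick sanity check (plain `awk`, nothing dlm-specific; real files run slightly larger because of metadata and a few tensors kept at higher precision):

```sh
# rough size estimate: params × avg bits-per-weight / 8 bytes
$ awk 'BEGIN { printf "Q4_K_M: %.2f GB\n", 1.5e9 * 4.8 / 8 / 1e9 }'
Q4_K_M: 0.90 GB
```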

## When to pick which

**`Q4_K_M`** (default)
: Production recommendation for v1.0. Good quality, fits in a "normal" amount of RAM/VRAM, fast inference. Start here.

**`Q5_K_M`**
: You have disk to spare and want slightly better generations. The size bump is modest; the quality bump is noticeable.

**`Q6_K`**
: Willing to trade another ~10% disk for near-`Q8_0` quality. Useful when A/B testing against full-precision behavior.

**`Q8_0`**
: Baseline for "is the quant regression real?" investigations. If `Q8_0` also regresses, the bug isn't the quant; see the sketch after this list.

**`F16`**
: Debugging a quant-caused regression, or running on a platform where the kernels for quantized inference are slower than f16 for some reason (rare on modern CPUs/GPUs).
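
A minimal version of that `Q8_0` check, assuming each quant writes its own artifact so both can be loaded side by side:

```sh
# export the suspect quant and the near-lossless baseline, then
# compare generations from both
$ uv run dlm export tutor.dlm --quant Q4_K_M
$ uv run dlm export tutor.dlm --quant Q8_0
```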

## Imatrix-calibrated quantization

Sprint 11.6 added automatic **importance-matrix** calibration when your store has enough replay-corpus text. The imatrix tells `llama-quantize` which weight directions matter most for the model's behavior on YOUR content — so the low-bit quants preserve the directions that matter and compress the rest more aggressively.

```sh
$ uv run dlm export tutor.dlm --quant Q4_K_M
# imatrix built from replay/corpus.zst, cached, applied automatically
```

Empirically, imatrix-calibrated `Q4_K_M` is close to static `Q5_K_M` quality at `Q4_K_M` size. The imatrix is cached per-document, so subsequent `dlm export` runs at the same quant reuse it.

Opt out with `--no-imatrix` if you want a static quant for a regression comparison.
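
One way to set up that comparison, using only the flags shown above (assuming you rename or relocate the artifact between runs):

```sh
# control arm: static quant, no calibration
$ uv run dlm export tutor.dlm --quant Q4_K_M --no-imatrix
# treatment arm: calibrated quant (the default when corpus text exists)
$ uv run dlm export tutor.dlm --quant Q4_K_M
```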

## QLoRA + `--merged` is a safety gate

```sh
$ uv run dlm export tutor.dlm --merged --quant Q4_K_M
export: merge refused: adapter was trained with QLoRA (4-bit base);
        merging into a quantized base is precision-unsafe. Re-run
        with --dequantize to dequantize to fp16 before merge, or drop
        --merged to ship base + adapter separately.
```

The default (base + adapter separate) is fine for almost every use case — Ollama loads them with `FROM` + `ADAPTER` directives and merges at inference time. Use `--merged --dequantize` only if you need a single-file deployment and accept the bigger artifact.
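
As a sketch of the two deployment shapes (the `FROM`/`ADAPTER` Modelfile syntax is standard Ollama; the file names here are hypothetical):

```sh
# default: ship base + adapter separately, let Ollama merge at load time
$ cat > Modelfile <<'EOF'
FROM ./tutor-base-Q4_K_M.gguf
ADAPTER ./tutor-adapter.gguf
EOF
$ ollama create tutor -f Modelfile

# single-file alternative: dequantize to fp16, merge, then re-quantize
$ uv run dlm export tutor.dlm --merged --dequantize --quant Q4_K_M
```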

## See also

- [First export walkthrough](../getting-started/first-export.md) for the full flow
- [Determinism](../determinism.md) — the quant tuple participates in the `dlm.lock` reproducibility record
