
# Quantization tradeoffs

`dlm export --quant <Q>` picks how aggressively the base model gets compressed on the way out. Smaller files run faster; more aggressive quantization costs quality. Here's the cheat sheet.

## The quant levels

| Quant | Bits/weight (avg) | Size vs F16 | Notes |
|---|---|---|---|
| `F16` | 16 | 100% | No quantization. Baseline for quality comparisons. |
| `Q8_0` | 8.5 | ~55% | Near-lossless. Still noticeably smaller. |
| `Q6_K` | 6.6 | ~42% | Strong quality, middle-ground size. |
| `Q5_K_M` | 5.7 | ~37% | The "willing to spend disk for quality" default. |
| `Q4_K_M` | 4.8 | ~31% | The recommended starting point. Great quality/size. |

For a 1.5B-parameter base:

- `F16` → ~3.0 GB
- `Q8_0` → ~1.6 GB
- `Q5_K_M` → ~1.1 GB
- `Q4_K_M` → ~0.95 GB
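
These figures are just the bits-per-weight column applied to the parameter count: weight bytes ≈ parameters × bits / 8. A quick sanity check (plain `awk`, nothing dlm-specific; real files run slightly larger because of metadata and a few tensors kept at higher precision):

```sh
# rough size estimate: params × avg bits-per-weight / 8 bytes
$ awk 'BEGIN { printf "Q4_K_M: %.2f GB\n", 1.5e9 * 4.8 / 8 / 1e9 }'
Q4_K_M: 0.90 GB
```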

## When to pick which

**`Q4_K_M`** (default)
: Production recommendation for v1.0. Good quality, fits in a "normal" amount of RAM/VRAM, fast inference. Start here.

**`Q5_K_M`**
: You have disk to spare and want slightly better generations. The size bump is modest; the quality bump is noticeable.

**`Q6_K`**
: Willing to trade another ~10% disk for near-`Q8_0` quality. Useful when A/B testing against full-precision behavior.

**`Q8_0`**
: Baseline for "is the quant regression real?" investigations. If `Q8_0` also regresses, the bug isn't the quant; see the sketch after this list.

**`F16`**
: Debugging a quant-caused regression, or running on a platform where the kernels for quantized inference are slower than f16 for some reason (rare on modern CPUs/GPUs).
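
A minimal version of that `Q8_0` check, assuming each quant writes its own artifact so both can be loaded side by side:

```sh
# export the suspect quant and the near-lossless baseline, then
# compare generations from both
$ uv run dlm export tutor.dlm --quant Q4_K_M
$ uv run dlm export tutor.dlm --quant Q8_0
```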

## Imatrix-calibrated quantization

Sprint 11.6 added automatic **importance-matrix** calibration when your store has enough replay-corpus text. The imatrix tells `llama-quantize` which weight directions matter most for the model's behavior on YOUR content — so the low-bit quants preserve the directions that matter and compress the rest more aggressively.

```sh
$ uv run dlm export tutor.dlm --quant Q4_K_M
# imatrix built from replay/corpus.zst, cached, applied automatically
```

Empirically, imatrix-calibrated `Q4_K_M` is close to static `Q5_K_M` quality at `Q4_K_M` size. The imatrix is cached per-document, so subsequent `dlm export` runs at the same quant reuse it.

Opt out with `--no-imatrix` if you want a static quant for a regression comparison.
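
One way to set up that comparison, using only the flags shown above (assuming you rename or relocate the artifact between runs):

```sh
# control arm: static quant, no calibration
$ uv run dlm export tutor.dlm --quant Q4_K_M --no-imatrix
# treatment arm: calibrated quant (the default when corpus text exists)
$ uv run dlm export tutor.dlm --quant Q4_K_M
```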

## QLoRA + `--merged` is a safety gate

```sh
$ uv run dlm export tutor.dlm --merged --quant Q4_K_M
export: merge refused: adapter was trained with QLoRA (4-bit base);
        merging into a quantized base is precision-unsafe. Re-run
        with --dequantize to dequantize to fp16 before merge, or drop
        --merged to ship base + adapter separately.
```

The default (base + adapter separate) is fine for almost every use case — Ollama loads them with `FROM` + `ADAPTER` directives and merges at inference time. Use `--merged --dequantize` only if you need a single-file deployment and accept the bigger artifact.
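
As a sketch of the two deployment shapes (the `FROM`/`ADAPTER` Modelfile syntax is standard Ollama; the file names here are hypothetical):

```sh
# default: ship base + adapter separately, let Ollama merge at load time
$ cat > Modelfile <<'EOF'
FROM ./tutor-base-Q4_K_M.gguf
ADAPTER ./tutor-adapter.gguf
EOF
$ ollama create tutor -f Modelfile

# single-file alternative: dequantize to fp16, merge, then re-quantize
$ uv run dlm export tutor.dlm --merged --dequantize --quant Q4_K_M
```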

## See also

- [First export walkthrough](../getting-started/first-export.md) for the full flow
- [Determinism](../determinism.md) — the quant tuple participates in the `dlm.lock` reproducibility record
