# First export

`dlm export` converts the base + adapter into GGUF files, writes a
Modelfile with an explicit Go `text/template` (no fuzzy matching),
registers the model with `ollama create`, and runs a smoke prompt.

That is still the default path, but it is no longer the only one. Sprint 41
also adds local runtime targets such as `llama-server`, `vllm`, and
`mlx-serve`; see the [multi-target export cookbook](../cookbook/multi-target-export.md)
once you want an OpenAI-compatible local server instead of an Ollama model.

## Prerequisites

- `vendor/llama.cpp` submodule is built:

  ```sh
  $ scripts/bump-llama-cpp.sh build
  ```

  This compiles `llama-quantize` and `llama-imatrix` under
  `vendor/llama.cpp/build/bin/`.

- [Ollama](https://ollama.com/) is installed and its daemon is running.
  `dlm doctor` reports the minimum version.

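A quick way to confirm both prerequisites before the first export (output omitted; `dlm doctor` also reports the required Ollama version):

```sh
# Both binaries should exist after the build step above.
$ ls vendor/llama.cpp/build/bin/llama-quantize vendor/llama.cpp/build/bin/llama-imatrix
$ uv run dlm doctor
```
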
## Export

```sh
$ uv run dlm export tutor.dlm --quant Q4_K_M --name my-tutor
export: preflight ok
export: base.Q4_K_M.gguf (47 MiB)
export: adapter.gguf (3 MiB)
export: Modelfile written; ollama create my-tutor:latest
export: smoke: "Hi!" → "Hello! How can I help?"
manifest: exports[-1] recorded at ~/.dlm/store/01KC…/
```

Under the hood:

1. The export **preflight** (Sprint 11) checks the adapter config
   matches the base architecture, asserts the tokenizer vocab agrees
   with the base, validates the chat template, and confirms the
   adapter wasn't QLoRA-trained (pitfall #3 — QLoRA merge needs
   `--dequantize`).
2. The base model is converted to GGUF and quantized via
   `llama-quantize`. The GGUF is cached under
   `~/.dlm/store/<id>/exports/Q4_K_M/base.Q4_K_M.gguf` — subsequent
   exports at the same quant reuse the file.
3. The LoRA adapter is converted to `adapter.gguf`.
4. A `Modelfile` is emitted with `FROM`, `ADAPTER`, and an explicit
   `TEMPLATE "..."` directive (Sprint 12); a sketch of its shape follows
   this list. Ollama will **not** fuzzy-match the template — the exact
   Go template for the base's dialect is committed.
5. `ollama create <name>:latest` registers the model under the Ollama
   daemon's control.
6. A smoke prompt runs; the first line of output is recorded in
   `manifest.exports[-1].smoke_output_first_line`.

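For reference, the emitted `Modelfile` looks roughly like the sketch below. It is illustrative only: the file's location, the relative GGUF paths, and the `TEMPLATE` body all depend on the store layout and the base model's chat dialect.

```sh
# Illustrative shape only; the real TEMPLATE text is the exact Go template
# for the base's dialect, written by the export step.
$ cat Modelfile
FROM ./base.Q4_K_M.gguf
ADAPTER ./adapter.gguf
TEMPLATE """{{ if .System }}{{ .System }}
{{ end }}{{ .Prompt }}"""
```

Once `ollama create` has run, `ollama run my-tutor "Hi!"` reproduces the smoke prompt by hand.
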
## Quant levels

| Quant | Size | Quality | When to use |
|---|---|---|---|
| `Q4_K_M` | ~50% of fp16 | Great default | General-purpose; recommended starting point. |
| `Q5_K_M` | ~60% of fp16 | Higher quality | Willing to trade more disk for fidelity. |
| `Q8_0` | ~100% of int8 | Near-lossless | Baseline for quality comparisons. |
| `F16` | 100% | No quantization | Debugging a quant-caused regression. |

See [Quantization tradeoffs](../cookbook/quantization-tradeoffs.md) for
a deeper dive.

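Switching quant levels is just a flag change. For example, to keep a near-lossless baseline next to the default (the `--name` value here is arbitrary):

```sh
$ uv run dlm export tutor.dlm --quant Q8_0 --name my-tutor-q8
```
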
## imatrix-calibrated quantization

If your store has a replay corpus with enough signal (Sprint 11.6),
the export runner automatically builds an imatrix from it and passes
`--imatrix` to `llama-quantize`. This gives noticeable quality
improvements on `Q4_K_M` and below without changing the API.

Opt out with `--no-imatrix` if you'd rather have a static quant for
comparison.

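Under the hood this is the standard llama.cpp two-step flow. A rough manual equivalent, with illustrative file names, looks like:

```sh
# 1. Build an importance matrix from calibration text.
$ vendor/llama.cpp/build/bin/llama-imatrix -m base.F16.gguf -f replay_corpus.txt -o imatrix.dat
# 2. Quantize with the imatrix applied.
$ vendor/llama.cpp/build/bin/llama-quantize --imatrix imatrix.dat base.F16.gguf base.Q4_K_M.gguf Q4_K_M
```
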
## Just produce GGUFs, skip Ollama

```sh
$ uv run dlm export tutor.dlm --quant Q4_K_M --skip-ollama
```

Useful on CI runners without the Ollama daemon installed. The GGUFs
land in `exports/Q4_K_M/`; wire them into your own runtime.

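For example, if your vendored build also includes `llama-server`, the two GGUFs can be served directly (flag spelling may vary across llama.cpp versions; paths follow the layout above):

```sh
$ vendor/llama.cpp/build/bin/llama-server \
    -m exports/Q4_K_M/base.Q4_K_M.gguf \
    --lora exports/Q4_K_M/adapter.gguf \
    --port 8080
```
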
## Other runtime targets

Once the basic GGUF/Ollama flow is familiar, the same store can export to:

- `--target llama-server` for a vendored `llama.cpp` HTTP server
- `--target vllm` for HF-snapshot + LoRA-module serving
- `--target mlx-serve` for Apple Silicon text serving through `mlx_lm.server`

Those targets have different prerequisites and artifact layouts, so they live
in the [multi-target export cookbook](../cookbook/multi-target-export.md)
instead of this first-run page.

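The invocation shape stays the same; only the target changes (a sketch; per-target flags and prerequisites are documented in the cookbook):

```sh
$ uv run dlm export tutor.dlm --target llama-server --quant Q4_K_M
```
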
## Next

Want to send the whole training history to a friend? The
[Sharing with pack](../cookbook/sharing-with-pack.md) cookbook shows
the `dlm pack` / `dlm unpack` round trip.