# First export
`dlm export` converts the base + adapter into GGUF files, writes a
Modelfile with an explicit Go `text/template` (no fuzzy matching),
registers the model with `ollama create`, and runs a smoke prompt.

That is still the default path, but it is no longer the only one. Sprint 41
also adds local runtime targets such as `llama-server`, `vllm`, and
`mlx-serve`; see the [multi-target export cookbook](../cookbook/multi-target-export.md)
once you want an OpenAI-compatible local server instead of an Ollama model.
## Prerequisites

- `vendor/llama.cpp` submodule is built:
  ```sh
  $ scripts/bump-llama-cpp.sh build
  ```
  This compiles `llama-quantize` and `llama-imatrix` under
  `vendor/llama.cpp/build/bin/`.

- [Ollama](https://ollama.com/) is installed and its daemon is running.
  `dlm doctor` reports the minimum version.
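A quick way to confirm both prerequisites before the first export (the `ls`
check below is only an illustration; `dlm doctor` is the authoritative report):

```sh
# Verify the daemon is reachable and the Ollama version meets the minimum.
$ uv run dlm doctor

# Confirm the vendored llama.cpp binaries were built where the exporter expects them.
$ ls vendor/llama.cpp/build/bin/ | grep -E 'llama-(quantize|imatrix)'
```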
## Export

```sh
$ uv run dlm export tutor.dlm --quant Q4_K_M --name my-tutor
export: preflight ok
export: base.Q4_K_M.gguf (47 MiB)
export: adapter.gguf (3 MiB)
export: Modelfile written; ollama create my-tutor:latest
export: smoke: "Hi!" → "Hello! How can I help?"
manifest: exports[-1] recorded at ~/.dlm/store/01KC…/
```
Under the hood:

1. The export **preflight** (Sprint 11) checks the adapter config
   matches the base architecture, asserts the tokenizer vocab agrees
   with the base, validates the chat template, and confirms the
   adapter wasn't QLoRA-trained (pitfall #3 — QLoRA merge needs
   `--dequantize`).
2. The base model is converted to GGUF and quantized via
   `llama-quantize`. The GGUF is cached under
   `~/.dlm/store/<id>/exports/Q4_K_M/base.Q4_K_M.gguf` — subsequent
   exports at the same quant reuse the file.
3. The LoRA adapter is converted to `adapter.gguf`.
4. An explicit `Modelfile` is emitted with `FROM`, `ADAPTER`, and an
   explicit `TEMPLATE "..."` directive (Sprint 12). Ollama will **not**
   fuzzy-match the template — the exact Go template for the base's
   dialect is committed.
5. `ollama create <name>:latest` registers the model under the Ollama
   daemon's control.
6. A smoke prompt runs; the first line of output is recorded in
   `manifest.exports[-1].smoke_output_first_line`.
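To see the result of steps 4 and 5 for yourself, the standard Ollama CLI can
show the stored Modelfile and run the model interactively (a sketch, assuming
the export above registered `my-tutor:latest`):

```sh
# Inspect the Modelfile Ollama stored at create time; FROM, ADAPTER, and the
# explicit TEMPLATE directive should all be present.
$ ollama show my-tutor:latest --modelfile

# Run a smoke prompt of your own beyond the one recorded in the manifest.
$ ollama run my-tutor:latest "Explain LoRA in one sentence."
```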
## Quant levels

| Quant | Size | Quality | When to use |
|---|---|---|---|
| `Q4_K_M` | ~50% of fp16 | Great default | General-purpose; recommended starting point. |
| `Q5_K_M` | ~60% | Higher quality | Willing to trade more disk for fidelity. |
| `Q8_0` | ~100% of int8 | Near-lossless | Baseline for quality comparisons. |
| `F16` | 100% | No quantization | Debugging a quant-caused regression. |

See [Quantization tradeoffs](../cookbook/quantization-tradeoffs.md) for
a deeper dive.
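To see the disk tradeoff on your own base rather than in the abstract, export
the same store at two quant levels and compare the cached artifacts (the second
model name and the glob are illustrative; `<id>` is the store id as above):

```sh
# Export the same adapter at two quant levels under different Ollama names.
$ uv run dlm export tutor.dlm --quant Q4_K_M --name my-tutor
$ uv run dlm export tutor.dlm --quant Q8_0 --name my-tutor-q8

# Compare the cached GGUF sizes per quant level.
$ du -h ~/.dlm/store/<id>/exports/*/base.*.gguf
```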
## imatrix-calibrated quantization

If your store has a replay corpus with enough signal (Sprint 11.6),
the export runner automatically builds an imatrix from it and passes
`--imatrix` to `llama-quantize`. This gives noticeable quality
improvements on `Q4_K_M` and below without changing the API.

Opt out with `--no-imatrix` if you'd rather have a static quant for
comparison.
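A minimal opt-out sketch for an A/B comparison (the names are illustrative, and
how the same-quant GGUF cache treats the two variants is not covered on this
page, so verify that on your setup):

```sh
# Default: imatrix-calibrated quant, when the store has a usable replay corpus.
$ uv run dlm export tutor.dlm --quant Q4_K_M --name my-tutor

# Opt out of calibration to get a static quant to compare against.
$ uv run dlm export tutor.dlm --quant Q4_K_M --name my-tutor-static --no-imatrix
```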
## Just produce GGUFs, skip Ollama

```sh
$ uv run dlm export tutor.dlm --quant Q4_K_M --skip-ollama
```

Useful on CI runners without the Ollama daemon installed. The GGUFs
land in `exports/Q4_K_M/`; wire them into your own runtime.
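One way to wire them in is the vendored `llama.cpp` HTTP server. This is a
sketch, assuming your `llama.cpp` build produced `llama-server` alongside
`llama-quantize` and that it accepts a `--lora` adapter:

```sh
# Serve the exported base + adapter with the vendored llama.cpp server.
$ vendor/llama.cpp/build/bin/llama-server \
    -m ~/.dlm/store/<id>/exports/Q4_K_M/base.Q4_K_M.gguf \
    --lora ~/.dlm/store/<id>/exports/Q4_K_M/adapter.gguf \
    --port 8080
```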
## Other runtime targets

Once the basic GGUF/Ollama flow is familiar, the same store can export to:

- `--target llama-server` for a vendored `llama.cpp` HTTP server
- `--target vllm` for HF-snapshot + LoRA-module serving
- `--target mlx-serve` for Apple Silicon text serving through `mlx_lm.server`

Those targets have different prerequisites and artifact layouts, so they live
in the [multi-target export cookbook](../cookbook/multi-target-export.md)
instead of this first-run page.
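For orientation only, the invocation shape is the same command with a
`--target` flag; treat everything beyond the flag itself as the cookbook's
territory:

```sh
# Illustrative shape only; prerequisites and extra flags per target are in the cookbook.
$ uv run dlm export tutor.dlm --target llama-server
```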
## Next

Want to send the whole training history to a friend? The
[Sharing with pack](../cookbook/sharing-with-pack.md) cookbook shows
the `dlm pack` / `dlm unpack` round trip.