# First export

`dlm export` converts the base + adapter into GGUF files, writes a
Modelfile with an explicit Go `text/template` (no fuzzy matching),
registers the model with `ollama create`, and runs a smoke prompt.

That is still the default path, but it is no longer the only one. Sprint 41
also adds local runtime targets such as `llama-server`, `vllm`, and
`mlx-serve`; see the [multi-target export cookbook](../cookbook/multi-target-export.md)
once you want an OpenAI-compatible local server instead of an Ollama model.

## Prerequisites

- `vendor/llama.cpp` submodule is built:

  ```sh
  $ scripts/bump-llama-cpp.sh build
  ```

  This compiles `llama-quantize` and `llama-imatrix` under
  `vendor/llama.cpp/build/bin/`.

- [Ollama](https://ollama.com/) is installed and its daemon is running.
  `dlm doctor` reports the minimum version.

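A quick way to confirm both prerequisites before the first export (output omitted; `dlm doctor` also reports the required Ollama version):

```sh
# Both binaries should exist after the build step above.
$ ls vendor/llama.cpp/build/bin/llama-quantize vendor/llama.cpp/build/bin/llama-imatrix
$ uv run dlm doctor
```
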
## Export

```sh
$ uv run dlm export tutor.dlm --quant Q4_K_M --name my-tutor
export: preflight ok
export: base.Q4_K_M.gguf (47 MiB)
export: adapter.gguf (3 MiB)
export: Modelfile written; ollama create my-tutor:latest
export: smoke: "Hi!" → "Hello! How can I help?"
manifest: exports[-1] recorded at ~/.dlm/store/01KC…/
```

Under the hood:

1. The export **preflight** (Sprint 11) checks the adapter config
   matches the base architecture, asserts the tokenizer vocab agrees
   with the base, validates the chat template, and confirms the
   adapter wasn't QLoRA-trained (pitfall #3 — QLoRA merge needs
   `--dequantize`).
2. The base model is converted to GGUF and quantized via
   `llama-quantize`. The GGUF is cached under
   `~/.dlm/store/<id>/exports/Q4_K_M/base.Q4_K_M.gguf` — subsequent
   exports at the same quant reuse the file.
3. The LoRA adapter is converted to `adapter.gguf`.
4. A `Modelfile` is emitted with `FROM`, `ADAPTER`, and an explicit
   `TEMPLATE "..."` directive (Sprint 12); a sketch of its shape follows
   this list. Ollama will **not** fuzzy-match the template — the exact
   Go template for the base's dialect is committed.
5. `ollama create <name>:latest` registers the model under the Ollama
   daemon's control.
6. A smoke prompt runs; the first line of output is recorded in
   `manifest.exports[-1].smoke_output_first_line`.

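For reference, the emitted `Modelfile` looks roughly like the sketch below. It is illustrative only: the file's location, the relative GGUF paths, and the `TEMPLATE` body all depend on the store layout and the base model's chat dialect.

```sh
# Illustrative shape only; the real TEMPLATE text is the exact Go template
# for the base's dialect, written by the export step.
$ cat Modelfile
FROM ./base.Q4_K_M.gguf
ADAPTER ./adapter.gguf
TEMPLATE """{{ if .System }}{{ .System }}
{{ end }}{{ .Prompt }}"""
```

Once `ollama create` has run, `ollama run my-tutor "Hi!"` reproduces the smoke prompt by hand.
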
## Quant levels

| Quant | Size | Quality | When to use |
|---|---|---|---|
| `Q4_K_M` | ~50% of fp16 | Great default | General-purpose; recommended starting point. |
| `Q5_K_M` | ~60% of fp16 | Higher quality | Willing to trade more disk for fidelity. |
| `Q8_0` | ~100% of int8 | Near-lossless | Baseline for quality comparisons. |
| `F16` | 100% | No quantization | Debugging a quant-caused regression. |

See [Quantization tradeoffs](../cookbook/quantization-tradeoffs.md) for
a deeper dive.

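Switching quant levels is just a flag change. For example, to keep a near-lossless baseline next to the default (the `--name` value here is arbitrary):

```sh
$ uv run dlm export tutor.dlm --quant Q8_0 --name my-tutor-q8
```
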
## imatrix-calibrated quantization

If your store has a replay corpus with enough signal (Sprint 11.6),
the export runner automatically builds an imatrix from it and passes
`--imatrix` to `llama-quantize`. This gives noticeable quality
improvements on `Q4_K_M` and below without changing the API.

Opt out with `--no-imatrix` if you'd rather have a static quant for
comparison.

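Under the hood this is the standard llama.cpp two-step flow. A rough manual equivalent, with illustrative file names, looks like:

```sh
# 1. Build an importance matrix from calibration text.
$ vendor/llama.cpp/build/bin/llama-imatrix -m base.F16.gguf -f replay_corpus.txt -o imatrix.dat
# 2. Quantize with the imatrix applied.
$ vendor/llama.cpp/build/bin/llama-quantize --imatrix imatrix.dat base.F16.gguf base.Q4_K_M.gguf Q4_K_M
```
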
## Just produce GGUFs, skip Ollama

```sh
$ uv run dlm export tutor.dlm --quant Q4_K_M --skip-ollama
```

Useful on CI runners without the Ollama daemon installed. The GGUFs
land in `exports/Q4_K_M/`; wire them into your own runtime.

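For example, if your vendored build also includes `llama-server`, the two GGUFs can be served directly (flag spelling may vary across llama.cpp versions; paths follow the layout above):

```sh
$ vendor/llama.cpp/build/bin/llama-server \
    -m exports/Q4_K_M/base.Q4_K_M.gguf \
    --lora exports/Q4_K_M/adapter.gguf \
    --port 8080
```
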
## Other runtime targets

Once the basic GGUF/Ollama flow is familiar, the same store can export to:

- `--target llama-server` for a vendored `llama.cpp` HTTP server
- `--target vllm` for HF-snapshot + LoRA-module serving
- `--target mlx-serve` for Apple Silicon text serving through `mlx_lm.server`

Those targets have different prerequisites and artifact layouts, so they live
in the [multi-target export cookbook](../cookbook/multi-target-export.md)
instead of this first-run page.

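The invocation shape stays the same; only the target changes (a sketch; per-target flags and prerequisites are documented in the cookbook):

```sh
$ uv run dlm export tutor.dlm --target llama-server --quant Q4_K_M
```
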
## Next

Want to send the whole training history to a friend? The
[Sharing with pack](../cookbook/sharing-with-pack.md) cookbook shows
the `dlm pack` / `dlm unpack` round trip.