# Multi-target export

`dlm export` is no longer just an Ollama registration path. The same
trained store can now emit local runtime artifacts for four targets:

- `ollama` for managed local registration plus the existing Modelfile flow
- `llama-server` for GGUF-backed OpenAI-compatible HTTP serving via vendored
  `llama.cpp`
- `vllm` for HF-snapshot plus LoRA-module serving on machines that can run
  `vllm`
- `mlx-serve` for Apple Silicon text serving through `mlx_lm.server`

Use this when you want one training loop but different local runtimes for
prompting, evaluation harnesses, agents, or deployment experiments.

## Quick map

| Target | Best for | Artifact shape | Smoke path |
|---|---|---|---|
| `ollama` | Easiest local chat loop | GGUF + `Modelfile` + local registration | existing Ollama smoke |
| `llama-server` | GGUF-backed OpenAI-compatible server | `base.<quant>.gguf` + `adapter.gguf` + `chat-template.jinja` + `llama-server_launch.sh` | shared HTTP smoke |
| `vllm` | HF-snapshot + LoRA serving on supported hosts | `vllm_launch.sh` + `vllm_config.json` + staged adapters | shared HTTP smoke |
| `mlx-serve` | Apple Silicon text serving without GGUF conversion | `mlx_serve_launch.sh` + staged MLX adapter dir | shared HTTP smoke |

## Prerequisites

### Ollama

```sh
brew install ollama
```

### llama-server

```sh
scripts/bump-llama-cpp.sh build --with-server
```

That compiles the vendored `llama-server` binary alongside the GGUF tooling.

### vLLM

Install a compatible `vllm` runtime in the environment you plan to launch
from. DLM writes the launch/config artifacts, but it does not bundle the
server runtime.

On Apple Silicon, the generated `vllm` launch path is deliberately cautious:

- `VLLM_METAL_USE_PAGED_ATTENTION=0`
- `VLLM_METAL_MEMORY_FRACTION=auto`
- `--max-model-len` capped to the document's `training.sequence_len`

Those defaults exist to avoid the Metal OOM / hang pattern that shows up when
`vllm-metal` blindly asks for the base model's full context window.
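
If you want to confirm which defaults the exporter actually wrote for your
store, a quick grep of the generated launch script is enough (a minimal
check, assuming the defaults land in the script as plain environment
assignments and flags):

```sh
# Show the Metal-related environment defaults and the context-length cap
# that the exporter baked into the generated vllm launch script.
grep -E 'VLLM_METAL|max-model-len' ~/.dlm/store/<dlm_id>/exports/vllm/vllm_launch.sh
```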

### MLX-serve

```sh
uv sync --extra mlx
```

`mlx-serve` is Apple Silicon only. DLM refuses it on CUDA, ROCm, and CPU-only
hosts, and this Sprint 41 slice only supports text bases on that target.

## Common exports

### Ollama

```sh
uv run dlm export tutor.dlm --target ollama --name my-tutor
```

This is the classic DLM path: GGUF conversion, explicit Go-template
`Modelfile`, optional registration, and an Ollama smoke prompt.
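
If registration ran, the result is a normal local Ollama model, so a
follow-up prompt is just the standard CLI (assuming the export registered it
under the `--name` you passed):

```sh
# Prompt the freshly registered model through the normal Ollama CLI.
ollama run my-tutor "Give me a one-sentence summary of what you were tuned for."
```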

### llama-server

```sh
uv run dlm export tutor.dlm --target llama-server
bash ~/.dlm/store/<dlm_id>/exports/Q4_K_M/llama-server_launch.sh
```

This reuses the GGUF export artifacts and adds:

- `chat-template.jinja`
- `llama-server_launch.sh`
- `target: "llama-server"` in `export_manifest.json`

The launch script binds `127.0.0.1` and speaks `/v1/chat/completions`.
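
Once the launch script is running, any OpenAI-compatible client can talk to
it. A minimal curl check (the port is a placeholder; use whatever port your
launch script bound, and the `model` value is likewise a placeholder):

```sh
# One chat-completion round trip against the local llama-server.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```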

### vLLM

```sh
uv run dlm export tutor.dlm --target vllm
bash ~/.dlm/store/<dlm_id>/exports/vllm/vllm_launch.sh
```

This path stages local LoRA modules and writes:

- `vllm_launch.sh`
- `vllm_config.json`
- `exports/vllm/adapters/...`

Flags that only matter to GGUF or Ollama are ignored with a banner:
`--quant`, `--merged`, `--dequantize`, `--no-template`, `--skip-ollama`,
`--no-imatrix`, `--draft`, `--no-draft`.
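
After launching `vllm_launch.sh`, the staged LoRA modules show up as extra
model IDs on the OpenAI-compatible endpoint, so listing the models is a quick
way to see what got registered (a sketch, assuming a placeholder port 8000
and `jq` on the PATH):

```sh
# List the model IDs the vLLM server exposes; --lora-modules entries appear
# alongside the base model ID.
curl -s http://127.0.0.1:8000/v1/models | jq -r '.data[].id'
```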

### MLX-serve

```sh
uv run dlm export tutor.dlm --target mlx-serve
bash ~/.dlm/store/<dlm_id>/exports/mlx-serve/mlx_serve_launch.sh
```

This path stages an MLX-loadable adapter directory and writes:

- `mlx_serve_launch.sh`
- `exports/mlx-serve/adapter/` or one named adapter directory
- `target: "mlx-serve"` in `export_manifest.json`

`mlx-serve` also ignores the GGUF/Ollama-only flags above, plus `--name`.
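
To see exactly what was staged before launching, list the target directory;
it should contain the launch script plus the adapter directory named above:

```sh
# Inspect the staged mlx-serve export artifacts.
ls ~/.dlm/store/<dlm_id>/exports/mlx-serve/
```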

## Multi-adapter behavior

The runtime targets split into two families:

- `ollama` and `llama-server` can reuse the GGUF weighted-merge path for
  `--adapter-mix`
- `vllm` and `mlx-serve` work from local adapter directories

For `vllm`:

- single-adapter docs export one staged module
- multi-adapter docs without `--adapter` export every named adapter as a
  `--lora-modules` list
- `--adapter-mix` exports the staged composite adapter instead

For `mlx-serve`:

- single-adapter docs export the current flat adapter
- multi-adapter docs must choose one adapter with `--adapter`, or pass
  `--adapter-mix` to export the staged composite adapter

That "one adapter at a time" rule is intentional: this target is a simple
local-serving path, not a dynamic multi-LoRA router.
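
As a concrete sketch of the two families above (the adapter name is a
placeholder; substitute one of the named adapters from your document):

```sh
# vllm, multi-adapter doc, no --adapter: every named adapter is exported
# and listed for --lora-modules.
uv run dlm export tutor.dlm --target vllm

# mlx-serve, multi-adapter doc: pick exactly one adapter to stage.
uv run dlm export tutor.dlm --target mlx-serve --adapter <adapter-name>
```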

## Smoke behavior

All three HTTP targets use the shared OpenAI-compatible smoke harness:

1. reserve a loopback port
2. launch the target-specific server command
3. poll `/v1/models`
4. POST `/v1/chat/completions`
5. record the first non-empty line in the store manifest

Skip it with `--no-smoke` when the runtime is not installed or you want the
artifacts only.
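
To reproduce the HTTP part of that sequence by hand against an
already-running export, the checks look roughly like this (a sketch with a
placeholder port; extracting the reply text assumes the usual OpenAI
response shape):

```sh
PORT=8080  # placeholder: use the port your launch script bound

# Step 3: poll until the server answers /v1/models.
until curl -sf "http://127.0.0.1:${PORT}/v1/models" > /dev/null; do sleep 1; done

# Step 4: send one chat completion and pull out the reply text.
curl -s "http://127.0.0.1:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "ping"}]}' \
  | jq -r '.choices[0].message.content'
```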

## Inspecting what got written

Every export writes `export_manifest.json` under its target directory. The
important fields are:

- `target`
- `quant`
- `artifacts`
- `adapter_version`
- `base_model_hf_id`
- `base_model_revision`

The per-store `manifest.json` also gets an appended `exports[-1]` row with the
same `target`, plus the first line of the smoke response when a smoke test ran.
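
A quick way to read those fields without opening the file by hand (assuming
`jq`; the path below follows the llama-server example, other targets keep
their manifest in their own export subdirectory):

```sh
# Pull the provenance fields out of an export manifest.
jq '{target, quant, adapter_version, base_model_hf_id, base_model_revision}' \
  ~/.dlm/store/<dlm_id>/exports/Q4_K_M/export_manifest.json
```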

See [Export manifest](../format/export-manifest.md) for the exact schema.