# Multi-target export

`dlm export` is no longer just an Ollama registration path. The same
trained store can now emit local runtime artifacts for four targets:

- `ollama` for managed local registration plus the existing Modelfile flow
- `llama-server` for GGUF-backed OpenAI-compatible HTTP serving via vendored `llama.cpp`
- `vllm` for HF-snapshot plus LoRA-module serving on machines that can run `vllm`
- `mlx-serve` for Apple Silicon text serving through `mlx_lm.server`

Use this when you want one training loop but different local runtimes for prompting, evaluation harnesses, agents, or deployment experiments.

## Quick map

| Target | Best for | Artifact shape | Smoke path |
|---|---|---|---|
| `ollama` | Easiest local chat loop | GGUF + `Modelfile` + local registration | existing Ollama smoke |
| `llama-server` | GGUF-backed OpenAI-compatible server | `base.<quant>.gguf` + `adapter.gguf` + `chat-template.jinja` + `llama-server_launch.sh` | shared HTTP smoke |
| `vllm` | HF-snapshot + LoRA serving on supported hosts | `vllm_launch.sh` + `vllm_config.json` + staged adapters | shared HTTP smoke |
| `mlx-serve` | Apple Silicon text serving without GGUF conversion | `mlx_serve_launch.sh` + staged MLX adapter dir | shared HTTP smoke |

## Prerequisites

### Ollama

```sh
brew install ollama
```

### llama-server

```sh
scripts/bump-llama-cpp.sh build --with-server
```

That compiles the vendored `llama-server` binary alongside the GGUF tooling.

### vLLM

Install a compatible `vllm` runtime in the environment you plan to launch
from. DLM writes the launch/config artifacts, but it does not bundle the
server runtime.
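
One way to get a runtime in place, assuming a uv-managed environment (pick the vLLM build that matches your platform):

```sh
# Illustrative install only; check vLLM's install docs for your hardware.
uv pip install vllm
```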

On Apple Silicon, the generated `vllm` launch path is deliberately cautious:

- `VLLM_METAL_USE_PAGED_ATTENTION=0`
- `VLLM_METAL_MEMORY_FRACTION=auto`
- `--max-model-len` capped to the document's `training.sequence_len`

Those defaults exist to avoid the Metal OOM / hang pattern that shows up when
`vllm-metal` blindly asks for the base model's full context window.
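
For orientation only, the effect of those defaults is roughly the sketch below. This is not the literal contents of the generated `vllm_launch.sh` (that file is authoritative), and `4096` stands in for whatever `training.sequence_len` your document uses:

```sh
# Rough sketch of the cautious launch behavior, not the generated script.
export VLLM_METAL_USE_PAGED_ATTENTION=0
export VLLM_METAL_MEMORY_FRACTION=auto

# Cap the context window to the document's training length instead of
# inheriting the base model's full window (4096 is a stand-in value).
vllm serve <base_model_hf_id> --max-model-len 4096
```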

### MLX-serve

```sh
uv sync --extra mlx
```

`mlx-serve` is Apple Silicon only. DLM refuses it on CUDA, ROCm, and CPU-only
hosts, and this Sprint 41 slice only supports text bases on that target.

## Common exports

### Ollama

```sh
uv run dlm export tutor.dlm --target ollama --name my-tutor
```

This is the classic DLM path: GGUF conversion, explicit Go-template
`Modelfile`, optional registration, and an Ollama smoke prompt.
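
After a successful registration, the result is a normal Ollama model, so the usual CLI works against it (the name matches whatever you passed to `--name`):

```sh
# Chat with the registered model.
ollama run my-tutor "Explain LoRA adapters in two sentences."
```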

### llama-server

```sh
uv run dlm export tutor.dlm --target llama-server
bash ~/.dlm/store/<dlm_id>/exports/Q4_K_M/llama-server_launch.sh
```

This reuses the GGUF export artifacts and adds:

- `chat-template.jinja`
- `llama-server_launch.sh`
- `target: "llama-server"` in `export_manifest.json`

The launch script binds `127.0.0.1` and speaks `/v1/chat/completions`.
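
That means any OpenAI-compatible client can hit it once the launch script is running. A minimal curl probe, assuming an illustrative port of 8080 (use whatever port the launch script reports) and an arbitrary `model` value:

```sh
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "tutor",
        "messages": [{"role": "user", "content": "Say hello in one line."}]
      }'
```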

### vLLM

```sh
uv run dlm export tutor.dlm --target vllm
bash ~/.dlm/store/<dlm_id>/exports/vllm/vllm_launch.sh
```

This path stages local LoRA modules and writes:

- `vllm_launch.sh`
- `vllm_config.json`
- `exports/vllm/adapters/...`

Flags that only matter to GGUF or Ollama are ignored with a banner:
--quant, --merged, --dequantize, --no-template, --skip-ollama,
--no-imatrix, --draft, --no-draft.

### MLX-serve

```sh
uv run dlm export tutor.dlm --target mlx-serve
bash ~/.dlm/store/<dlm_id>/exports/mlx-serve/mlx_serve_launch.sh
```

This path stages an MLX-loadable adapter directory and writes:

- `mlx_serve_launch.sh`
- `exports/mlx-serve/adapter/` or one named adapter directory
- `target: "mlx-serve"` in `export_manifest.json`

`mlx-serve` also ignores the GGUF/Ollama-only flags above, plus `--name`.
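
As with the other HTTP targets, a quick liveness check once the launch script is up is to list the served models; the port here is illustrative:

```sh
# Any JSON response means the MLX server is answering.
curl -s http://127.0.0.1:8080/v1/models
```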

## Multi-adapter behavior

The runtime targets split into two families:

- `ollama` and `llama-server` can reuse the GGUF weighted-merge path for `--adapter-mix`
- `vllm` and `mlx-serve` work from local adapter directories

For `vllm`:

- single-adapter docs export one staged module
- multi-adapter docs without `--adapter` export every named adapter as a `--lora-modules` list (see the sketch after this list)
- `--adapter-mix` exports the staged composite adapter instead
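
In the multi-adapter case, the generated launch roughly translates to vLLM's standard multi-LoRA flags. The adapter names and paths below are illustrative; the generated `vllm_launch.sh` and `vllm_config.json` are authoritative:

```sh
# Illustrative shape of the multi-LoRA serving flags, not the generated script.
vllm serve <base_model_hf_id> \
  --enable-lora \
  --lora-modules math=exports/vllm/adapters/math grammar=exports/vllm/adapters/grammar
```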

For `mlx-serve`:

- single-adapter docs export the current flat adapter
- multi-adapter docs must choose one adapter with `--adapter`, or pass `--adapter-mix` to export the staged composite adapter

That "one adapter at a time" rule is intentional: this target is a simple local-serving path, not a dynamic multi-LoRA router.

## Smoke behavior

All three HTTP targets use the shared OpenAI-compatible smoke harness:

1. reserve a loopback port
2. launch the target-specific server command
3. poll `/v1/models`
4. POST `/v1/chat/completions`
5. record the first non-empty line in the store manifest

Skip it with `--no-smoke` when the runtime is not installed or you want the
artifacts only.
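
For example, to write the vLLM artifacts on a machine that has no `vllm` runtime installed:

```sh
uv run dlm export tutor.dlm --target vllm --no-smoke
```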

## Inspecting what got written

Every export writes `export_manifest.json` under its target directory. The
important fields are:

- `target`
- `quant`
- `artifacts`
- `adapter_version`
- `base_model_hf_id`
- `base_model_revision`

The per-store `manifest.json` also gets an appended `exports[-1]` row with the
same `target` plus the smoke first line when a smoke test ran.
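
With `jq` installed, pulling those fields out is a one-liner each; the `vllm` directory below is just one example target, and `<dlm_id>` is your store id:

```sh
# Export-level manifest: target, quant, and base-model provenance.
jq '{target, quant, base_model_hf_id, base_model_revision}' \
  ~/.dlm/store/<dlm_id>/exports/vllm/export_manifest.json

# Latest export row in the per-store manifest, including the smoke first line.
jq '.exports[-1]' ~/.dlm/store/<dlm_id>/manifest.json
```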

See [Export manifest](../format/export-manifest.md) for the exact schema.