# Multi-target export

`dlm export` is no longer just an Ollama registration path. The same
trained store can now emit local runtime artifacts for four targets:

- `ollama` for managed local registration plus the existing Modelfile flow
- `llama-server` for GGUF-backed OpenAI-compatible HTTP serving via vendored
  `llama.cpp`
- `vllm` for HF-snapshot plus LoRA-module serving on machines that can run
  `vllm`
- `mlx-serve` for Apple Silicon text serving through `mlx_lm.server`

Use this when you want one training loop but different local runtimes for
prompting, evaluation harnesses, agents, or deployment experiments.

## Quick map

| Target | Best for | Artifact shape | Smoke path |
|---|---|---|---|
| `ollama` | Easiest local chat loop | GGUF + `Modelfile` + local registration | existing Ollama smoke |
| `llama-server` | GGUF-backed OpenAI-compatible server | `base.<quant>.gguf` + `adapter.gguf` + `chat-template.jinja` + `llama-server_launch.sh` | shared HTTP smoke |
| `vllm` | HF-snapshot + LoRA serving on supported hosts | `vllm_launch.sh` + `vllm_config.json` + staged adapters | shared HTTP smoke |
| `mlx-serve` | Apple Silicon text serving without GGUF conversion | `mlx_serve_launch.sh` + staged MLX adapter dir | shared HTTP smoke |

## Prerequisites

### Ollama

```sh
brew install ollama
```

### llama-server

```sh
scripts/bump-llama-cpp.sh build --with-server
```

That compiles the vendored `llama-server` binary alongside the GGUF tooling.

### vLLM

Install a compatible `vllm` runtime in the environment you plan to launch
from. DLM writes the launch/config artifacts, but it does not bundle the
server runtime.

On Apple Silicon, the generated `vllm` launch path is deliberately cautious:

- `VLLM_METAL_USE_PAGED_ATTENTION=0`
- `VLLM_METAL_MEMORY_FRACTION=auto`
- `--max-model-len` capped to the document's `training.sequence_len`

Those defaults exist to avoid the Metal OOM / hang pattern that shows up when
`vllm-metal` blindly asks for the base model's full context window.
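
If you want to confirm which defaults the exporter actually wrote for your
store, a quick grep of the generated launch script is enough (a minimal
check, assuming the defaults land in the script as plain environment
assignments and flags):

```sh
# Show the Metal-related environment defaults and the context-length cap
# that the exporter baked into the generated vllm launch script.
grep -E 'VLLM_METAL|max-model-len' ~/.dlm/store/<dlm_id>/exports/vllm/vllm_launch.sh
```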

### MLX-serve

```sh
uv sync --extra mlx
```

`mlx-serve` is Apple Silicon only. DLM refuses it on CUDA, ROCm, and CPU-only
hosts, and this Sprint 41 slice only supports text bases on that target.

## Common exports

### Ollama

```sh
uv run dlm export tutor.dlm --target ollama --name my-tutor
```

This is the classic DLM path: GGUF conversion, explicit Go-template
`Modelfile`, optional registration, and an Ollama smoke prompt.
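
If registration ran, the result is a normal local Ollama model, so a
follow-up prompt is just the standard CLI (assuming the export registered it
under the `--name` you passed):

```sh
# Prompt the freshly registered model through the normal Ollama CLI.
ollama run my-tutor "Give me a one-sentence summary of what you were tuned for."
```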

### llama-server

```sh
uv run dlm export tutor.dlm --target llama-server
bash ~/.dlm/store/<dlm_id>/exports/Q4_K_M/llama-server_launch.sh
```

This reuses the GGUF export artifacts and adds:

- `chat-template.jinja`
- `llama-server_launch.sh`
- `target: "llama-server"` in `export_manifest.json`

The launch script binds `127.0.0.1` and speaks `/v1/chat/completions`.
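
Once the launch script is running, any OpenAI-compatible client can talk to
it. A minimal curl check (the port is a placeholder; use whatever port your
launch script bound, and the `model` value is likewise a placeholder):

```sh
# One chat-completion round trip against the local llama-server.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```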

### vLLM

```sh
uv run dlm export tutor.dlm --target vllm
bash ~/.dlm/store/<dlm_id>/exports/vllm/vllm_launch.sh
```

This path stages local LoRA modules and writes:

- `vllm_launch.sh`
- `vllm_config.json`
- `exports/vllm/adapters/...`

Flags that only matter to GGUF or Ollama are ignored with a banner:
`--quant`, `--merged`, `--dequantize`, `--no-template`, `--skip-ollama`,
`--no-imatrix`, `--draft`, `--no-draft`.
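
After launching `vllm_launch.sh`, the staged LoRA modules show up as extra
model IDs on the OpenAI-compatible endpoint, so listing the models is a quick
way to see what got registered (a sketch, assuming a placeholder port 8000
and `jq` on the PATH):

```sh
# List the model IDs the vLLM server exposes; --lora-modules entries appear
# alongside the base model ID.
curl -s http://127.0.0.1:8000/v1/models | jq -r '.data[].id'
```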

### MLX-serve

```sh
uv run dlm export tutor.dlm --target mlx-serve
bash ~/.dlm/store/<dlm_id>/exports/mlx-serve/mlx_serve_launch.sh
```

This path stages an MLX-loadable adapter directory and writes:

- `mlx_serve_launch.sh`
- `exports/mlx-serve/adapter/` or one named adapter directory
- `target: "mlx-serve"` in `export_manifest.json`

`mlx-serve` also ignores the GGUF/Ollama-only flags above, plus `--name`.
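
To see exactly what was staged before launching, list the target directory;
it should contain the launch script plus the adapter directory named above:

```sh
# Inspect the staged mlx-serve export artifacts.
ls ~/.dlm/store/<dlm_id>/exports/mlx-serve/
```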

## Multi-adapter behavior

The runtime targets split into two families:

- `ollama` and `llama-server` can reuse the GGUF weighted-merge path for
  `--adapter-mix`
- `vllm` and `mlx-serve` work from local adapter directories

For `vllm`:

- single-adapter docs export one staged module
- multi-adapter docs without `--adapter` export every named adapter as a
  `--lora-modules` list
- `--adapter-mix` exports the staged composite adapter instead

For `mlx-serve`:

- single-adapter docs export the current flat adapter
- multi-adapter docs must choose one adapter with `--adapter`, or pass
  `--adapter-mix` to export the staged composite adapter

That "one adapter at a time" rule is intentional: this target is a simple
local-serving path, not a dynamic multi-LoRA router.
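
As a concrete sketch of the two families above (the adapter name is a
placeholder; substitute one of the named adapters from your document):

```sh
# vllm, multi-adapter doc, no --adapter: every named adapter is exported
# and listed for --lora-modules.
uv run dlm export tutor.dlm --target vllm

# mlx-serve, multi-adapter doc: pick exactly one adapter to stage.
uv run dlm export tutor.dlm --target mlx-serve --adapter <adapter-name>
```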

## Smoke behavior

All three HTTP targets use the shared OpenAI-compatible smoke harness:

1. reserve a loopback port
2. launch the target-specific server command
3. poll `/v1/models`
4. POST `/v1/chat/completions`
5. record the first non-empty line in the store manifest

Skip it with `--no-smoke` when the runtime is not installed or you want the
artifacts only.
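
To reproduce the HTTP part of that sequence by hand against an
already-running export, the checks look roughly like this (a sketch with a
placeholder port; extracting the reply text assumes the usual OpenAI
response shape):

```sh
PORT=8080  # placeholder: use the port your launch script bound

# Step 3: poll until the server answers /v1/models.
until curl -sf "http://127.0.0.1:${PORT}/v1/models" > /dev/null; do sleep 1; done

# Step 4: send one chat completion and pull out the reply text.
curl -s "http://127.0.0.1:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "ping"}]}' \
  | jq -r '.choices[0].message.content'
```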

## Inspecting what got written

Every export writes `export_manifest.json` under its target directory. The
important fields are:

- `target`
- `quant`
- `artifacts`
- `adapter_version`
- `base_model_hf_id`
- `base_model_revision`

The per-store `manifest.json` also gets an appended `exports[-1]` row with the
same `target`, plus the first line of the smoke response when a smoke test ran.
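
A quick way to read those fields without opening the file by hand (assuming
`jq`; the path below follows the llama-server example, other targets keep
their manifest in their own export subdirectory):

```sh
# Pull the provenance fields out of an export manifest.
jq '{target, quant, adapter_version, base_model_hf_id, base_model_revision}' \
  ~/.dlm/store/<dlm_id>/exports/Q4_K_M/export_manifest.json
```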

See [Export manifest](../format/export-manifest.md) for the exact schema.